CN113870395A - Animation video generation method, device, equipment and storage medium - Google Patents
- Publication number
- CN113870395A (application number CN202111152667.XA)
- Authority
- CN
- China
- Prior art keywords
- video
- information
- text
- human body
- initial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Processing Or Creating Images (AREA)
Abstract
The invention relates to artificial intelligence and provides a method, a device, equipment and a storage medium for generating an animation video. The method can, when a video generation request is received, acquire text information according to the video generation request; input the text information into a pre-trained video generation model to obtain an initial video; identify human body feature points of each frame of image in the initial video; generate the posture information of the user in each frame of image according to the human body feature points; if the posture information is a preset posture, adjust the posture information according to the human body feature points to obtain a second video; analyze the text information based on a pre-trained audio generation model to obtain audio information; and generate an animation video according to the second video and the audio information. The invention can improve the generation efficiency and the generation quality of animation videos. The present disclosure further relates to blockchain technology: the animation video may be stored in a blockchain.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an animation video generation method, device, equipment and storage medium.
Background
In educational scenarios for child students, teaching with animation videos can stimulate the students' interest in and enthusiasm for learning. With the development of artificial intelligence, animation video teaching has also developed. However, the current animation video generation process involves steps such as story-script writing, storyboard design, live-action shooting, drawing of illustration materials, animation production and post-editing, so the generation efficiency of a complete animation video is low; in addition, because different video producers hold different views on video production, the generation quality of the video cannot be guaranteed.
Disclosure of Invention
In view of the above, it is desirable to provide a method, an apparatus, a device and a storage medium for generating an animation video, which can improve the generation efficiency and the generation quality of the animation video.
In one aspect, the present invention provides an animation video generation method, where the animation video generation method includes:
when a video generation request is received, acquiring text information according to the video generation request;
inputting the text information into a video generation model trained in advance to obtain an initial video;
identifying human body feature points of each frame of image in the initial video;
generating the posture information of the user in each frame of image according to the human body feature points;
if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
analyzing the text information based on a pre-trained audio generation model to obtain audio information;
and generating an animation video according to the second video and the audio information.
According to the preferred embodiment of the present invention, before the text information is input into a video generation model trained in advance to obtain an initial video, the method further includes:
obtaining a plurality of video training samples, wherein each video training sample comprises a training video and a training text corresponding to the training video;
constructing a learner, wherein the learner comprises an encoding layer and a decoding layer;
performing text coding processing on the training text to obtain a text vector;
analyzing the text vector based on the coding layer to obtain the characteristic information of the training text;
analyzing the characteristic information based on the decoding layer to obtain an output vector;
mapping the training video based on a preset mapping table to obtain an image vector of the training video;
calculating the similarity between the text vector and the output vector to obtain a first similarity, and calculating the similarity between the text vector and the image vector to obtain a second similarity;
calculating the ratio of the first similarity to the second similarity to obtain a learning index of the learner;
and adjusting the network parameters in the learner until the learning index no longer increases, so as to obtain the video generation model.
According to a preferred embodiment of the present invention, the identifying the human body feature points of each frame of image in the initial video comprises:
detecting each frame of image based on a preset detector to obtain a human body area in each frame of image;
carrying out gray level processing on the human body area to obtain a plurality of pixel points of the human body area and a pixel gray level value corresponding to each pixel point;
calculating a pixel difference value of each pixel point and a preset characteristic point according to the pixel gray value and a characteristic gray value of the preset characteristic point;
determining the pixel points with the pixel difference value smaller than a preset threshold value as initial characteristic points;
constructing a coordinate system based on each frame of image, and acquiring initial coordinate information of the initial characteristic points on each frame of image;
and screening the human body characteristic points from the initial characteristic points according to the initial coordinate information.
According to a preferred embodiment of the present invention, the screening the human body feature points from the initial feature points according to the initial coordinate information includes:
for any initial feature point, calculating the feature distance between the any initial feature point and each target feature point according to the initial coordinate information, wherein the target feature points refer to the initial feature points other than the any initial feature point;
determining the characteristic distance with the minimum value as a target distance, and determining a target characteristic point corresponding to the target distance as an adjacent characteristic point of any initial characteristic point;
carrying out normal distribution processing on the target distance to obtain a probability value of the target distance;
and determining the initial characteristic point corresponding to the target distance with the probability value larger than the preset probability value as the human body characteristic point.
According to the preferred embodiment of the present invention, the generating of the pose information of the user in each frame of image according to the human body feature points includes:
acquiring coordinate information of the human body characteristic points as human body coordinate information according to the coordinate system;
acquiring any two adjacent characteristic points from the human body characteristic points as characteristic point pairs;
calculating the Euler angle of each characteristic point pair according to the human body coordinate information and the abscissa axis in the coordinate system;
and calculating the average value of the Euler angles to obtain an angle average value, and determining preset posture information corresponding to the angle average value as the posture information.
According to the preferred embodiment of the present invention, the audio generation model includes an emotion recognition network layer and a speech conversion network layer, and the analyzing the text information based on the pre-trained audio generation model to obtain the audio information includes:
analyzing the text information based on the emotion recognition network layer to obtain the text emotion of the text information;
obtaining the emotional voice characteristics of the text emotion from a voice characteristic library;
processing the text information based on the voice conversion network layer to obtain voice information, and acquiring text voice characteristics in the voice information;
and carrying out audio mixed flow processing on the text voice characteristics and the emotion voice characteristics to obtain the audio information.
According to a preferred embodiment of the present invention, the generating an animation video according to the second video and the audio information comprises:
counting the duration of the second video to obtain a first duration;
counting the duration of the audio information to obtain a second duration;
if the first duration is not equal to the second duration, acquiring information with the maximum duration from the second video and the audio information as information to be processed;
compressing the information to be processed until the time lengths of the processed second video and the processed audio information are equal;
and combining the processed second video and the processed audio information to obtain the animation video.
In another aspect, the present invention further provides an animation video generation device, including:
the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring text information according to a video generation request when the video generation request is received;
the input unit is used for inputting the text information into a pre-trained video generation model to obtain an initial video;
the identification unit is used for identifying human body characteristic points of each frame of image in the initial video;
the generating unit is used for generating the posture information of the user in each frame of image according to the human body characteristic points;
the adjusting unit is used for adjusting the posture information according to the human body characteristic points to obtain a second video if the posture information is a preset posture;
the analysis unit is used for analyzing the text information based on a pre-trained audio generation model to obtain audio information;
and the generating unit is used for generating an animation video according to the second video and the audio information.
In another aspect, the present invention further provides an electronic device, including:
a memory storing computer readable instructions; and
a processor executing computer readable instructions stored in the memory to implement the animated video generation method.
In another aspect, the present invention also provides a computer-readable storage medium, in which computer-readable instructions are stored, and the computer-readable instructions are executed by a processor in an electronic device to implement the animation video generation method.
It can be seen from the above technical solutions that the invention can rapidly generate the initial video by analyzing the text information through the video generation model, thereby improving the generation efficiency of the animation video. Further, by identifying the human body feature points, the posture information of the user in each frame of image can be accurately determined, and when the posture information is a preset posture, the posture information is adjusted, so that undesirable posture information such as the preset posture is prevented from appearing in the second video. Since good posture information can provide a certain educational effect for the user, avoiding undesirable posture information such as the preset posture in the second video improves the quality of the second video. In addition, the audio generation model can accurately generate the audio information corresponding to the text information, so that the generation quality of the animation video can be improved according to the audio information and the second video.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of the animation video generation method of the present invention.
FIG. 2 is a functional block diagram of a preferred embodiment of the animation video generation device of the present invention.
Fig. 3 is a schematic structural diagram of an electronic device implementing the animation video generation method according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flow chart of the method for generating an animation video according to the preferred embodiment of the invention. The order of the steps in the flow chart may be changed and some steps may be omitted according to different needs.
The animation video generation method can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use that knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The animation video generation method is applied to the field of intelligent education and thereby promotes the development of smart cities. The animation video generation method is applied to one or more electronic devices. An electronic device is a device capable of automatically performing numerical calculation and/or information processing according to computer readable instructions set or stored in advance, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The electronic device may be any electronic product capable of performing human-computer interaction with a user, for example, a Personal computer, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), a game machine, an interactive Internet Protocol Television (IPTV), a smart wearable device, and the like.
The electronic device may include a network device and/or a user device. Wherein the network device includes, but is not limited to, a single network electronic device, an electronic device group consisting of a plurality of network electronic devices, or a Cloud Computing (Cloud Computing) based Cloud consisting of a large number of hosts or network electronic devices.
The network in which the electronic device is located includes, but is not limited to: the internet, a wide area Network, a metropolitan area Network, a local area Network, a Virtual Private Network (VPN), and the like.
And S10, when a video generation request is received, acquiring text information according to the video generation request.
In at least one embodiment of the present invention, the triggering user of the video generation request differs according to the application scenario of the video generation request. For example, if the application scenario of the video generation request is the field of education, the triggering user of the video generation request may be a teacher or the like.
The video generation request may include, but is not limited to: text path, preset label, etc.
The text information refers to text information to be converted into video, and for example, the text information may be a lecture of a teacher.
In at least one embodiment of the present invention, the acquiring, by the electronic device, text information according to the video generation request includes:
analyzing the message of the video generation request to obtain the data information carried by the message;
extracting the text path from the data information according to the preset label;
and acquiring the text information from the text path.
The preset label is a label for indicating a path. For example, the preset tag may be storage location.
The text path can be accurately extracted through the preset label, so that the text information can be accurately acquired, and generation of a corresponding animation video is facilitated.
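Purely as an illustration of S10, the retrieval might be sketched as follows in Python, assuming a JSON request message and a hypothetical `storage_location` preset tag (the patent does not fix a message format):

```python
import json

def get_text_information(request_message: str, preset_tag: str = "storage_location") -> str:
    """Parse a video generation request and load the text to be converted."""
    data = json.loads(request_message)          # data information carried by the message
    text_path = data[preset_tag]                # extract the text path via the preset tag
    with open(text_path, encoding="utf-8") as f:
        return f.read()                         # text information, e.g. a teacher's lecture

# Example: get_text_information('{"storage_location": "/data/lessons/lesson_01.txt"}')
```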
And S11, inputting the text information into a pre-trained video generation model to obtain an initial video.
In at least one embodiment of the present invention, the video generation model refers to a model capable of converting text into video. The video generation model comprises a coding layer, a decoding layer, a preset mapping table and the like. And the preset mapping table stores the mapping relation between the pixel values and the vectors.
The initial video is a video generated after the text information is analyzed by the video generation model. No voice information is contained in the initial video.
In at least one embodiment of the present invention, before the text information is input into a video generation model trained in advance to obtain an initial video, the method further includes:
obtaining a plurality of video training samples, wherein each video training sample comprises a training video and a training text corresponding to the training video;
constructing a learner, wherein the learner comprises an encoding layer and a decoding layer;
performing text coding processing on the training text to obtain a text vector;
analyzing the text vector based on the coding layer to obtain the characteristic information of the training text;
analyzing the characteristic information based on the decoding layer to obtain an output vector;
mapping the training video based on a preset mapping table to obtain an image vector of the training video;
calculating the similarity between the text vector and the output vector to obtain a first similarity, and calculating the similarity between the text vector and the image vector to obtain a second similarity;
calculating the ratio of the first similarity to the second similarity to obtain a learning index of the learner;
and adjusting the network parameters in the learner until the learning index no longer increases, so as to obtain the video generation model.
Wherein the text vector is used to characterize the training text.
The learning index is used to evaluate the accuracy of the learner.
The network parameters comprise preset parameters in the coding layer and the decoding layer. For example, if the coding layer includes a convolutional layer, the network parameter may be the size of a convolutional kernel in the convolutional layer.
The learning index is generated according to the similarity between the training text and the prediction video and the similarity between the training text and the training video, and then the network parameters are adjusted according to the learning index, so that the representation capability of the video generation model on the text information can be improved, and the accuracy of video generation is improved.
In at least one embodiment of the present invention, a manner in which the electronic device analyzes the text information based on the video generation model is similar to a manner in which the electronic device analyzes the training text based on the learner, which is not described in detail herein.
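As a sketch of this training scheme only (not the patent's exact implementation), the learning index and parameter adjustment might look like the following, assuming cosine similarity and simple linear coding/decoding layers:

```python
import torch
import torch.nn.functional as F

class Learner(torch.nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.encoder = torch.nn.Linear(dim, dim)  # coding layer: text vector -> feature information
        self.decoder = torch.nn.Linear(dim, dim)  # decoding layer: feature information -> output vector

    def forward(self, text_vec: torch.Tensor) -> torch.Tensor:
        return self.decoder(torch.relu(self.encoder(text_vec)))

def learning_index(text_vec, output_vec, image_vec):
    first = F.cosine_similarity(text_vec, output_vec, dim=-1)   # first similarity
    second = F.cosine_similarity(text_vec, image_vec, dim=-1)   # second similarity
    return (first / second).mean()                              # ratio of first to second similarity

learner = Learner()
optimizer = torch.optim.Adam(learner.parameters(), lr=1e-4)
training_pairs = []  # fill with (text_vec, image_vec) tensors from the video training samples
for text_vec, image_vec in training_pairs:
    index = learning_index(text_vec, learner(text_vec), image_vec)
    loss = -index                        # maximize the learning index until it no longer increases
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```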
And S12, identifying the human body feature points of each frame of image in the initial video.
In at least one embodiment of the present invention, the human feature points include, but are not limited to: face key feature points, for example: the center of the pupil, etc.; hand joint points, bone joint points, and the like.
In at least one embodiment of the present invention, the electronic device identifying the human body feature points of each frame of image in the initial video comprises:
detecting each frame of image based on a preset detector to obtain a human body area in each frame of image;
carrying out gray level processing on the human body area to obtain a plurality of pixel points of the human body area and a pixel gray level value corresponding to each pixel point;
calculating a pixel difference value of each pixel point and a preset characteristic point according to the pixel gray value and a characteristic gray value of the preset characteristic point;
determining the pixel points with the pixel difference value smaller than a preset threshold value as initial characteristic points;
constructing a coordinate system based on each frame of image, and acquiring initial coordinate information of the initial characteristic points on each frame of image;
and screening the human body characteristic points from the initial characteristic points according to the initial coordinate information.
Wherein the preset detector may be used to identify the human body information in the image.
The preset feature points comprise hand joint points, skeleton joint points and the like. The characteristic gray value can be determined according to pixel information corresponding to preset characteristic points of a plurality of preset users.
The preset threshold value can be set according to requirements.
The coordinate system includes an abscissa axis and an ordinate axis.
By detecting each frame of image with the preset detector, the interference of background information on the human body feature points in each frame of image can be rejected, which improves the identification accuracy of the human body feature points; the number of pixel points to be analyzed is also reduced, which improves the recognition efficiency of the human body feature points. Further, by analyzing the pixel gray values against the feature gray values, the initial feature points can be determined quickly, and the accuracy of determining the human body feature points can then be improved according to the initial coordinate information of the initial feature points.
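As a rough Python sketch of this pixel-level selection (the detector is omitted, and the feature gray value and threshold are placeholders, not values given by the patent):

```python
import cv2
import numpy as np

def initial_feature_points(body_region: np.ndarray,
                           feature_gray: float,
                           preset_threshold: float = 10.0) -> np.ndarray:
    """Return (x, y) coordinates of pixels whose gray value is close to the
    feature gray value of a preset feature point."""
    gray = cv2.cvtColor(body_region, cv2.COLOR_BGR2GRAY)       # gray level processing
    diff = np.abs(gray.astype(np.float32) - feature_gray)      # pixel difference values
    ys, xs = np.nonzero(diff < preset_threshold)               # keep pixels below the preset threshold
    return np.stack([xs, ys], axis=1)                          # initial coordinate information
```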
In at least one embodiment of the present invention, the electronic device, according to the initial coordinate information, screening the human body feature points from the initial feature points includes:
for any initial feature point, calculating the feature distance between the any initial feature point and each target feature point according to the initial coordinate information, wherein the target feature points refer to the initial feature points other than the any initial feature point;
determining the characteristic distance with the minimum value as a target distance, and determining a target characteristic point corresponding to the target distance as an adjacent characteristic point of any initial characteristic point;
carrying out normal distribution processing on the target distance to obtain a probability value of the target distance;
and determining the initial characteristic point corresponding to the target distance with the probability value larger than the preset probability value as the human body characteristic point.
The preset probability value may be set according to a requirement, for example, the preset probability value may be 99.44%.
The adjacent characteristic points of any initial characteristic point can be quickly determined through the analysis of the characteristic distance, the probability value of the target distance is further analyzed through the normal distribution processing of the target distance, and the human body characteristic points can be accurately screened out from the initial characteristic points.
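A minimal sketch of this screening step, assuming the probability value comes from a normal distribution fitted to all nearest-neighbor (target) distances — one plausible reading of the "normal distribution processing":

```python
import numpy as np
from scipy.stats import norm

def screen_feature_points(points: np.ndarray, preset_prob: float = 0.9944) -> np.ndarray:
    """Keep initial feature points whose target distance is likely under a
    normal distribution of all target distances."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)  # feature distances
    np.fill_diagonal(dists, np.inf)          # exclude each point's distance to itself
    target = dists.min(axis=1)               # target distance: nearest adjacent feature point
    mu, sigma = target.mean(), target.std()
    prob = norm.pdf(target, mu, sigma) / norm.pdf(mu, mu, sigma)   # normalized probability value
    return points[prob > preset_prob]
```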
And S13, generating the posture information of the user in each frame of image according to the human body feature points.
In at least one embodiment of the present invention, the posture information refers to the posture of the user in each frame of image; for example, the posture information may be head-down, head-up, and the like.
In at least one embodiment of the present invention, the generating, by the electronic device, the pose information of the user in each frame of image according to the human body feature point includes:
acquiring coordinate information of the human body characteristic points as human body coordinate information according to the coordinate system;
acquiring any two adjacent characteristic points from the human body characteristic points as characteristic point pairs;
calculating the Euler angle of each characteristic point pair according to the human body coordinate information and the abscissa axis in the coordinate system;
and calculating the average value of the Euler angles to obtain an angle average value, and determining preset posture information corresponding to the angle average value as the posture information.
The feature point pairs are obtained by acquiring any two adjacent feature points from the human body feature points, where two feature points are adjacent when the feature distance between them is the smallest. For example, given human body feature points A, B, C and D, if the feature distance between A and B is 5, the feature distance between A and C is 2, and the feature distance between A and D is 3, then C is the adjacent feature point of A.
By calculating the Euler angle of every two adjacent feature points, interference on the posture information from distant human body feature points can be avoided, thereby improving the accuracy of determining the posture information.
Specifically, the posture information may be determined according to a mapping table of angles and preset posture information. Wherein the preset gesture information may be annotated by a user.
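For illustration, one way to realize S13 in Python, assuming the "Euler angle" of a pair is the planar angle between the pair's connecting line and the abscissa axis, and that a hypothetical table maps preset posture labels to angles:

```python
import numpy as np

def posture_from_pairs(pairs: list[tuple[np.ndarray, np.ndarray]],
                       posture_table: dict[str, float]) -> tuple[str, float]:
    """Map the mean angle of adjacent feature point pairs to preset posture info."""
    angles = [np.degrees(np.arctan2(b[1] - a[1], b[0] - a[0])) for a, b in pairs]
    angle_mean = float(np.mean(angles))                    # angle average value
    # pick the preset posture whose table angle is closest to the mean
    label = min(posture_table, key=lambda k: abs(posture_table[k] - angle_mean))
    return label, angle_mean

# e.g. posture_table = {"standard": 0.0, "head-down": -30.0, "head-up": 30.0}
```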
And S14, if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video.
In at least one embodiment of the present invention, the preset gesture may include, but is not limited to: head-lowering, head-raising, etc.
The user posture of each frame of image in the second video is not the preset posture.
In at least one embodiment of the present invention, the adjusting, by the electronic device, the posture information according to the human body feature point to obtain a second video includes:
acquiring a posture angle of a standard posture from a posture mapping table;
comparing the angle mean to the pose angle;
if the angle mean value is larger than the posture angle, comparing the Euler angle with the angle mean value;
acquiring a characteristic point pair corresponding to the Euler angle with the value larger than the angle mean value as a characteristic point to be processed;
and adjusting the position of the feature point to be processed in the image until the adjusted posture information is not a preset posture, and obtaining the second video.
The posture mapping table stores mapping relationships between a plurality of pieces of preset posture information and angles, wherein the preset posture information includes the standard posture and bad postures such as head-lowering. The preset posture information in the posture mapping table can be labeled by a user, and the angles in the posture mapping table are calculated in a manner similar to the angle mean value of each frame of image, which is not described again here.
Through the comparison of the angle mean value and the posture angle and the comparison of the Euler angle and the angle mean value, the human characteristic points influencing the posture information can be quickly determined, and then adjustment is carried out, so that the quality of the second video is improved.
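A hedged sketch of the S14 adjustment; the nudge-toward-partner correction below is only an illustrative heuristic, since the patent does not specify the exact repositioning rule:

```python
import numpy as np

def adjust_posture(pairs: list[tuple[np.ndarray, np.ndarray]],
                   euler_angles: list[float],
                   angle_mean: float,
                   posture_angle: float,
                   step: float = 1.0) -> None:
    """Move one point of each out-of-range pair to reduce its angle, in place."""
    if angle_mean <= posture_angle:          # posture already within the standard range
        return
    for (a, b), angle in zip(pairs, euler_angles):
        if angle > angle_mean:               # feature point pair to be processed
            b += step * (a - b) / np.linalg.norm(a - b)   # nudge b toward a to flatten the pair
```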
And S15, analyzing the text information based on the pre-trained audio generation model to obtain audio information.
In at least one embodiment of the invention, the audio generation model is used to convert the textual information to speech.
The audio information refers to a voice corresponding to the text information.
In at least one embodiment of the present invention, the audio generation model includes an emotion recognition network layer and a speech conversion network layer, and the analyzing, by the electronic device, the text information based on the pre-trained audio generation model to obtain the audio information includes:
analyzing the text information based on the emotion recognition network layer to obtain the text emotion of the text information;
obtaining the emotional voice characteristics of the text emotion from a voice characteristic library;
processing the text information based on the voice conversion network layer to obtain voice information, and acquiring text voice characteristics in the voice information;
and carrying out audio mixed flow processing on the text voice characteristics and the emotion voice characteristics to obtain the audio information.
And the emotion recognition network layer is used for analyzing the emotion corresponding to the text. The textual emotion may be happy, sad, etc.
The voice conversion network layer is used for converting the text into voice.
And audio mixed flow processing is carried out on the text voice characteristic and the emotion voice characteristic, so that the audio information contains the text emotion, and the interestingness of the audio information is improved.
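Structurally, the two network layers might be wired together as below; `EmotionRecognizer`, `TextToSpeech` and the mixing callable are hypothetical stand-ins, since the patent does not name concrete networks:

```python
from dataclasses import dataclass
from typing import Callable, Protocol

class EmotionRecognizer(Protocol):
    def predict(self, text: str) -> str: ...          # emotion recognition network layer

class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...     # speech conversion network layer

@dataclass
class AudioGenerationModel:
    emotion_net: EmotionRecognizer
    tts_net: TextToSpeech
    feature_library: dict[str, bytes]                 # text emotion -> emotional voice features
    mix: Callable[[bytes, bytes], bytes]              # audio mixed-flow processing

    def generate(self, text: str) -> bytes:
        emotion = self.emotion_net.predict(text)             # e.g. "happy" or "sad"
        emotion_features = self.feature_library[emotion]     # emotional voice features
        speech = self.tts_net.synthesize(text)               # voice information / text voice features
        return self.mix(speech, emotion_features)            # the audio information
```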
And S16, generating an animation video according to the second video and the audio information.
In at least one embodiment of the present invention, the animation video refers to a video including the audio information and the second video.
It is emphasized that, in order to further ensure the privacy and security of the animation video, the animation video may also be stored in a node of a blockchain.
In at least one embodiment of the present invention, the electronic device generating an animation video according to the second video and the audio information includes:
counting the duration of the second video to obtain a first duration;
counting the duration of the audio information to obtain a second duration;
if the first duration is not equal to the second duration, acquiring information with the maximum duration from the second video and the audio information as information to be processed;
compressing the information to be processed until the time lengths of the processed second video and the processed audio information are equal;
and combining the processed second video and the processed audio information to obtain the animation video.
By the above embodiment, when the first duration is not equal to the second duration, the information with the largest duration is compressed, so that the durations of the processed second video and the processed audio information can be ensured to be equal, and the processed second video and the processed audio information can be conveniently and directly merged, thereby improving the generation efficiency of the animation video.
Specifically, the merging, by the electronic device, the processed second video and the processed audio information to obtain the animation video includes:
acquiring sound track information of the processed second video on sound track dimension;
and replacing the sound track information with the processed audio information to obtain the animation video.
The animation video can be generated quickly by replacing the soundtrack information with the processed audio information.
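Assuming, purely for illustration, the moviepy library (the patent names no library, and applying `speedx` to the audio clip relies on moviepy's generic time transforms), the duration matching and sound-track replacement of S16 might be sketched as:

```python
from moviepy.editor import AudioFileClip, VideoFileClip, vfx

def merge_video_and_audio(video_path: str, audio_path: str, out_path: str) -> None:
    video = VideoFileClip(video_path)        # second video (first duration)
    audio = AudioFileClip(audio_path)        # audio information (second duration)
    if video.duration > audio.duration:      # compress whichever stream is longer
        video = video.fx(vfx.speedx, video.duration / audio.duration)
    elif audio.duration > video.duration:
        audio = audio.fx(vfx.speedx, audio.duration / video.duration)
    # replace the sound track of the second video with the generated audio
    video.set_audio(audio).write_videofile(out_path)
```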
It can be seen from the above technical solutions that the invention can rapidly generate the initial video by analyzing the text information through the video generation model, thereby improving the generation efficiency of the animation video. Further, by identifying the human body feature points, the posture information of the user in each frame of image can be accurately determined, and when the posture information is a preset posture, the posture information is adjusted, so that undesirable posture information such as the preset posture is prevented from appearing in the second video. Since good posture information can provide a certain educational effect for the user, avoiding undesirable posture information such as the preset posture in the second video improves the quality of the second video. In addition, the audio generation model can accurately generate the audio information corresponding to the text information, so that the generation quality of the animation video can be improved according to the audio information and the second video.
Fig. 2 is a functional block diagram of a preferred embodiment of the animation video generation device according to the invention. The animation video generation device 11 includes an acquisition unit 110, an input unit 111, a recognition unit 112, a generation unit 113, an adjustment unit 114, an analysis unit 115, a construction unit 116, an encoding unit 117, a mapping unit 118, and a calculation unit 119. The module/unit referred to herein is a series of computer readable instruction segments that can be accessed by the processor 13 and perform a fixed function and that are stored in the memory 12. In the present embodiment, the functions of the modules/units will be described in detail in the following embodiments.
When receiving a video generation request, the acquisition unit 110 acquires text information according to the video generation request.
In at least one embodiment of the present invention, the triggering user of the video generation request differs according to the application scenario of the video generation request. For example, if the application scenario of the video generation request is the field of education, the triggering user of the video generation request may be a teacher or the like.
The video generation request may include, but is not limited to: text path, preset label, etc.
The text information refers to text information to be converted into video, and for example, the text information may be a lecture of a teacher.
In at least one embodiment of the present invention, the obtaining unit 110 obtains the text information according to the video generation request, including:
analyzing the message of the video generation request to obtain the data information carried by the message;
extracting the text path from the data information according to the preset label;
and acquiring the text information from the text path.
The preset label is a label for indicating a path. For example, the preset tag may be storage location.
The text path can be accurately extracted through the preset label, so that the text information can be accurately acquired, and generation of a corresponding animation video is facilitated.
The input unit 111 inputs the text information into a video generation model trained in advance to obtain an initial video.
In at least one embodiment of the present invention, the video generation model refers to a model capable of converting text into video. The video generation model comprises a coding layer, a decoding layer, a preset mapping table and the like. And the preset mapping table stores the mapping relation between the pixel values and the vectors.
The initial video is a video generated after the text information is analyzed by the video generation model. No voice information is contained in the initial video.
In at least one embodiment of the present invention, before the text information is input into a video generation model trained in advance to obtain an initial video, the obtaining unit 110 obtains a plurality of video training samples, where each video training sample includes a training video and a training text corresponding to the training video;
the construction unit 116 constructs a learner, wherein the learner comprises an encoding layer and a decoding layer;
the encoding unit 117 performs text encoding processing on the training text to obtain a text vector;
the analysis unit 115 analyzes the text vector based on the coding layer to obtain feature information of the training text;
the analysis unit 115 analyzes the feature information based on the decoding layer to obtain an output vector;
the mapping unit 118 performs mapping processing on the training video based on a preset mapping table to obtain an image vector of the training video;
the calculating unit 119 calculates a similarity between the text vector and the output vector to obtain a first similarity, and calculates a similarity between the text vector and the image vector to obtain a second similarity;
the calculating unit 119 calculates a ratio of the first similarity to the second similarity to obtain a learning index of the learner;
the adjusting unit 114 adjusts the network parameters in the learner until the learning index does not increase any more, resulting in the video generation model.
Wherein the text vector is used to characterize the training text.
The learning index is used to evaluate the accuracy of the learner.
The network parameters comprise preset parameters in the coding layer and the decoding layer. For example, if the coding layer includes a convolutional layer, the network parameter may be the size of a convolutional kernel in the convolutional layer.
The learning index is generated according to the similarity between the training text and the prediction video and the similarity between the training text and the training video, and then the network parameters are adjusted according to the learning index, so that the representation capability of the video generation model on the text information can be improved, and the accuracy of video generation is improved.
In at least one embodiment of the present invention, a manner of analyzing the text information based on the video generation model is similar to a manner of analyzing the training text based on the learner, and details thereof are not repeated herein.
The identifying unit 112 identifies the human body feature points of each frame of image in the initial video.
In at least one embodiment of the present invention, the human feature points include, but are not limited to: face key feature points, for example: the center of the pupil, etc.; hand joint points, bone joint points, and the like.
In at least one embodiment of the present invention, the identifying unit 112 identifies the human body feature points of each frame of image in the initial video, including:
detecting each frame of image based on a preset detector to obtain a human body area in each frame of image;
carrying out gray level processing on the human body area to obtain a plurality of pixel points of the human body area and a pixel gray level value corresponding to each pixel point;
calculating a pixel difference value of each pixel point and a preset characteristic point according to the pixel gray value and a characteristic gray value of the preset characteristic point;
determining the pixel points with the pixel difference value smaller than a preset threshold value as initial characteristic points;
constructing a coordinate system based on each frame of image, and acquiring initial coordinate information of the initial characteristic points on each frame of image;
and screening the human body characteristic points from the initial characteristic points according to the initial coordinate information.
Wherein the preset detector may be used to identify the human body information in the image.
The preset feature points comprise hand joint points, skeleton joint points and the like. The characteristic gray value can be determined according to pixel information corresponding to preset characteristic points of a plurality of preset users.
The preset threshold value can be set according to requirements.
The coordinate system includes an abscissa axis and an ordinate axis.
By detecting each frame of image with the preset detector, the interference of background information on the human body feature points in each frame of image can be rejected, which improves the identification accuracy of the human body feature points; the number of pixel points to be analyzed is also reduced, which improves the recognition efficiency of the human body feature points. Further, by analyzing the pixel gray values against the feature gray values, the initial feature points can be determined quickly, and the accuracy of determining the human body feature points can then be improved according to the initial coordinate information of the initial feature points.
In at least one embodiment of the present invention, the step of the identifying unit 112 screening the human body feature points from the initial feature points according to the initial coordinate information includes:
for any initial feature point, calculating the feature distance between the any initial feature point and each target feature point according to the initial coordinate information, wherein the target feature points refer to the initial feature points other than the any initial feature point;
determining the characteristic distance with the minimum value as a target distance, and determining a target characteristic point corresponding to the target distance as an adjacent characteristic point of any initial characteristic point;
carrying out normal distribution processing on the target distance to obtain a probability value of the target distance;
and determining the initial characteristic point corresponding to the target distance with the probability value larger than the preset probability value as the human body characteristic point.
The preset probability value may be set according to a requirement, for example, the preset probability value may be 99.44%.
The adjacent characteristic points of any initial characteristic point can be quickly determined through the analysis of the characteristic distance, the probability value of the target distance is further analyzed through the normal distribution processing of the target distance, and the human body characteristic points can be accurately screened out from the initial characteristic points.
The generating unit 113 generates posture information of the user in each frame image from the human body feature points.
In at least one embodiment of the present invention, the posture information refers to the posture of the user in each frame of image; for example, the posture information may be head-down, head-up, and the like.
In at least one embodiment of the present invention, the generating unit 113 generates the pose information of the user in each frame image according to the human body feature points includes:
acquiring coordinate information of the human body characteristic points as human body coordinate information according to the coordinate system;
acquiring any two adjacent characteristic points from the human body characteristic points as characteristic point pairs;
calculating the Euler angle of each characteristic point pair according to the human body coordinate information and the abscissa axis in the coordinate system;
and calculating the average value of the Euler angles to obtain an angle average value, and determining preset posture information corresponding to the angle average value as the posture information.
The feature point pairs are obtained by acquiring any two adjacent feature points from the human body feature points, where two feature points are adjacent when the feature distance between them is the smallest. For example, given human body feature points A, B, C and D, if the feature distance between A and B is 5, the feature distance between A and C is 2, and the feature distance between A and D is 3, then C is the adjacent feature point of A.
By calculating the Euler angle of every two adjacent feature points, interference on the posture information from distant human body feature points can be avoided, thereby improving the accuracy of determining the posture information.
Specifically, the posture information may be determined according to a mapping table of angles and preset posture information. Wherein the preset gesture information may be annotated by a user.
If the posture information is a preset posture, the adjusting unit 114 adjusts the posture information according to the human body feature point to obtain a second video.
In at least one embodiment of the present invention, the preset gesture may include, but is not limited to: head-lowering, head-raising, etc.
The user posture of each frame of image in the second video is not the preset posture.
In at least one embodiment of the present invention, the adjusting unit 114 adjusts the posture information according to the human body feature point, and obtaining the second video includes:
acquiring a posture angle of a standard posture from a posture mapping table;
comparing the angle mean to the pose angle;
if the angle mean value is larger than the posture angle, comparing the Euler angle with the angle mean value;
acquiring a characteristic point pair corresponding to the Euler angle with the value larger than the angle mean value as a characteristic point to be processed;
and adjusting the position of the feature point to be processed in the image until the adjusted posture information is not a preset posture, and obtaining the second video.
The posture mapping table stores mapping relationships between a plurality of pieces of preset posture information and angles, wherein the preset posture information includes the standard posture and bad postures such as head-lowering. The preset posture information in the posture mapping table can be labeled by a user, and the angles in the posture mapping table are calculated in a manner similar to the angle mean value of each frame of image, which is not described again here.
Through the comparison of the angle mean value and the posture angle and the comparison of the Euler angle and the angle mean value, the human characteristic points influencing the posture information can be quickly determined, and then adjustment is carried out, so that the quality of the second video is improved.
The analysis unit 115 analyzes the text information based on a pre-trained audio generation model to obtain audio information.
In at least one embodiment of the invention, the audio generation model is used to convert the textual information to speech.
The audio information refers to a voice corresponding to the text information.
In at least one embodiment of the present invention, the audio generation model includes an emotion recognition network layer and a speech conversion network layer, and the analyzing unit 115 analyzes the text information based on the pre-trained audio generation model, and obtaining the audio information includes:
analyzing the text information based on the emotion recognition network layer to obtain the text emotion of the text information;
obtaining the emotional voice characteristics of the text emotion from a voice characteristic library;
processing the text information based on the voice conversion network layer to obtain voice information, and acquiring text voice characteristics in the voice information;
and carrying out audio mixed flow processing on the text voice characteristics and the emotion voice characteristics to obtain the audio information.
And the emotion recognition network layer is used for analyzing the emotion corresponding to the text. The textual emotion may be happy, sad, etc.
The voice conversion network layer is used for converting the text into voice.
And audio mixed flow processing is carried out on the text voice characteristic and the emotion voice characteristic, so that the audio information contains the text emotion, and the interestingness of the audio information is improved.
The generating unit 113 generates an animation video from the second video and the audio information.
In at least one embodiment of the present invention, the animation video refers to a video including the audio information and the second video.
It is emphasized that, in order to further ensure the privacy and security of the animation video, the animation video may also be stored in a node of a blockchain.
In at least one embodiment of the present invention, the generating unit 113 generates an animation video according to the second video and the audio information, including:
counting the duration of the second video to obtain a first duration;
counting the duration of the audio information to obtain a second duration;
if the first duration is not equal to the second duration, acquiring information with the maximum duration from the second video and the audio information as information to be processed;
compressing the information to be processed until the time lengths of the processed second video and the processed audio information are equal;
and combining the processed second video and the processed audio information to obtain the animation video.
By the above embodiment, when the first duration is not equal to the second duration, the information with the largest duration is compressed, so that the durations of the processed second video and the processed audio information can be ensured to be equal, and the processed second video and the processed audio information can be conveniently and directly merged, thereby improving the generation efficiency of the animation video.
Specifically, the generating unit 113 merges the processed second video and the processed audio information to obtain the animation video includes:
acquiring sound track information of the processed second video on sound track dimension;
and replacing the sound track information with the processed audio information to obtain the animation video.
The animation video can be generated quickly by replacing the soundtrack information with the processed audio information.
It can be seen from the above technical solutions that the invention can rapidly generate the initial video by analyzing the text information through the video generation model, thereby improving the generation efficiency of the animation video. Further, by identifying the human body feature points, the posture information of the user in each frame of image can be accurately determined, and when the posture information is a preset posture, the posture information is adjusted, so that undesirable posture information such as the preset posture is prevented from appearing in the second video. Since good posture information can provide a certain educational effect for the user, avoiding undesirable posture information such as the preset posture in the second video improves the quality of the second video. In addition, the audio generation model can accurately generate the audio information corresponding to the text information, so that the generation quality of the animation video can be improved according to the audio information and the second video.
Fig. 3 is a schematic structural diagram of an electronic device implementing the animation video generation method according to the preferred embodiment of the invention.
In one embodiment of the present invention, the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and computer readable instructions, such as an animated video generating program, stored in the memory 12 and executable on the processor 13.
It will be appreciated by a person skilled in the art that the schematic diagram is only an example of the electronic device 1 and does not constitute a limitation of the electronic device 1; the electronic device 1 may comprise more or fewer components than shown, some components may be combined, or different components may be used. For example, the electronic device 1 may further comprise an input/output device, a network access device, a bus, and the like.
The processor 13 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 13 is the operation core and control center of the electronic device 1; it connects all parts of the electronic device 1 by various interfaces and lines, and executes the operating system of the electronic device 1 as well as the various installed application programs, program codes, and the like.
Illustratively, the computer readable instructions may be partitioned into one or more modules/units that are stored in the memory 12 and executed by the processor 13 to implement the present invention. The one or more modules/units may be a series of computer readable instruction segments capable of performing specific functions, which are used for describing the execution process of the computer readable instructions in the electronic device 1. For example, the computer-readable instructions may be divided into an acquisition unit 110, an input unit 111, a recognition unit 112, a generation unit 113, an adjustment unit 114, an analysis unit 115, a construction unit 116, an encoding unit 117, a mapping unit 118, and a calculation unit 119.
The memory 12 may be used to store the computer readable instructions and/or modules, and the processor 13 implements the various functions of the electronic device 1 by running or executing the computer readable instructions and/or modules stored in the memory 12 and by invoking data stored in the memory 12. The memory 12 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the electronic device. The memory 12 may include non-volatile and volatile memories, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another storage device.
The memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a memory having a physical form, such as a memory stick, a TF Card (Trans-flash Card), or the like.
The integrated modules/units of the electronic device 1 may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the flow of the method in the above embodiments may be implemented by computer readable instructions instructing the relevant hardware; the computer readable instructions may be stored in a computer readable storage medium, and when executed by a processor, they implement the steps of the above method embodiments.
The computer readable instructions comprise computer readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), or a random access memory (RAM).
The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each containing the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
With reference to Fig. 1, the memory 12 of the electronic device 1 stores computer-readable instructions to implement an animation video generation method, and the processor 13 can execute the computer-readable instructions to implement:
when a video generation request is received, acquiring text information according to the video generation request;
inputting the text information into a video generation model trained in advance to obtain an initial video;
identifying human body feature points of each frame of image in the initial video;
generating the posture information of the user in each frame of image according to the human body feature points;
if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
analyzing the text information based on a pre-trained audio generation model to obtain audio information;
and generating an animation video according to the second video and the audio information.
Specifically, for the implementation of the computer readable instructions by the processor 13, reference may be made to the description of the relevant steps in the embodiment corresponding to Fig. 1, which is not repeated here.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The computer readable storage medium has computer readable instructions stored thereon, wherein the computer readable instructions when executed by the processor 13 are configured to implement the steps of:
when a video generation request is received, acquiring text information according to the video generation request;
inputting the text information into a video generation model trained in advance to obtain an initial video;
identifying human body feature points of each frame of image in the initial video;
generating the posture information of the user in each frame of image according to the human body feature points;
if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
analyzing the text information based on a pre-trained audio generation model to obtain audio information;
and generating an animation video according to the second video and the audio information.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. The plurality of units or devices may also be implemented by one unit or device through software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope.
Claims (10)
1. An animation video generation method, characterized by comprising:
when a video generation request is received, acquiring text information according to the video generation request;
inputting the text information into a video generation model trained in advance to obtain an initial video;
identifying human body feature points of each frame of image in the initial video;
generating the posture information of the user in each frame of image according to the human body feature points;
if the posture information is a preset posture, adjusting the posture information according to the human body feature points to obtain a second video;
analyzing the text information based on a pre-trained audio generation model to obtain audio information;
and generating an animation video according to the second video and the audio information.
2. The animation video generation method according to claim 1, wherein before inputting the text information into a pre-trained video generation model to obtain an initial video, the method further comprises:
obtaining a plurality of video training samples, wherein each video training sample comprises a training video and a training text corresponding to the training video;
constructing a learner, wherein the learner comprises an encoding layer and a decoding layer;
performing text coding processing on the training text to obtain a text vector;
analyzing the text vector based on the coding layer to obtain the characteristic information of the training text;
analyzing the characteristic information based on the decoding layer to obtain an output vector;
mapping the training video based on a preset mapping table to obtain an image vector of the training video;
calculating the similarity between the text vector and the output vector to obtain a first similarity, and calculating the similarity between the text vector and the image vector to obtain a second similarity;
calculating the ratio of the first similarity in the second similarity to obtain a learning index of the learner;
and adjusting the network parameters in the learner until the learning index no longer increases, so as to obtain the video generation model.
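As a non-limiting illustration of the learning index in claim 2 (the claim fixes neither a similarity measure nor a framework; cosine similarity and numpy are assumptions here):

```python
# Hedged sketch of claim 2's learning index: the ratio of the text/output
# similarity to the text/image similarity. Cosine similarity is an assumed
# choice; the claim does not specify the measure.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def learning_index(text_vec: np.ndarray,
                   output_vec: np.ndarray,
                   image_vec: np.ndarray) -> float:
    first_similarity = cosine(text_vec, output_vec)   # text vector vs. decoder output vector
    second_similarity = cosine(text_vec, image_vec)   # text vector vs. image vector
    return first_similarity / second_similarity       # training stops when this stops rising
```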
3. The animation video generation method according to claim 1, wherein the identifying the human body feature points of each frame of image in the initial video comprises:
detecting each frame of image based on a preset detector to obtain a human body region in each frame of image;
performing gray-level processing on the human body region to obtain a plurality of pixel points of the human body region and a pixel gray value corresponding to each pixel point;
calculating a pixel difference value between each pixel point and a preset feature point according to the pixel gray value and a feature gray value of the preset feature point;
determining the pixel points whose pixel difference value is smaller than a preset threshold as initial feature points;
constructing a coordinate system based on each frame of image, and acquiring initial coordinate information of the initial feature points in each frame of image;
and screening the human body feature points from the initial feature points according to the initial coordinate information.
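One possible reading of claim 3's gray-level matching, sketched with OpenCV (an assumed dependency; feature_gray and threshold are illustrative values, not from the claim):

```python
# Hedged sketch of claim 3: gray the detected body region, compare each pixel
# against a preset feature gray value, and keep pixels under the threshold.
import cv2
import numpy as np

def initial_feature_points(frame: np.ndarray,
                           body_box: tuple,
                           feature_gray: int = 128,
                           threshold: int = 10) -> np.ndarray:
    x, y, w, h = body_box                                  # human body region from the detector
    gray = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    diff = np.abs(gray.astype(np.int32) - feature_gray)    # pixel difference values
    ys, xs = np.nonzero(diff < threshold)                  # initial feature points
    return np.stack([xs + x, ys + y], axis=1)              # coordinates in the frame's system
```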
4. The animation video generation method according to claim 3, wherein the screening of the human body feature points from the initial feature points according to the initial coordinate information comprises:
for any one of the initial feature points, calculating the feature distance between that initial feature point and each target feature point according to the initial coordinate information, wherein the target feature points are the initial feature points other than that initial feature point;
determining the feature distance with the minimum value as a target distance, and determining the target feature point corresponding to the target distance as an adjacent feature point of that initial feature point;
performing normal distribution processing on the target distance to obtain a probability value of the target distance;
and determining the initial feature points whose target distance has a probability value greater than a preset probability value as the human body feature points.
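Claim 4's "normal distribution processing" is open to interpretation; one plausible sketch fits a normal to the nearest-neighbour (target) distances and keeps points whose distance is probable under that fit (scipy is an assumed dependency):

```python
# Hedged sketch of claim 4's screening step under the interpretation above.
import numpy as np
from scipy.stats import norm

def screen_feature_points(points: np.ndarray, preset_prob: float = 0.05) -> np.ndarray:
    """Keep initial feature points whose target (nearest-neighbour) distance is probable."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    target = dists.min(axis=1)               # target distance per initial feature point
    probs = norm.pdf(target, target.mean(), target.std())   # probability value per distance
    return points[probs > preset_prob]       # the retained human body feature points
```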
5. The animation video generation method according to claim 3, wherein the generating of the posture information of the user in each frame of image according to the human body feature points comprises:
acquiring coordinate information of the human body feature points as human body coordinate information according to the coordinate system;
acquiring any two adjacent feature points among the human body feature points as feature point pairs;
calculating the Euler angle of each feature point pair according to the human body coordinate information and the abscissa axis of the coordinate system;
and calculating the average value of the Euler angles to obtain an angle average value, and determining preset posture information corresponding to the angle average value as the posture information.
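Read in two dimensions, the "Euler angle" of a feature-point pair in claim 5 can be taken as the angle its connecting segment makes with the abscissa axis; a sketch under that assumption:

```python
# Hedged 2-D reading of claim 5: angle of each pair against the x axis,
# then the average angle used to look up the preset posture information.
import math

def pair_angle(p: tuple, q: tuple) -> float:
    """Angle in degrees between the segment p->q and the abscissa (x) axis."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0]))

def angle_average(pairs: list) -> float:
    """Average Euler angle over all adjacent feature-point pairs
    (the posture lookup table itself is not shown here)."""
    angles = [pair_angle(p, q) for p, q in pairs]
    return sum(angles) / len(angles)
```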
6. The animation video generation method according to claim 1, wherein the audio generation model comprises an emotion recognition network layer and a voice conversion network layer, and the analyzing the text information based on the pre-trained audio generation model to obtain audio information comprises:
analyzing the text information based on the emotion recognition network layer to obtain the text emotion of the text information;
obtaining the emotional voice characteristics of the text emotion from a voice characteristic library;
processing the text information based on the voice conversion network layer to obtain voice information, and acquiring text voice characteristics in the voice information;
and carrying out audio mixed flow processing on the text voice characteristics and the emotion voice characteristics to obtain the audio information.
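Claim 6's "audio mixed flow processing" is not spelled out; one minimal reading overlays the emotion voice features onto the text voice features as a weighted sum (the weight and the equal-rate assumption are illustrative):

```python
# Hedged sketch of claim 6's mixing step as a weighted overlay of two
# same-rate feature streams; the weighting scheme is an assumption.
import numpy as np

def mix_audio_features(text_feat: np.ndarray,
                       emotion_feat: np.ndarray,
                       emotion_weight: float = 0.3) -> np.ndarray:
    """Blend text voice features with emotion voice features."""
    n = min(len(text_feat), len(emotion_feat))    # align the two streams
    return (1 - emotion_weight) * text_feat[:n] + emotion_weight * emotion_feat[:n]
```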
7. The animation video generation method according to claim 1, wherein the generating an animation video according to the second video and the audio information comprises:
counting the duration of the second video to obtain a first duration;
counting the duration of the audio information to obtain a second duration;
if the first duration is not equal to the second duration, acquiring information with the maximum duration from the second video and the audio information as information to be processed;
compressing the information to be processed until the durations of the processed second video and the processed audio information are equal;
and combining the processed second video and the processed audio information to obtain the animation video.
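A sketch of claim 7's duration alignment, again assuming moviepy 1.x; time compression via speedx is one possible reading of "compressing", not a choice mandated by the claim:

```python
# Hedged sketch of claim 7: compress whichever stream is longer to the
# shorter duration, then combine. speedx's final_duration computes the factor.
from moviepy.editor import AudioFileClip, VideoFileClip
from moviepy.video.fx.speedx import speedx

def align_and_merge(video_path: str, audio_path: str, out_path: str) -> None:
    video = VideoFileClip(video_path)        # second video (first duration)
    audio = AudioFileClip(audio_path)        # audio information (second duration)
    target = min(video.duration, audio.duration)
    if video.duration > target:              # the video is the information to be processed
        video = speedx(video, final_duration=target)
    elif audio.duration > target:            # otherwise the audio is compressed
        audio = speedx(audio, final_duration=target)
    video.set_audio(audio).write_videofile(out_path, codec="libx264", audio_codec="aac")
```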
8. An animation video generation device, characterized by comprising:
an acquisition unit, configured to acquire text information according to a video generation request when the video generation request is received;
an input unit, configured to input the text information into a pre-trained video generation model to obtain an initial video;
an identification unit, configured to identify human body feature points of each frame of image in the initial video;
a generation unit, configured to generate the posture information of the user in each frame of image according to the human body feature points;
an adjustment unit, configured to adjust the posture information according to the human body feature points to obtain a second video if the posture information is a preset posture;
an analysis unit, configured to analyze the text information based on a pre-trained audio generation model to obtain audio information;
wherein the generation unit is further configured to generate an animation video according to the second video and the audio information.
9. An electronic device, characterized in that the electronic device comprises:
a memory storing computer readable instructions; and
a processor executing the computer readable instructions stored in the memory to implement the animation video generation method according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-readable instructions that are executed by a processor in an electronic device to implement the animation video generation method according to any one of claims 1 to 7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111152667.XA CN113870395A (en) | 2021-09-29 | 2021-09-29 | Animation video generation method, device, equipment and storage medium |
PCT/CN2022/071302 WO2023050650A1 (en) | 2021-09-29 | 2022-01-11 | Animation video generation method and apparatus, and device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111152667.XA CN113870395A (en) | 2021-09-29 | 2021-09-29 | Animation video generation method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113870395A true CN113870395A (en) | 2021-12-31 |
Family
ID=79000531
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111152667.XA Pending CN113870395A (en) | 2021-09-29 | 2021-09-29 | Animation video generation method, device, equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113870395A (en) |
WO (1) | WO2023050650A1 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116612216A (en) * | 2023-05-22 | 2023-08-18 | 南京硅基智能科技有限公司 | Digital person generation method and system based on social account |
CN116824010B (en) * | 2023-07-04 | 2024-03-26 | 安徽建筑大学 | Feedback type multiterminal animation design online interaction method and system |
CN118015161B (en) * | 2024-04-08 | 2024-06-28 | 之江实验室 | Method and device for generating rehabilitation video |
CN118338098B (en) * | 2024-06-12 | 2024-09-13 | 上海蜜度科技股份有限公司 | Multi-mode video generation method, system, storage medium and electronic equipment |
CN118524257B (en) * | 2024-07-24 | 2024-09-24 | 杭州华策影视科技有限公司 | Video transfer method, device, equipment and medium |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109118562A (en) * | 2018-08-31 | 2019-01-01 | 百度在线网络技术(北京)有限公司 | Explanation video creating method, device and the terminal of virtual image |
CN110880198A (en) * | 2018-09-06 | 2020-03-13 | 百度在线网络技术(北京)有限公司 | Animation generation method and device |
GB2577742A (en) * | 2018-10-05 | 2020-04-08 | Blupoint Ltd | Data processing apparatus and method |
CN110866968A (en) * | 2019-10-18 | 2020-03-06 | 平安科技(深圳)有限公司 | Method for generating virtual character video based on neural network and related equipment |
CN112184858B (en) * | 2020-09-01 | 2021-12-07 | 魔珐(上海)信息科技有限公司 | Virtual object animation generation method and device based on text, storage medium and terminal |
CN112381926B (en) * | 2020-11-13 | 2024-08-23 | 北京有竹居网络技术有限公司 | Method and device for generating video |
CN112927712B (en) * | 2021-01-25 | 2024-06-04 | 网易(杭州)网络有限公司 | Video generation method and device and electronic equipment |
CN113194348B (en) * | 2021-04-22 | 2022-07-22 | 清华珠三角研究院 | Virtual human lecture video generation method, system, device and storage medium |
CN113077537B (en) * | 2021-04-29 | 2023-04-25 | 广州虎牙科技有限公司 | Video generation method, storage medium and device |
CN113392231A (en) * | 2021-06-30 | 2021-09-14 | 中国平安人寿保险股份有限公司 | Method, device and equipment for generating freehand drawing video based on text and storage medium |
CN113870395A (en) * | 2021-09-29 | 2021-12-31 | 平安科技(深圳)有限公司 | Animation video generation method, device, equipment and storage medium |
- 2021-09-29: CN CN202111152667.XA patent/CN113870395A/en active Pending
- 2022-01-11: WO PCT/CN2022/071302 patent/WO2023050650A1/en active Application Filing
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023050650A1 (en) * | 2021-09-29 | 2023-04-06 | 平安科技(深圳)有限公司 | Animation video generation method and apparatus, and device and storage medium |
CN114598926A (en) * | 2022-01-20 | 2022-06-07 | 中国科学院自动化研究所 | Video generation method and device, electronic equipment and storage medium |
CN114598926B (en) * | 2022-01-20 | 2023-01-03 | 中国科学院自动化研究所 | Video generation method and device, electronic equipment and storage medium |
CN116579298A (en) * | 2022-01-30 | 2023-08-11 | 腾讯科技(深圳)有限公司 | Video generation method, device, equipment and storage medium |
CN116579298B (en) * | 2022-01-30 | 2024-08-06 | 腾讯科技(深圳)有限公司 | Video generation method, device, equipment and storage medium |
CN114567693A (en) * | 2022-02-11 | 2022-05-31 | 维沃移动通信有限公司 | Video generation method and device and electronic equipment |
CN114567693B (en) * | 2022-02-11 | 2024-01-30 | 维沃移动通信有限公司 | Video generation method and device and electronic equipment |
CN114979764A (en) * | 2022-04-25 | 2022-08-30 | 中国平安人寿保险股份有限公司 | Video generation method and device, computer equipment and storage medium |
CN114979764B (en) * | 2022-04-25 | 2024-02-06 | 中国平安人寿保险股份有限公司 | Video generation method, device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2023050650A1 (en) | 2023-04-06 |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 
 | REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40062773; Country of ref document: HK