
CN118138855A - Method and system for generating realistic video based on multiple artificial intelligence technologies - Google Patents

Method and system for generating realistic video based on multiple artificial intelligence technologies

Info

Publication number
CN118138855A
CN118138855A
Authority
CN
China
Prior art keywords
model
data
artificial intelligence
trained
generate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410383367.XA
Other languages
Chinese (zh)
Inventor
李佳楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiaoduo Intelligent Technology Beijing Co ltd
Original Assignee
Xiaoduo Intelligent Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiaoduo Intelligent Technology Beijing Co ltd filed Critical Xiaoduo Intelligent Technology Beijing Co ltd
Priority to CN202410383367.XA priority Critical patent/CN118138855A/en
Publication of CN118138855A publication Critical patent/CN118138855A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8146Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Security & Cryptography (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a method and a system for generating realistic video based on multiple artificial intelligence technologies, relating to the technical field of video generation. The method comprises: receiving natural language data input by a user; preprocessing the natural language data; inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content; inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image; and inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data. By combining the large language model, the stable diffusion model, and the XDSora model, the method and system can generate high-resolution, visually realistic video content, solving the problem that the prior art cannot integrate multiple artificial intelligence technologies to generate realistic video.

Description

Method and system for generating realistic video based on multiple artificial intelligence technologies
Technical Field
The application relates to the technical field of video generation, in particular to a method and a system for generating a realistic video based on various artificial intelligence technologies.
Background
With the continuous development of technology, artificial intelligence has made significant progress. Within artificial intelligence, large language models (LLMs) and generative adversarial networks (GANs) are two important research directions.
A large language model is a model built on deep learning techniques that can process and generate natural language text at scale. These models typically consist of billions or even hundreds of billions of parameters, enabling them to learn the patterns, rules, and context of language and to generate fluent, coherent text.
A generative adversarial network is a deep learning model composed of a generator and a discriminator. The generator attempts to produce realistic data samples, such as images, audio, or text, while the discriminator attempts to distinguish the generator's output from real data. The two are trained adversarially, competing with and improving each other, so that the generator eventually produces high-quality samples that are difficult to distinguish from real data.
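To make the adversarial setup above concrete, the two objectives can be sketched as a pair of losses. This is an illustrative stand-in using plain Python callables; the names `gen` and `disc` and the toy inputs are assumptions for exposition, not code from the patent:

```python
import math

def adversarial_losses(gen, disc, real, noise):
    """One step of the GAN objective: the discriminator is rewarded for
    scoring real data high and generated data low, while the generator is
    rewarded for fooling the discriminator. `gen` and `disc` are stand-in
    callables (e.g. trivial lambdas), not trained networks."""
    fake = gen(noise)
    # Discriminator: maximize log D(real) + log (1 - D(fake)),
    # written here as a loss to be minimized.
    d_loss = -(math.log(disc(real)) + math.log(1.0 - disc(fake)))
    # Generator: maximize log D(fake), i.e. make fakes look real.
    g_loss = -math.log(disc(fake))
    return d_loss, g_loss
```

With an untrained discriminator that outputs 0.5 everywhere, both losses sit at their equilibrium values, which is the classic starting point of the minimax game.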
However, most LLMs are currently not directly applicable to video generation, for two reasons. First, most existing LLMs are trained on text data, while video content is composed of images. Second, GANs have been widely used in image-to-image tasks (tasks in which both input and output are still images), but their use in video-to-video tasks (tasks in which both input and output are dynamic video) remains relatively limited.
Disclosure of Invention
To this end, the present application provides a method, system and computer program product for generating a realistic video based on a plurality of artificial intelligence techniques, so as to solve the problem that the prior art cannot integrate the plurality of artificial intelligence techniques together to generate the realistic video.
In order to achieve the above object, the present application provides the following technical solutions:
In a first aspect, a method for generating a realistic video based on a plurality of artificial intelligence techniques comprises:
Step 1: receiving natural language data input by a user;
Step 2: preprocessing the natural language data;
Step 3: inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content;
Step 4: inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image;
Step 5: inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data.
Optionally, in step 2, the large language model acquires, during training, a large amount of text data and corresponding real video data from various videos as a training set.
Optionally, the text data and the real video data need to be cut, annotated and normalized.
Optionally, during training the large language model improves its accuracy by adjusting the size of the corpus, and accelerates its convergence, thereby reducing training time.
Optionally, in step 3, when the stable diffusion model is trained, its input parameters include text data and the corresponding video data, where the text data has been optimized by the trained large language model.
Optionally, in step 3, during training the stable diffusion model varies the style and quality of the generated image through the number of diffusion steps, the sampling rate, and the noise intensity.
In a second aspect, a system for generating realistic video based on a plurality of artificial intelligence techniques, comprising:
the data receiving module is used for receiving natural language data input by a user;
the data processing module is used for preprocessing the natural language data;
the scene content generation module is used for inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content;
the image generation module is used for inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image;
and the video generation module is used for inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data.
In a third aspect, a computer device includes a memory storing a computer program and a processor implementing steps of a method of generating realistic video based on a plurality of artificial intelligence techniques when the computer program is executed.
In a fourth aspect, a computer readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of a method of generating realistic video based on a plurality of artificial intelligence techniques.
In a fifth aspect, a computer program product comprises computer programs/instructions which, when executed by a processor, implement the steps of a method of generating realistic video based on a plurality of artificial intelligence techniques.
Compared with the prior art, the application has at least the following beneficial effects:
The application provides a method and a system for generating realistic video based on multiple artificial intelligence technologies. The method comprises: receiving natural language data input by a user; preprocessing the natural language data; inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content; inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image; and inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data. By combining the large language model, the stable diffusion model, and the XDSora model, the method and system can generate high-resolution, visually realistic video content, solving the problem that the prior art cannot integrate multiple artificial intelligence technologies to generate realistic video.
Drawings
To illustrate the prior art and the application more intuitively, exemplary drawings are presented below. It should be understood that the specific shapes and configurations shown in the drawings are not, in general, limiting conditions for carrying out the application; for example, based on the technical concepts and exemplary drawings disclosed herein, those skilled in the art can make routine adjustments or further optimizations to the addition, omission, or division of certain units (components), their specific shapes, positional relationships, connection modes, dimensional proportions, and the like.
Fig. 1 is a flowchart of a method for generating a realistic video based on multiple artificial intelligence techniques according to a first embodiment of the present application.
Detailed Description
The application will be further described in detail by means of specific embodiments with reference to the accompanying drawings.
In the description of the present application: unless otherwise indicated, "a plurality" means two or more. Terms such as "first", "second", and "third" merely distinguish the objects they refer to and carry no special technical meaning (for example, they should not be construed as indicating degree of importance or order). Expressions such as "comprising", "including", and "having" also mean "not limited to" (certain units, components, materials, steps, etc.).
Terms such as "upper", "lower", "left", "right", and "middle" are generally used to aid intuitive understanding with reference to the drawings and are not intended as absolute limitations on positional relationships in actual products.
Example 1
Referring to fig. 1, the present embodiment provides a method for generating a realistic video based on multiple artificial intelligence techniques, including:
s1: receiving natural language data input by a user;
Specifically, the natural language data may be text or voice; if the input is voice, the voice data is automatically converted into text data. Because data can be input in natural language, this embodiment can meet the needs of different users.
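A minimal sketch of step S1, assuming a placeholder `transcribe` function in place of a real speech-to-text system (the `UserInput` container and all names here are illustrative assumptions, not the patent's implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UserInput:
    text: Optional[str] = None   # direct text input
    audio: Optional[bytes] = None  # raw voice data

def transcribe(audio: bytes) -> str:
    # Stand-in for a real automatic-speech-recognition system; here we
    # simply decode UTF-8 bytes so the interface can be exercised end to end.
    return audio.decode("utf-8")

def receive_natural_language(inp: UserInput) -> str:
    """Step S1: accept text directly, or convert voice input to text."""
    if inp.text is not None:
        return inp.text
    if inp.audio is not None:
        return transcribe(inp.audio)
    raise ValueError("input must contain text or audio")
```

In a real deployment `transcribe` would call an ASR service; only the branching logic (text passes through, voice is converted) reflects what the text describes.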
S2: preprocessing natural language data;
S3: inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content;
Specifically, when the large language model is trained, a large amount of text data and corresponding real video data are first obtained, according to user requirements, from various videos such as movies, television shows, or advertisements to serve as a training set. The obtained text data and real video data are then cut, labeled, and normalized to produce the final input training samples, which are stored for subsequent use. Here, cutting refers to uniformly dividing each frame into a number of small sub-blocks to ease the classification and use of the data; labeling marks each small sub-block with a category or label for the model to identify and learn from; and normalization converts each pixel value into the range 0 to 255 so that the computer can process it without excessive error, ensuring the stability and performance of the model.
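The cutting and normalization operations described above can be sketched in a few lines of plain Python. The block size and the 0–255 target range follow the text; the function names and the list-of-lists frame representation are illustrative assumptions:

```python
def split_into_subblocks(frame, block):
    """Uniformly divide a frame (a 2D list of pixel values) into
    block x block sub-blocks, as the 'cutting' step describes."""
    h, w = len(frame), len(frame[0])
    return [
        [row[x:x + block] for row in frame[y:y + block]]
        for y in range(0, h, block)
        for x in range(0, w, block)
    ]

def normalize(frame, lo=0, hi=255):
    """Rescale pixel values into the [lo, hi] range stated in the text."""
    flat = [p for row in frame for p in row]
    mn, mx = min(flat), max(flat)
    scale = (hi - lo) / (mx - mn) if mx != mn else 0
    return [[lo + (p - mn) * scale for p in row] for row in frame]
```

The labeling step is omitted because it assigns human- or model-chosen categories rather than being a mechanical transform.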
In this embodiment, the large language model has strong natural language understanding and generation capabilities: it can accurately understand and analyze the content and requirements input by the user and output corresponding scene content (i.e., a storyline and character action commentary). The accuracy of the model can be improved by adjusting parameters such as the corpus size, and its convergence can be accelerated to reduce training time.
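As an illustration of this scene-content step, a minimal interface might look as follows. Here `llm` is any prompt-to-completion callable; the prompt wording and the fallback string are placeholders, not the patent's actual prompt:

```python
def generate_scene_content(user_text: str, llm=None) -> dict:
    """Step S3 sketch: turn preprocessed user text into scene content
    (storyline plus character action commentary) via a large language
    model. `llm` is any callable mapping prompt -> completion; a trivial
    echo stub is used when none is supplied."""
    prompt = (
        "Write a storyline and character action commentary "
        "for a video about:\n" + user_text
    )
    if llm is not None:
        completion = llm(prompt)
    else:
        # Placeholder output so the interface is runnable without a model.
        completion = f"Storyline: {user_text}. Actions: (stub)"
    return {"prompt": prompt, "scene_content": completion}
```

Any real LLM client could be dropped in as `llm`; the point is only that scene content is a text artifact derived from the user's request.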
S4: inputting scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image;
Specifically, during training the input parameters of the Stable Diffusion model include text data and the corresponding video data, where the text data has been optimized by the trained large language model.
In this embodiment, Stable Diffusion has strong image generation capability and can convert text data optimized by the large language model into a high-resolution image. During training, Stable Diffusion varies the style and quality of the image through the number of diffusion steps, the sampling rate, and the noise intensity.
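The three knobs named here (diffusion steps, sampling rate, noise intensity) can be grouped into a configuration object. This is a sketch of the interface only, with assumed field names; `model` stands in for a trained Stable Diffusion pipeline. For comparison, real pipelines such as Hugging Face's `StableDiffusionPipeline` expose analogous parameters like `num_inference_steps` and `guidance_scale`:

```python
from dataclasses import dataclass

@dataclass
class DiffusionConfig:
    # The three parameters the text says control style and quality.
    num_steps: int = 50          # number of diffusion (denoising) steps
    sampling_rate: float = 1.0   # hypothetical sampler setting
    noise_strength: float = 0.8  # initial noise intensity

def generate_image(scene_text: str, cfg: DiffusionConfig, model=None):
    """Step S4 sketch: produce a high-resolution image from scene text.
    Without a real model, return a record of what would be requested."""
    if model is not None:
        return model(scene_text, cfg)
    return {"prompt": scene_text,
            "steps": cfg.num_steps,
            "noise": cfg.noise_strength}
```

Raising `num_steps` generally trades speed for fidelity, which matches the text's claim that these settings change the style and quality of the output.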
S5: inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data.
Specifically, the XDSora model is based on the Sora model, a large text-to-video model. Compared with Sora, XDSora introduces a large amount of 2D high-definition human broadcast footage as a training set, making its output more lifelike in 2D digital-human interaction scenes. With this optimized training set, the XDSora model can better understand text descriptions and convert them into realistic video content. This improvement gives XDSora higher realism and fidelity when rendering 2D digital-human interactive scenes.
In this embodiment, the XDSora model has a strong capability for understanding spatio-temporal relationships and extracting context information; it considers both changes along the time dimension and positional relationships along the spatial dimensions, producing more real and natural video.
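A minimal sketch of the video-generation call shape, with `model` standing in for the trained XDSora model (the fallback frame list only demonstrates the interface; the function and field names are assumptions):

```python
def generate_video(scene_text: str, key_image, num_frames: int = 8,
                   model=None):
    """Step S5 sketch: produce a frame sequence from scene content and a
    high-resolution key image. `model` stands in for the trained
    XDSora text+image-to-video model."""
    if model is not None:
        return model(scene_text, key_image, num_frames)
    # Placeholder: repeat the key image with frame indices so callers can
    # see the shape of the output without a real model.
    return [{"frame": i, "image": key_image, "caption": scene_text}
            for i in range(num_frames)]
```

The interesting design point is that the model consumes both the scene text and the image, matching the text's claim that XDSora conditions jointly on spatial content and the temporal description.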
According to the method of this embodiment for generating realistic video based on multiple artificial intelligence technologies, the pre-trained large language model, stable diffusion model, and XDSora model are integrated into one unified framework; when a user needs to generate a realistic video in a given style, the user only needs to provide the desired content to the framework to obtain the corresponding video output. Through the precise coordination and optimization of the three models, high-resolution, visually lifelike video content can be generated.
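The unified framework can be sketched as a simple composition of three callables; each argument below is a hypothetical stand-in for the corresponding pre-trained model, not an actual implementation:

```python
def text_to_video_pipeline(user_input: str, llm, diffusion, xdsora):
    """Chain the three pre-trained models into one framework:
    LLM -> scene content, Stable Diffusion -> key image,
    XDSora -> realistic video. Each argument is a callable standing in
    for the corresponding model."""
    scene = llm(user_input)          # step S3: scene content
    image = diffusion(scene)         # step S4: high-resolution key image
    return xdsora(scene, image)      # step S5: realistic video
```

The composition makes the data flow explicit: the XDSora stage receives both the scene content and the image, exactly as steps S3 to S5 describe.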
In summary, the method for generating a realistic video based on multiple artificial intelligence technologies provided by the embodiment has the following advantages:
(1) The generated video has high resolution, vivid lighting and shadow effects, and rich detail, meeting users' requirements for video quality;
(2) Character actions, scene transitions, and the like closely resemble the real world and are difficult to distinguish from real video, giving a strong sense of realism;
(3) Video content of a corresponding style can be generated from different types of text descriptions input by the user, adapting to a variety of needs;
(4) Further optimization, such as changing character appearance or adjusting scene layout, can be performed according to the user's own preferences, providing adjustability.
Example two
The embodiment provides a system for generating a realistic video based on a plurality of artificial intelligence technologies, which comprises:
the data receiving module is used for receiving natural language data input by a user;
the data processing module is used for preprocessing the natural language data;
the scene content generation module is used for inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content;
the image generation module is used for inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image;
and the video generation module is used for inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data.
For the specific implementation of each module of the system for generating realistic video based on multiple artificial intelligence technologies, reference may be made to the description of the corresponding method above, which is not repeated here.
Example III
The present embodiment provides a computer device comprising a memory storing a computer program and a processor implementing the steps of a method of generating realistic video based on a plurality of artificial intelligence techniques when the computer program is executed.
Example IV
The present embodiment provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of a method of generating realistic video based on a plurality of artificial intelligence techniques.
Example five
The present embodiments provide a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of a method of generating realistic video based on a variety of artificial intelligence techniques.
Any combination of the technical features of the above embodiments may be made (as long as the combined features are not contradictory); for brevity of description, not all possible combinations are enumerated. Such combinations, though not written out explicitly, should also be considered within the scope of this description.

Claims (10)

1. A method for generating realistic video based on a plurality of artificial intelligence techniques, comprising:
Step 1: receiving natural language data input by a user;
Step 2: preprocessing the natural language data;
Step 3: inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content;
Step 4: inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image;
Step 5: inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data.
2. The method for generating realistic video based on a plurality of artificial intelligence techniques according to claim 1, wherein in step 2, the large language model acquires, during training, a large amount of text data and corresponding real video data from various videos as a training set.
3. The method for generating realistic video based on a plurality of artificial intelligence techniques according to claim 2, wherein the text data and the real video data are subjected to cutting, labeling, and normalization processing.
4. The method for generating realistic video based on a plurality of artificial intelligence techniques according to claim 2, wherein during training the large language model improves its accuracy by adjusting the size of the corpus and accelerates its convergence, thereby reducing training time.
5. The method for generating realistic video based on a plurality of artificial intelligence techniques according to claim 1, wherein in step 3, the input parameters of the stable diffusion model include text data and corresponding video data, and wherein the text data is data optimized using the trained large language model.
6. The method for generating realistic video based on a plurality of artificial intelligence techniques according to claim 1, wherein in step 3, during training the stable diffusion model changes the style and quality of the image through the number of diffusion steps, the sampling rate, and the noise intensity.
7. A system for generating realistic video based on a plurality of artificial intelligence techniques, comprising:
the data receiving module is used for receiving natural language data input by a user;
the data processing module is used for preprocessing the natural language data;
The scene content generation module is used for inputting the preprocessed natural language data into a pre-trained large-scale language model to generate corresponding scene content;
the image generation module is used for inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image;
And the video generation module is used for inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any of claims 1 to 6.
CN202410383367.XA 2024-03-29 2024-03-29 Method and system for generating realistic video based on multiple artificial intelligence technologies Pending CN118138855A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410383367.XA CN118138855A (en) 2024-03-29 2024-03-29 Method and system for generating realistic video based on multiple artificial intelligence technologies

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410383367.XA CN118138855A (en) 2024-03-29 2024-03-29 Method and system for generating realistic video based on multiple artificial intelligence technologies

Publications (1)

Publication Number Publication Date
CN118138855A 2024-06-04

Family

ID=91237712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410383367.XA Pending CN118138855A (en) 2024-03-29 2024-03-29 Method and system for generating realistic video based on multiple artificial intelligence technologies

Country Status (1)

Country Link
CN (1) CN118138855A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118507087A (en) * 2024-07-22 2024-08-16 浪潮云信息技术股份公司 Medical industry information transfer method, device, equipment and medium based on large model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination