CN118138855A - Method and system for generating realistic video based on multiple artificial intelligence technologies - Google Patents
- Publication number
- CN118138855A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/8146—Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics
Abstract
The application discloses a method and a system for generating realistic video based on multiple artificial intelligence technologies, relating to the technical field of video generation. The method comprises: receiving natural language data input by a user; preprocessing the natural language data; inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content; inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image; and inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data. By using the large language model, the stable diffusion model and the XDSora model together, the method and system can generate high-resolution, visually realistic video content, and solve the problem that the prior art cannot integrate multiple artificial intelligence technologies to generate realistic video.
Description
Technical Field
The application relates to the technical field of video generation, and in particular to a method and a system for generating realistic video based on multiple artificial intelligence technologies.
Background
With the continuous development of technology, artificial intelligence has made significant progress. Within artificial intelligence, large language models (LLMs) and generative adversarial networks (GANs) are two important research directions.
A large language model is a model built with deep learning techniques that can process and generate natural language text at scale. Such models typically have billions of parameters or more, which enables them to learn the patterns, rules, and context of language and to generate realistic, coherent text.
A generative adversarial network is a deep learning model composed of a generator and a discriminator. The generator attempts to produce realistic data samples, such as images, audio, or text, while the discriminator attempts to distinguish the generator's output from real data. Through adversarial training the two compete with and improve each other, until the generator can produce high-quality samples that are difficult to distinguish from real data.
However, most LLMs currently cannot be applied directly to video generation, for two reasons. First, most existing LLMs are trained on text data, whereas video content is composed of images. Second, GANs have been widely used in image-to-image tasks (tasks in which the generated content is in units of images, that is, both input and output are still images), but they are used comparatively little in video-to-video tasks (tasks in which the generated content is dynamic video, that is, both input and output are video).
Disclosure of Invention
To this end, the present application provides a method, system and computer program product for generating a realistic video based on a plurality of artificial intelligence techniques, so as to solve the problem that the prior art cannot integrate the plurality of artificial intelligence techniques together to generate the realistic video.
In order to achieve the above object, the present application provides the following technical solutions:
In a first aspect, a method for generating realistic video based on a plurality of artificial intelligence techniques comprises:
Step 1: receiving natural language data input by a user;
Step 2: preprocessing the natural language data;
Step 3: inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content;
Step 4: inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image;
Step 5: inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data.
Optionally, in step 3, when the large language model is trained, a large amount of text data and corresponding real video data is acquired from various videos as a training set.
Optionally, the text data and the real video data are cut, labeled and normalized.
Optionally, during training, the accuracy of the large language model is improved by adjusting the size of the corpus, and the convergence of the model is accelerated by reducing the training time.
Optionally, in step 4, when the stable diffusion model is trained, its input parameters include text data and corresponding video data, where the text data has been optimized by the trained large language model.
Optionally, in step 4, during training, the stable diffusion model changes the style and quality of the image through the number of diffusion steps, the sampling rate and the noise intensity.
In a second aspect, a system for generating realistic video based on a plurality of artificial intelligence techniques comprises:
a data receiving module, configured to receive natural language data input by a user;
a data processing module, configured to preprocess the natural language data;
a scene content generation module, configured to input the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content;
an image generation module, configured to input the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image; and
a video generation module, configured to input the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data.
In a third aspect, a computer device includes a memory and a processor; the memory stores a computer program, and the processor implements the steps of the method for generating realistic video based on a plurality of artificial intelligence techniques when executing the computer program.
In a fourth aspect, a computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method for generating realistic video based on a plurality of artificial intelligence techniques.
In a fifth aspect, a computer program product comprises computer programs/instructions which, when executed by a processor, implement the steps of the method for generating realistic video based on a plurality of artificial intelligence techniques.
Compared with the prior art, the application has at least the following beneficial effects:
The application provides a method and a system for generating realistic video based on multiple artificial intelligence technologies. The method comprises: receiving natural language data input by a user; preprocessing the natural language data; inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content; inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image; and inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data. By using the large language model, the stable diffusion model and the XDSora model together, the method and system can generate high-resolution, visually realistic video content, and solve the problem that the prior art cannot integrate multiple artificial intelligence technologies to generate realistic video.
Drawings
To illustrate the prior art and the application more intuitively, exemplary drawings are presented below. It should be understood that the specific shapes and configurations shown in the drawings are generally not to be regarded as limiting conditions for carrying out the application; for example, based on the technical concepts and exemplary drawings disclosed herein, those skilled in the art can make routine adjustments or further optimizations to the addition, removal, or grouping of certain units (components), and to their specific shapes, positional relationships, connection modes, dimensional proportions, and the like.
Fig. 1 is a flowchart of a method for generating a realistic video based on multiple artificial intelligence techniques according to a first embodiment of the present application.
Detailed Description
The application will be further described in detail by means of specific embodiments with reference to the accompanying drawings.
In the description of the present application: unless otherwise indicated, the meaning of "a plurality" is two or more. The terms "first," "second," "third," and the like in this disclosure are intended to distinguish between the referenced objects without a special meaning in terms of technical connotation (e.g., should not be construed as emphasis on the degree of importance or order, etc.). The expressions "comprising", "including", "having", etc. also mean "not limited to" (certain units, components, materials, steps, etc.).
The terms such as "upper", "lower", "left", "right", "middle", and the like, as used herein, are generally used for the purpose of facilitating an intuitive understanding with reference to the drawings and are not intended to be an absolute limitation of the positional relationship in actual products.
Example 1
Referring to fig. 1, the present embodiment provides a method for generating a realistic video based on multiple artificial intelligence techniques, including:
S1: receiving natural language data input by a user;
Specifically, the natural language data may be text or voice; if it is input by voice, the voice data must first be automatically converted into text data. Because this embodiment accepts natural language input in either form, it can meet the needs of different users.
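The text-or-voice handling described for S1 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `VoiceInput` container, the `Transcriber` signature, and the `fake_transcribe` stand-in are all hypothetical names introduced here, and any real speech recognizer could be passed in their place.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class VoiceInput:
    """Raw audio supplied by the user (hypothetical container)."""
    audio_bytes: bytes


# Hypothetical signature for any speech-to-text backend.
Transcriber = Callable[[bytes], str]


def receive_natural_language(
    user_input: Union[str, VoiceInput],
    transcribe: Transcriber,
) -> str:
    """Return text for S1, converting voice input to text when needed."""
    if isinstance(user_input, VoiceInput):
        text = transcribe(user_input.audio_bytes)
    else:
        text = user_input
    text = text.strip()
    if not text:
        raise ValueError("natural language input is empty")
    return text


def fake_transcribe(audio: bytes) -> str:
    # Stand-in for a real recognizer: pretend the audio decodes to UTF-8 text.
    return audio.decode("utf-8")
```

Passing the recognizer as a parameter keeps the receiving step independent of any particular speech-to-text service, which the embodiment does not name.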
S2: preprocessing natural language data;
S3: inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content;
Specifically, to train the large language model, a large amount of text data and corresponding real video data is first obtained, according to the user's requirements, from various videos such as movies, television shows, or advertisements to serve as a training set. The obtained text data and real video data are then cut, labeled, and normalized to produce the final training samples, which are stored for subsequent use. Here, cutting means uniformly dividing each frame into a number of small sub-blocks so that the data is easier to classify and use; labeling means assigning a category or label to the image in each sub-block so that the model can recognize and learn from it; and normalization means converting each pixel value into the range 0 to 255, which simplifies computation, avoids excessive errors, and helps ensure the stability and performance of the model.
In this embodiment, the large language model has strong natural language understanding and generation capabilities: it can accurately understand and analyze the content and requirements input by the user and output the corresponding scene content (that is, the storyline and a description of the characters' actions). The accuracy of the model can be improved by adjusting parameters such as the size of the corpus, and the convergence of the model can be accelerated by reducing training time.
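The three preparation operations on video frames (cutting into sub-blocks, labeling each sub-block, and normalizing pixel values into 0 to 255) can be sketched in plain Python. This is only an illustration under stated assumptions: the embodiment does not specify the block size, the label source, or the normalization formula, so the square block size, the caller-supplied labels, and the min-max rescaling used here are assumptions, and all function names are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[float]]  # one grayscale frame as rows of pixel values


def cut_into_blocks(frame: Frame, block: int) -> List[Frame]:
    """Uniformly divide a frame into block x block sub-blocks, row-major."""
    height, width = len(frame), len(frame[0])
    if height % block or width % block:
        raise ValueError("frame size must be a multiple of the block size")
    blocks = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            blocks.append([row[left:left + block] for row in frame[top:top + block]])
    return blocks


def label_blocks(blocks: List[Frame], labels: List[str]) -> List[Tuple[str, Frame]]:
    """Attach one caller-supplied category label to each sub-block."""
    if len(blocks) != len(labels):
        raise ValueError("need exactly one label per sub-block")
    return list(zip(labels, blocks))


def normalize_to_0_255(frame: Frame) -> Frame:
    """Min-max rescale pixel values into the range 0 to 255 (assumed formula)."""
    flat = [pixel for row in frame for pixel in row]
    low, high = min(flat), max(flat)
    if high == low:
        return [[0.0 for _ in row] for row in frame]
    scale = 255.0 / (high - low)
    return [[(pixel - low) * scale for pixel in row] for row in frame]
```

For a 4 x 4 frame and a block size of 2, `cut_into_blocks` returns four 2 x 2 sub-blocks ordered left to right, top to bottom; each can then be labeled and normalized independently.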
S4: inputting scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image;
Specifically, when the Stable Diffusion model is trained, its input parameters include text data and corresponding video data, where the text data has been optimized by the trained large language model.
In this embodiment, Stable Diffusion has strong image generation capability and can convert the text data optimized by the large language model into a high-resolution image. During training, the style and quality of the generated images are controlled through the number of diffusion steps, the sampling rate, and the noise intensity.
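The three tunable quantities named above (number of diffusion steps, sampling rate, noise intensity) could be grouped into a single validated settings object, as in the sketch below. The class name, field names, and value ranges are assumptions made for illustration; the embodiment does not define units or valid ranges, and this is not the configuration interface of any particular Stable Diffusion library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiffusionSettings:
    """Hypothetical container for the three parameters named in the text."""
    diffusion_steps: int    # number of diffusion steps
    sampling_rate: float    # assumed here to be a fraction in (0, 1]
    noise_intensity: float  # assumed here to be a non-negative scale

    def __post_init__(self) -> None:
        # Reject values outside the ranges assumed for this sketch.
        if self.diffusion_steps < 1:
            raise ValueError("diffusion_steps must be at least 1")
        if not 0.0 < self.sampling_rate <= 1.0:
            raise ValueError("sampling_rate must be in (0, 1]")
        if self.noise_intensity < 0.0:
            raise ValueError("noise_intensity must be non-negative")
```

Keeping the settings immutable makes it straightforward to record which combination produced a given image style when comparing runs.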
S5: scene content and high resolution images are input into a pre-trained XDSora model to generate realistic video data.
Specifically, the XDSora model is based on the Sora model, a large text-to-video model regarded as the strongest of its kind. Compared with the Sora model, the XDSora model is trained on a large amount of additional 2D high-definition footage of human presenters, which makes its output more lifelike in 2D digital human interaction scenes. With this optimized training set, the XDSora model can better understand text descriptions and convert them into realistic video content, giving it higher realism and fidelity when rendering 2D digital human interaction scenes.
In this embodiment, the XDSora model has a strong ability to understand spatiotemporal relationships and extract contextual information; it considers both changes along the time dimension and positional relationships in the spatial dimension, and thereby generates more realistic and natural video.
In the method of this embodiment, the pre-trained large language model, stable diffusion model, and XDSora model are integrated into one unified framework. A user who needs a realistic video in a given style only has to provide the desired content to the framework to obtain the corresponding video output. Through the close coordination and optimization of the three models, high-resolution, visually realistic video content can be generated.
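The data flow of the unified framework (S2 to S5) can be sketched as a chain of four stages. This is a structural illustration only: the class name `RealisticVideoPipeline`, the callable signatures, and the stub stages are hypothetical, and no real large language model, Stable Diffusion, or XDSora interface is assumed. Images and video are represented as opaque bytes.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; each stage is supplied by the caller.
Preprocess = Callable[[str], str]
SceneModel = Callable[[str], str]            # large language model: text -> scene content
ImageModel = Callable[[str], bytes]          # stable diffusion model: scene content -> image
VideoModel = Callable[[str, bytes], bytes]   # XDSora model: (scene content, image) -> video


@dataclass
class RealisticVideoPipeline:
    preprocess: Preprocess
    scene_model: SceneModel
    image_model: ImageModel
    video_model: VideoModel

    def generate(self, natural_language: str) -> bytes:
        cleaned = self.preprocess(natural_language)  # S2: preprocessing
        scene = self.scene_model(cleaned)            # S3: scene content
        image = self.image_model(scene)              # S4: high-resolution image
        return self.video_model(scene, image)        # S5: video from scene + image


# Stub stages that only trace the data flow; they generate nothing real.
demo = RealisticVideoPipeline(
    preprocess=str.strip,
    scene_model=lambda text: f"scene:{text}",
    image_model=lambda scene: scene.encode("utf-8"),
    video_model=lambda scene, image: b"video|" + image,
)
```

Note that the video stage receives both the scene content and the image, matching S5, rather than the image alone.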
In summary, the method for generating a realistic video based on multiple artificial intelligence technologies provided by the embodiment has the following advantages:
(1) The generated video has high resolution, vivid light and shadow effects, and rich detail, and can meet users' requirements for video quality;
(2) Character actions, scene transitions, and the like resemble the real world and are difficult to distinguish from real video, giving a strong sense of realism;
(3) The method can generate video content in the corresponding style from the different types of text descriptions input by a user, so as to adapt to various requirements;
(4) The method is adjustable: further optimization, such as changing a character's appearance or adjusting the scene layout, can be performed according to the user's own preferences.
Example 2
The embodiment provides a system for generating realistic video based on a plurality of artificial intelligence technologies, which comprises:
a data receiving module, configured to receive natural language data input by a user;
a data processing module, configured to preprocess the natural language data;
a scene content generation module, configured to input the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content;
an image generation module, configured to input the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image; and
a video generation module, configured to input the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data.
For the specific implementation of each module of the system, refer to the description of the method for generating realistic video based on multiple artificial intelligence techniques above; it is not repeated here.
Example 3
The present embodiment provides a computer device comprising a memory and a processor; the memory stores a computer program, and the processor implements the steps of the method for generating realistic video based on a plurality of artificial intelligence techniques when executing the computer program.
Example 4
The present embodiment provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for generating realistic video based on a plurality of artificial intelligence techniques.
Example 5
The present embodiment provides a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method for generating realistic video based on a plurality of artificial intelligence techniques.
The technical features of the above embodiments may be combined in any manner, provided the combination involves no contradiction. For brevity, not all possible combinations are described; combinations not explicitly written out should nevertheless be considered within the scope of this description.
Claims (10)
1. A method for generating realistic video based on a plurality of artificial intelligence techniques, comprising:
Step 1: receiving natural language data input by a user;
Step 2: preprocessing the natural language data;
Step 3: inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content;
Step 4: inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image;
Step 5: inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data.
2. The method for generating realistic video based on a plurality of artificial intelligence techniques according to claim 1, wherein in step 3, when the large language model is trained, a large amount of text data and corresponding real video data is acquired from various videos as a training set.
3. The method for generating realistic video based on a plurality of artificial intelligence techniques according to claim 2, wherein the text data and the real video data are cut, labeled and normalized.
4. The method for generating realistic video based on a plurality of artificial intelligence techniques according to claim 2, wherein, during training, the accuracy of the large language model is improved by adjusting the size of the corpus, and the convergence of the model is accelerated by reducing the training time.
5. The method for generating realistic video based on a plurality of artificial intelligence techniques according to claim 1, wherein in step 4, when the stable diffusion model is trained, its input parameters include text data and corresponding video data, and wherein the text data has been optimized by the trained large language model.
6. The method for generating realistic video based on a plurality of artificial intelligence techniques according to claim 1, wherein in step 4, during training, the stable diffusion model changes the style and quality of the image through the number of diffusion steps, the sampling rate and the noise intensity.
7. A system for generating realistic video based on a plurality of artificial intelligence techniques, comprising:
a data receiving module, configured to receive natural language data input by a user;
a data processing module, configured to preprocess the natural language data;
a scene content generation module, configured to input the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content;
an image generation module, configured to input the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image; and
a video generation module, configured to input the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410383367.XA CN118138855A (en) | 2024-03-29 | 2024-03-29 | Method and system for generating realistic video based on multiple artificial intelligence technologies |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410383367.XA CN118138855A (en) | 2024-03-29 | 2024-03-29 | Method and system for generating realistic video based on multiple artificial intelligence technologies |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118138855A (en) | 2024-06-04 |
Family
ID=91237712
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410383367.XA Pending CN118138855A (en) | 2024-03-29 | 2024-03-29 | Method and system for generating realistic video based on multiple artificial intelligence technologies |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118138855A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118507087A (en) * | 2024-07-22 | 2024-08-16 | 浪潮云信息技术股份公司 | Medical industry information transfer method, device, equipment and medium based on large model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||