
CN118138855A - Method and system for generating realistic video based on multiple artificial intelligence technologies - Google Patents

Method and system for generating realistic video based on multiple artificial intelligence technologies

Info

Publication number
CN118138855A
CN118138855A
Authority
CN
China
Prior art keywords
model
data
artificial intelligence
trained
generate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410383367.XA
Other languages
Chinese (zh)
Inventor
李佳楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiaoduo Intelligent Technology Beijing Co ltd
Original Assignee
Xiaoduo Intelligent Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiaoduo Intelligent Technology Beijing Co ltd filed Critical Xiaoduo Intelligent Technology Beijing Co ltd
Priority to CN202410383367.XA priority Critical patent/CN118138855A/en
Publication of CN118138855A publication Critical patent/CN118138855A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8146Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Security & Cryptography (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a method and a system for generating realistic video based on multiple artificial intelligence technologies, relating to the technical field of video generation. The method comprises: receiving natural language data input by a user; preprocessing the natural language data; inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content; inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image; and inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data. By combining the large language model, the stable diffusion model, and the XDSora model, the method and system can generate high-resolution, visually realistic video content, solving the problem that the prior art cannot integrate multiple artificial intelligence technologies to generate realistic video.

Description

Method and system for generating realistic video based on multiple artificial intelligence technologies
Technical Field
The application relates to the technical field of video generation, in particular to a method and a system for generating a realistic video based on various artificial intelligence technologies.
Background
With the continuous development of technology, artificial intelligence has made significant progress. Within artificial intelligence, large language models (LLMs) and generative adversarial networks (GANs) are two important research directions.
A large language model is a model built on deep learning techniques that can process and generate natural language text at scale. These models typically consist of billions or even hundreds of billions of parameters, enabling them to learn the patterns, rules, and context of language and to generate fluent, coherent text.
A generative adversarial network is a deep learning model composed of a generator and a discriminator. The generator attempts to produce realistic data samples, such as images, audio, or text, while the discriminator attempts to distinguish the generator's output from real data. The two are trained adversarially, competing with and improving each other, so that the generator eventually produces high-quality samples that are difficult to distinguish from real data.
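To make the adversarial setup above concrete, the two objectives can be sketched as a pair of losses. This is an illustrative stand-in using plain Python callables; the names `gen` and `disc` and the toy inputs are assumptions for exposition, not code from the patent:

```python
import math

def adversarial_losses(gen, disc, real, noise):
    """One step of the GAN objective: the discriminator is rewarded for
    scoring real data high and generated data low, while the generator is
    rewarded for fooling the discriminator. `gen` and `disc` are stand-in
    callables (e.g. trivial lambdas), not trained networks."""
    fake = gen(noise)
    # Discriminator: maximize log D(real) + log (1 - D(fake)),
    # written here as a loss to be minimized.
    d_loss = -(math.log(disc(real)) + math.log(1.0 - disc(fake)))
    # Generator: maximize log D(fake), i.e. make fakes look real.
    g_loss = -math.log(disc(fake))
    return d_loss, g_loss
```

With an untrained discriminator that outputs 0.5 everywhere, both losses sit at their equilibrium values, which is the classic starting point of the minimax game.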
However, most LLMs are currently not directly applicable to video generation, for two reasons. First, most existing LLMs are trained on text data, while video content is composed of images. Second, GANs have been widely used in image-to-image tasks (tasks in which both input and output are still images), but their use in video-to-video tasks (tasks in which both input and output are dynamic video) remains relatively limited.
Disclosure of Invention
To this end, the present application provides a method, system and computer program product for generating a realistic video based on a plurality of artificial intelligence techniques, so as to solve the problem that the prior art cannot integrate the plurality of artificial intelligence techniques together to generate the realistic video.
In order to achieve the above object, the present application provides the following technical solutions:
In a first aspect, a method for generating a realistic video based on a plurality of artificial intelligence techniques comprises:
Step 1: receiving natural language data input by a user;
Step 2: preprocessing the natural language data;
Step 3: inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content;
Step 4: inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image;
Step 5: inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data.
Optionally, in step 2, the large language model acquires, during training, a large amount of text data and corresponding real video data from various videos as a training set.
Optionally, the text data and the real video data need to be cut, annotated and normalized.
Optionally, during training the large language model improves its accuracy by adjusting the size of the corpus, and accelerates its convergence, thereby reducing training time.
Optionally, in step 3, when the stable diffusion model is trained, its input parameters include text data and the corresponding video data, where the text data has been optimized by the trained large language model.
Optionally, in step 3, during training the stable diffusion model varies the style and quality of the generated image through the number of diffusion steps, the sampling rate, and the noise intensity.
In a second aspect, a system for generating realistic video based on a plurality of artificial intelligence techniques, comprising:
the data receiving module is used for receiving natural language data input by a user;
the data processing module is used for preprocessing the natural language data;
the scene content generation module is used for inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content;
the image generation module is used for inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image;
and the video generation module is used for inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data.
In a third aspect, a computer device includes a memory storing a computer program and a processor implementing steps of a method of generating realistic video based on a plurality of artificial intelligence techniques when the computer program is executed.
In a fourth aspect, a computer readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of a method of generating realistic video based on a plurality of artificial intelligence techniques.
In a fifth aspect, a computer program product comprises computer programs/instructions which, when executed by a processor, implement the steps of a method of generating realistic video based on a plurality of artificial intelligence techniques.
Compared with the prior art, the application has at least the following beneficial effects:
The application provides a method and a system for generating realistic video based on multiple artificial intelligence technologies. The method comprises: receiving natural language data input by a user; preprocessing the natural language data; inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content; inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image; and inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data. By combining the large language model, the stable diffusion model, and the XDSora model, the method and system can generate high-resolution, visually realistic video content, solving the problem that the prior art cannot integrate multiple artificial intelligence technologies to generate realistic video.
Drawings
To illustrate the prior art and the application more intuitively, exemplary drawings are presented below. It should be understood that the specific shapes and configurations shown in the drawings are not, in general, limiting conditions for carrying out the application; for example, based on the technical concepts and exemplary drawings disclosed herein, those skilled in the art can make routine adjustments or further optimizations to the addition, omission, or division of certain units (components), their specific shapes, positional relationships, connection modes, dimensional proportions, and the like.
Fig. 1 is a flowchart of a method for generating a realistic video based on multiple artificial intelligence techniques according to a first embodiment of the present application.
Detailed Description
The application will be further described in detail by means of specific embodiments with reference to the accompanying drawings.
In the description of the present application: unless otherwise indicated, "a plurality" means two or more. Terms such as "first", "second", and "third" merely distinguish the objects they refer to and carry no special technical meaning (for example, they should not be construed as indicating degree of importance or order). Expressions such as "comprising", "including", and "having" also mean "not limited to" (certain units, components, materials, steps, etc.).
Terms such as "upper", "lower", "left", "right", and "middle" are generally used to aid intuitive understanding with reference to the drawings and are not intended as absolute limitations on positional relationships in actual products.
Example 1
Referring to fig. 1, the present embodiment provides a method for generating a realistic video based on multiple artificial intelligence techniques, including:
s1: receiving natural language data input by a user;
Specifically, the natural language data may be text or voice; if the input is voice, the voice data is automatically converted into text data. Because data can be input in natural language, this embodiment can meet the needs of different users.
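A minimal sketch of step S1, assuming a placeholder `transcribe` function in place of a real speech-to-text system (the `UserInput` container and all names here are illustrative assumptions, not the patent's implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UserInput:
    text: Optional[str] = None   # direct text input
    audio: Optional[bytes] = None  # raw voice data

def transcribe(audio: bytes) -> str:
    # Stand-in for a real automatic-speech-recognition system; here we
    # simply decode UTF-8 bytes so the interface can be exercised end to end.
    return audio.decode("utf-8")

def receive_natural_language(inp: UserInput) -> str:
    """Step S1: accept text directly, or convert voice input to text."""
    if inp.text is not None:
        return inp.text
    if inp.audio is not None:
        return transcribe(inp.audio)
    raise ValueError("input must contain text or audio")
```

In a real deployment `transcribe` would call an ASR service; only the branching logic (text passes through, voice is converted) reflects what the text describes.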
S2: preprocessing natural language data;
S3: inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content;
Specifically, when the large language model is trained, a large amount of text data and corresponding real video data are first obtained, according to user requirements, from various videos such as movies, television shows, or advertisements to serve as a training set. The obtained text data and real video data are then cut, labeled, and normalized to produce the final input training samples, which are stored for subsequent use. Here, cutting refers to uniformly dividing each frame into a number of small sub-blocks to ease the classification and use of the data; labeling marks each small sub-block with a category or label for the model to identify and learn from; and normalization converts each pixel value into the range 0 to 255 so that the computer can process it without excessive error, ensuring the stability and performance of the model.
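The cutting and normalization operations described above can be sketched in a few lines of plain Python. The block size and the 0–255 target range follow the text; the function names and the list-of-lists frame representation are illustrative assumptions:

```python
def split_into_subblocks(frame, block):
    """Uniformly divide a frame (a 2D list of pixel values) into
    block x block sub-blocks, as the 'cutting' step describes."""
    h, w = len(frame), len(frame[0])
    return [
        [row[x:x + block] for row in frame[y:y + block]]
        for y in range(0, h, block)
        for x in range(0, w, block)
    ]

def normalize(frame, lo=0, hi=255):
    """Rescale pixel values into the [lo, hi] range stated in the text."""
    flat = [p for row in frame for p in row]
    mn, mx = min(flat), max(flat)
    scale = (hi - lo) / (mx - mn) if mx != mn else 0
    return [[lo + (p - mn) * scale for p in row] for row in frame]
```

The labeling step is omitted because it assigns human- or model-chosen categories rather than being a mechanical transform.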
In this embodiment, the large language model has strong natural language understanding and generation capabilities: it can accurately understand and analyze the content and requirements input by the user and output corresponding scene content (i.e., a storyline and character action commentary). The accuracy of the model can be improved by adjusting parameters such as the corpus size, and its convergence can be accelerated to reduce training time.
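As an illustration of this scene-content step, a minimal interface might look as follows. Here `llm` is any prompt-to-completion callable; the prompt wording and the fallback string are placeholders, not the patent's actual prompt:

```python
def generate_scene_content(user_text: str, llm=None) -> dict:
    """Step S3 sketch: turn preprocessed user text into scene content
    (storyline plus character action commentary) via a large language
    model. `llm` is any callable mapping prompt -> completion; a trivial
    echo stub is used when none is supplied."""
    prompt = (
        "Write a storyline and character action commentary "
        "for a video about:\n" + user_text
    )
    if llm is not None:
        completion = llm(prompt)
    else:
        # Placeholder output so the interface is runnable without a model.
        completion = f"Storyline: {user_text}. Actions: (stub)"
    return {"prompt": prompt, "scene_content": completion}
```

Any real LLM client could be dropped in as `llm`; the point is only that scene content is a text artifact derived from the user's request.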
S4: inputting scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image;
Specifically, during training the input parameters of the Stable Diffusion model include text data and the corresponding video data, where the text data has been optimized by the trained large language model.
In this embodiment, Stable Diffusion has strong image generation capability and can convert text data optimized by the large language model into a high-resolution image. During training, Stable Diffusion varies the style and quality of the image through the number of diffusion steps, the sampling rate, and the noise intensity.
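The three knobs named here (diffusion steps, sampling rate, noise intensity) can be grouped into a configuration object. This is a sketch of the interface only, with assumed field names; `model` stands in for a trained Stable Diffusion pipeline. For comparison, real pipelines such as Hugging Face's `StableDiffusionPipeline` expose analogous parameters like `num_inference_steps` and `guidance_scale`:

```python
from dataclasses import dataclass

@dataclass
class DiffusionConfig:
    # The three parameters the text says control style and quality.
    num_steps: int = 50          # number of diffusion (denoising) steps
    sampling_rate: float = 1.0   # hypothetical sampler setting
    noise_strength: float = 0.8  # initial noise intensity

def generate_image(scene_text: str, cfg: DiffusionConfig, model=None):
    """Step S4 sketch: produce a high-resolution image from scene text.
    Without a real model, return a record of what would be requested."""
    if model is not None:
        return model(scene_text, cfg)
    return {"prompt": scene_text,
            "steps": cfg.num_steps,
            "noise": cfg.noise_strength}
```

Raising `num_steps` generally trades speed for fidelity, which matches the text's claim that these settings change the style and quality of the output.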
S5: inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data.
Specifically, the XDSora model is based on the Sora model, a large text-to-video model. Compared with Sora, XDSora introduces a large amount of 2D high-definition human broadcast footage as a training set, making its output more lifelike in 2D digital-human interaction scenes. With this optimized training set, the XDSora model can better understand text descriptions and convert them into realistic video content. This improvement gives XDSora higher realism and fidelity when rendering 2D digital-human interactive scenes.
In this embodiment, the XDSora model has a strong capability for understanding spatio-temporal relationships and extracting context information; it considers both changes along the time dimension and positional relationships along the spatial dimensions, producing more real and natural video.
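A minimal sketch of the video-generation call shape, with `model` standing in for the trained XDSora model (the fallback frame list only demonstrates the interface; the function and field names are assumptions):

```python
def generate_video(scene_text: str, key_image, num_frames: int = 8,
                   model=None):
    """Step S5 sketch: produce a frame sequence from scene content and a
    high-resolution key image. `model` stands in for the trained
    XDSora text+image-to-video model."""
    if model is not None:
        return model(scene_text, key_image, num_frames)
    # Placeholder: repeat the key image with frame indices so callers can
    # see the shape of the output without a real model.
    return [{"frame": i, "image": key_image, "caption": scene_text}
            for i in range(num_frames)]
```

The interesting design point is that the model consumes both the scene text and the image, matching the text's claim that XDSora conditions jointly on spatial content and the temporal description.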
According to the method of this embodiment for generating realistic video based on multiple artificial intelligence technologies, the pre-trained large language model, stable diffusion model, and XDSora model are integrated into one unified framework; when a user needs to generate a realistic video in a given style, the user only needs to provide the desired content to the framework to obtain the corresponding video output. Through the precise coordination and optimization of the three models, high-resolution, visually lifelike video content can be generated.
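The unified framework can be sketched as a simple composition of three callables; each argument below is a hypothetical stand-in for the corresponding pre-trained model, not an actual implementation:

```python
def text_to_video_pipeline(user_input: str, llm, diffusion, xdsora):
    """Chain the three pre-trained models into one framework:
    LLM -> scene content, Stable Diffusion -> key image,
    XDSora -> realistic video. Each argument is a callable standing in
    for the corresponding model."""
    scene = llm(user_input)          # step S3: scene content
    image = diffusion(scene)         # step S4: high-resolution key image
    return xdsora(scene, image)      # step S5: realistic video
```

The composition makes the data flow explicit: the XDSora stage receives both the scene content and the image, exactly as steps S3 to S5 describe.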
In summary, the method for generating a realistic video based on multiple artificial intelligence technologies provided by the embodiment has the following advantages:
(1) The generated video has high resolution, vivid lighting and shadow effects, and rich detail, meeting users' requirements for video quality;
(2) Character actions, scene transitions, and the like closely resemble the real world and are difficult to distinguish from real video, giving a strong sense of realism;
(3) Video content of a corresponding style can be generated from different types of text descriptions input by the user, adapting to a variety of needs;
(4) Further optimization, such as changing character appearance or adjusting scene layout, can be performed according to the user's own preferences, providing adjustability.
Example two
The embodiment provides a system for generating a realistic video based on a plurality of artificial intelligence technologies, which comprises:
the data receiving module is used for receiving natural language data input by a user;
the data processing module is used for preprocessing the natural language data;
the scene content generation module is used for inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content;
the image generation module is used for inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image;
and the video generation module is used for inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data.
For the specific implementation of each module of the system for generating realistic video based on multiple artificial intelligence technologies, reference may be made to the description of the corresponding method above, which is not repeated here.
Example III
The present embodiment provides a computer device comprising a memory storing a computer program and a processor implementing the steps of a method of generating realistic video based on a plurality of artificial intelligence techniques when the computer program is executed.
Example IV
The present embodiment provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of a method of generating realistic video based on a plurality of artificial intelligence techniques.
Example five
The present embodiments provide a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of a method of generating realistic video based on a variety of artificial intelligence techniques.
Any combination of the technical features of the above embodiments may be made (as long as the combined features are not contradictory); for brevity of description, not all possible combinations are enumerated. Such combinations, though not written out explicitly, should also be considered within the scope of this description.

Claims (10)

1. A method for generating realistic video based on a plurality of artificial intelligence techniques, comprising:
Step 1: receiving natural language data input by a user;
Step 2: preprocessing the natural language data;
Step 3: inputting the preprocessed natural language data into a pre-trained large language model to generate corresponding scene content;
Step 4: inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image;
Step 5: inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data.
2. The method for generating realistic video based on a plurality of artificial intelligence techniques according to claim 1, wherein in step 2, the large language model acquires, during training, a large amount of text data and corresponding real video data from various videos as a training set.
3. The method for generating realistic video based on a plurality of artificial intelligence techniques according to claim 2, wherein the text data and the real video data are subjected to cutting, labeling, and normalization processing.
4. The method for generating realistic video based on a plurality of artificial intelligence techniques according to claim 2, wherein during training the large language model improves its accuracy by adjusting the size of the corpus and accelerates its convergence, thereby reducing training time.
5. The method for generating realistic video based on a plurality of artificial intelligence techniques according to claim 1, wherein in step 3, the input parameters of the stable diffusion model include text data and corresponding video data, and wherein the text data is data optimized using the trained large language model.
6. The method for generating realistic video based on a plurality of artificial intelligence techniques according to claim 1, wherein in step 3, during training the stable diffusion model changes the style and quality of the image through the number of diffusion steps, the sampling rate, and the noise intensity.
7. A system for generating realistic video based on a plurality of artificial intelligence techniques, comprising:
the data receiving module is used for receiving natural language data input by a user;
the data processing module is used for preprocessing the natural language data;
The scene content generation module is used for inputting the preprocessed natural language data into a pre-trained large-scale language model to generate corresponding scene content;
the image generation module is used for inputting the scene content into a pre-trained stable diffusion model to generate a corresponding high-resolution image;
And the video generation module is used for inputting the scene content and the high-resolution image into a pre-trained XDSora model to generate realistic video data.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any of claims 1 to 6.
CN202410383367.XA 2024-03-29 2024-03-29 Method and system for generating realistic video based on multiple artificial intelligence technologies Pending CN118138855A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410383367.XA CN118138855A (en) 2024-03-29 2024-03-29 Method and system for generating realistic video based on multiple artificial intelligence technologies

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410383367.XA CN118138855A (en) 2024-03-29 2024-03-29 Method and system for generating realistic video based on multiple artificial intelligence technologies

Publications (1)

Publication Number Publication Date
CN118138855A 2024-06-04

Family

ID=91237712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410383367.XA Pending CN118138855A (en) 2024-03-29 2024-03-29 Method and system for generating realistic video based on multiple artificial intelligence technologies

Country Status (1)

Country Link
CN (1) CN118138855A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118507087A (en) * 2024-07-22 2024-08-16 浪潮云信息技术股份公司 Medical industry information transfer method, device, equipment and medium based on large model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination