
CN114007135B - Video frame insertion method and device, equipment, medium and product thereof - Google Patents

Video frame insertion method and device, equipment, medium and product thereof

Info

Publication number
CN114007135B
Authority
CN
China
Prior art keywords: frame, image, optical flow, frame images, images
Prior art date
Legal status
Active
Application number
CN202111267436.3A
Other languages
Chinese (zh)
Other versions
CN114007135A (en)
Inventor
叶艾彦
戴长军
丘文威
冯进亨
Current Assignee
Guangzhou Huanju Mark Network Information Co ltd
Original Assignee
Guangzhou Huaduo Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Huaduo Network Technology Co Ltd
Priority to CN202111267436.3A
Publication of CN114007135A
Application granted
Publication of CN114007135B
Legal status: Active
Anticipated expiration

Classifications

    • H04N21/440281: Client-side processing of video elementary streams involving reformatting operations for household redistribution, storage or real-time display by altering the temporal resolution, e.g. by frame skipping
    • H04N21/234381: Server-side processing of video elementary streams involving reformatting operations for distribution or compliance with end-user requests or end-user device requirements by altering the temporal resolution, e.g. decreasing the frame rate by frame skipping
    • H04N7/0127: Conversion of standards by changing the field or frame frequency of the incoming video signal, e.g. frame rate converter
    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/253: Pattern recognition; fusion techniques of extracted features
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video frame interpolation method, together with a corresponding apparatus, device, medium and program product. The method comprises the following steps: acquiring a target video to be subjected to frame interpolation and extracting two temporally consecutive reference frame images from the target video; calculating, with a pre-trained optical flow prediction model, optical flow prediction vectors of a transition frame image between the two reference frame images relative to each of them; generating, with a pre-trained frame interpolation synthesis model, residual information of the transition frame image from the optical flow prediction vectors and the image feature vectors of the two reference frame images, the residual information comprising residual values and image mapping weights; and, with the frame interpolation synthesis model referring to the two reference frame images, generating the transition frame image from these vectors and the residual information and inserting it between the two reference frame images for playback. The method interpolates frames into the target video in an end-to-end manner, thereby improving its display quality, and has broad application prospects.

Description

Video frame insertion method and device, equipment, medium and product thereof
Technical Field
The present application relates to video image processing technologies, and in particular, to a video frame interpolation method and a corresponding apparatus, computer device, computer readable storage medium, and computer program product.
Background
In order to improve the perceived image quality of a video, a video frame interpolation technique can be used to supplement transition frames for a video stream with a low frame rate, so that the stream shows a smooth display effect when played. In the prior art there are multiple ways to interpolate frames into video to improve image quality, and different ways yield different display effects; a widely used approach performs feature extraction and optical flow prediction on the video images by means of a convolutional neural network model and then generates the transition frames.
The applicant has found that many known network architectures in the prior art do not achieve an ideal playing effect when frames are interpolated into video images, and therefore hopes, through exploration, to make a corresponding contribution to the prior art so as to achieve a better video playing effect.
Disclosure of Invention
A primary object of the present application is to solve at least one of the above problems and to provide a video frame interpolation method, together with a corresponding apparatus, computer device, computer-readable storage medium and computer program product.
To serve the various purposes of the application, the following technical solutions are adopted:
a video frame interpolation method adapted to one of the objects of the present application includes the steps of:
acquiring a target video to be subjected to frame interpolation processing, and extracting two reference frame images which are continuous in a time domain from the target video;
calculating an optical flow prediction vector of a transition frame image between the two reference frame images relative to the two reference frame images by a pre-trained optical flow prediction model;
generating residual error information of the transition frame image according to the optical flow prediction vector and the image feature vectors of the two reference frame images by a pre-trained frame interpolation synthesis model, wherein the residual error information comprises residual error values and image mapping weights;
and the pre-trained frame insertion synthesis model refers to the two reference frame images, the transition frame image is generated according to each vector and the residual error information, and the transition frame image is inserted between the two reference frame images for playing.
In a deepened embodiment, acquiring a target video to be frame-inserted, and extracting two reference frame images which are continuous in a time domain from the target video, includes the following steps:
acquiring frame rate data of a video to be played;
comparing the frame rate data with a preset frame rate threshold;
when the frame rate value represented by the frame rate data is smaller than the frame rate threshold value, determining that the video to be played is the target video;
and extracting two reference frame images along the time domain of the target video for frame interpolation processing.
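A minimal sketch of the four steps above, assuming OpenCV for decoding and an illustrative threshold value (neither the library nor the numeric threshold is prescribed by the patent):

```python
import cv2

FRAME_RATE_THRESHOLD = 30.0  # assumed threshold; the patent leaves the value to configuration


def extract_reference_pairs(video_path: str):
    """Yield consecutive (first_ref, second_ref) frame pairs if the video qualifies."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)          # frame rate data of the video to be played
    if fps >= FRAME_RATE_THRESHOLD:
        cap.release()
        return                               # frame rate is high enough; no interpolation needed
    ok, prev = cap.read()
    while ok:
        ok, cur = cap.read()
        if not ok:
            break
        yield prev, cur                      # two temporally consecutive reference frame images
        prev = cur
    cap.release()
```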
In a further embodiment, calculating, by a pre-trained optical flow prediction model, the optical flow prediction vectors of a transition frame image between two reference frame images relative to the two reference frame images comprises the following steps:
superimposing the channel images of the two reference frame images to generate a superimposed image;
performing convolution and pooling on the superimposed image through convolution layers to generate down-sampling features;
interpolating the down-sampling features through deconvolution layers to generate up-sampling features;
and fusing and superimposing the down-sampling features and the up-sampling features to generate the optical flow prediction vectors of the transition frame image relative to the two reference frame images.
In a further embodiment, generating, by a pre-trained frame interpolation synthesis model, the residual information of the transition frame image from the optical flow prediction vectors and the image feature vectors of the two reference frame images comprises the following steps:
extracting the image feature vectors of the two reference frame images with a pre-trained image feature extraction model;
calculating, by the frame interpolation synthesis model, the residual values corresponding to the transition frame image from the two image feature vectors and the corresponding optical flow prediction vectors;
and synthesizing, by the frame interpolation synthesis model with the two reference frame images as references, a mask map representing the image mapping weights.
In a further embodiment, generating, by the pre-trained frame interpolation synthesis model with reference to the two reference frame images, the transition frame image from the vectors and the residual information, and inserting it between the two reference frame images for playback, comprises the following steps:
performing the corresponding image transformation on each of the two reference frame images according to the optical flow prediction vectors to obtain two mapping frame images;
smoothly synthesizing the two mapping frame images with the image mapping weights as a hyper-parameter to obtain a fused frame image;
and superimposing the residual values onto the fused frame image to obtain the transition frame image.
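Written compactly (the notation is introduced here for clarity rather than taken from the patent), with I0 and I1 the two reference frame images, F the optical flow prediction vectors, W the optical-flow image transformation, M the image mapping weight (mask map) and R the residual value, the three steps above can be read as the mask-weighted blend

$$\hat{I}_t \;=\; M \odot \mathcal{W}\!\left(I_0, F_{t\to 0}\right) \;+\; \left(1 - M\right) \odot \mathcal{W}\!\left(I_1, F_{t\to 1}\right) \;+\; R,$$

where ⊙ denotes element-wise multiplication; interpreting the smooth synthesis with the image mapping weight as this complementary weighting is an assumption consistent with the mask map described in the embodiments below.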
In an extended embodiment, the optical flow prediction model and the frame interpolation synthesis model are jointly trained, and the training process comprises the following steps:
performing framing processing on a pre-acquired sample video to generate a sample atlas, wherein the sample atlas comprises two training frame images and a sample frame image, the sample frame image lying within the time interval spanned by the two training frame images;
inputting the two training frame images into an optical flow calculation model pre-trained to a convergence state to calculate the real optical flow vectors of the two training frame images;
inputting the two training frame images into the optical flow prediction model being trained to calculate the optical flow prediction vectors of the transition frame image relative to the two training frame images;
inputting the two training frame images into an image feature extraction model pre-trained to a convergence state to obtain the two corresponding image feature vectors;
inputting the two training frame images, their image feature vectors and the optical flow prediction vectors into the frame interpolation synthesis model being trained to calculate the residual information and obtain the corresponding transition frame image;
and calculating a loss value between the transition frame image and the sample frame image according to a preset loss function, and continuing iterative training when the loss value is greater than a preset loss threshold, wherein the loss value is a weighted sum of a plurality of difference terms comprising: the loss difference between the optical flow prediction vectors and the real optical flow vectors, the mean square error of the semantic features between the sample frame image and the transition frame image, and the absolute error between the sample frame image and the mapping frame image calculated from the residual information.
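Using I_t for the sample frame image, Î_t for the generated transition frame image, F_pred and F_real for the predicted and real optical flow vectors, and φ for a semantic feature extractor, the weighted loss described above can be sketched as

$$\mathcal{L} \;=\; \lambda_{1}\,\big\lVert F_{\text{pred}} - F_{\text{real}} \big\rVert \;+\; \lambda_{2}\,\big\lVert \phi(\hat{I}_t) - \phi(I_t) \big\rVert_2^{2} \;+\; \lambda_{3}\,\big\lVert \hat{I}_t - I_t \big\rVert_1,$$

where the weights λ1, λ2, λ3, the norm used for the optical flow term and the concrete choice of φ are not fixed by the text above and are therefore assumptions of this sketch.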
A video frame interpolation apparatus adapted to one of the objects of the present application comprises a reference module, an optical flow prediction module, a residual generation module and a frame interpolation synthesis module. The reference module is used for acquiring a target video to be subjected to frame interpolation and extracting two temporally consecutive reference frame images from the target video; the optical flow prediction module is used for calculating, by a pre-trained optical flow prediction model, the optical flow prediction vectors of a transition frame image between the two reference frame images relative to the two reference frame images; the residual generation module is used for generating, by a pre-trained frame interpolation synthesis model, the residual information of the transition frame image from the optical flow prediction vectors and the image feature vectors of the two reference frame images, the residual information comprising residual values and image mapping weights; and the frame interpolation synthesis module is used for generating, by the pre-trained frame interpolation synthesis model with reference to the two reference frame images, the transition frame image from the vectors and the residual information, and inserting it between the two reference frame images for playback.
In a further embodiment, the reference module comprises: the frame rate acquisition submodule is used for acquiring frame rate data of a video to be played; the frame rate comparison submodule is used for comparing the frame rate data with a preset frame rate threshold; the video determining submodule is used for determining the video to be played as the target video when the frame rate value represented by the frame rate data is smaller than the frame rate threshold value; and the frame interpolation starting submodule is used for extracting two reference frame images along the time domain of the target video so as to perform frame interpolation processing.
In a further embodiment, the optical flow prediction module comprises: the channel merging submodule is used for generating a superimposed image after the two reference frame images are superimposed; the convolution pooling submodule is used for performing convolution pooling processing on the superposed image through a convolution layer to generate down-sampling characteristics; the deconvolution pooling submodule is used for carrying out interpolation processing on the down-sampling features through a deconvolution layer to generate up-sampling features; and the prediction generation sub-module is used for performing feature fusion and superposition on the down-sampling features and the up-sampling features to generate optical flow prediction vectors of the transition frame image relative to the two reference frame images.
In a further embodiment, the residual generating module comprises: the feature extraction sub-module is used for extracting image feature vectors of the two reference frame images by a pre-trained image feature extraction model; the residual error calculation submodule is used for calculating the residual error value corresponding to the transition frame image according to the two image feature vectors and the corresponding optical flow prediction vectors by the interpolation frame synthesis model; and the information output submodule is used for synthesizing a mask image for representing the image mapping weight by using the two reference frame images as reference through the frame insertion synthesis model.
In a further embodiment, the frame-interpolation synthesis module comprises: the image transformation sub-module is used for respectively carrying out corresponding image transformation on the two reference frame images according to the optical flow prediction vector to obtain two mapping frame images; the smooth synthesis sub-module is used for performing smooth synthesis on the two mapping frame images by taking the image mapping weight as a hyper-parameter to obtain a fusion frame image; and the fusion generation submodule is used for superposing the fusion frame image on the residual value to obtain a transition frame image.
In an extended embodiment, the optical flow prediction model and the frame interpolation synthesis model are jointly trained, and the training apparatus comprises: an atlas generation module, used for performing framing processing on a pre-acquired sample video to generate a sample atlas, wherein the sample atlas comprises two training frame images and a sample frame image, the sample frame image lying within the time interval spanned by the two training frame images; an optical flow calculation module, used for inputting the two training frame images into an optical flow calculation model pre-trained to a convergence state to calculate the real optical flow vectors of the two training frame images; an optical flow prediction module, used for inputting the two training frame images into the optical flow prediction model being trained to calculate the optical flow prediction vectors of the transition frame image relative to the two training frame images; a feature extraction module, used for inputting the two training frame images into an image feature extraction model pre-trained to a convergence state to obtain the two corresponding image feature vectors; a synthesis generation module, used for inputting the two training frame images, their image feature vectors and the optical flow prediction vectors into the frame interpolation synthesis model being trained to calculate the residual information and obtain the corresponding transition frame image; and a gradient updating module, used for calculating a loss value between the transition frame image and the sample frame image according to a preset loss function and continuing iterative training when the loss value is greater than a preset loss threshold, wherein the loss value is a weighted sum of a plurality of difference terms comprising: the loss difference between the optical flow prediction vectors and the real optical flow vectors, the mean square error of the semantic features between the sample frame image and the transition frame image, and the absolute error between the sample frame image and the mapping frame image calculated from the residual information.
A computer device adapted to one of the purposes of the present application comprises a central processing unit and a memory, the central processing unit being configured to invoke and run a computer program stored in the memory so as to perform the steps of the video frame interpolation method described herein.
A computer-readable storage medium is provided which stores, in the form of computer-readable instructions, a computer program implementing the video frame interpolation method; when the program is invoked by a computer, it performs the steps included in the method.
A computer program product, provided to meet another object of the present application, comprises computer programs/instructions which, when executed by a processor, implement the steps of the method described in any of the embodiments of the present application.
Compared with the prior art, the application has the following advantages:
the method comprises the steps of utilizing every two time-domain continuous reference frame images in a target video to predict optical flow prediction vectors of the transition frame images relative to the two reference frame images, determining residual error information corresponding to the optical flow prediction vectors and image feature vectors of the two reference frames, referring to the two reference frame images to generate a transition frame image between the two reference frame images according to the residual error information, inserting the transition frame image between the two reference frame images for playing, and completing the frame inserting process. The process is based on an end-to-end mechanism, frame interpolation can be achieved only by inputting a target video, and because a transition frame image is generated according to residual error information in the frame interpolation process, the residual error information can be generated according to the character, namely an optical flow prediction vector, and the calculation of the residual error information is more efficient, the related network architecture of the application is achieved, the transition frame image with higher quality can be obtained in the production stage for improving the playing quality of the target video, the fast convergence is easier in the training stage, the training efficiency of the network architecture is improved, and the training cost is effectively reduced.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart illustrating an exemplary embodiment of a video frame interpolation method according to the present application;
FIG. 2 is a flowchart illustrating a process of extracting two reference frame maps according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating an embodiment of obtaining optical flow prediction vectors from two reference frame maps;
fig. 4 is a schematic flow chart of a process of obtaining residual error information in the embodiment of the present application;
FIG. 5 is a flowchart illustrating a process of performing image transformation to generate a transition frame map according to an embodiment of the present application;
FIG. 6 is a flow chart illustrating a training process of a network architecture for implementing the video frame interpolation method of the present application;
FIG. 7 is a schematic block diagram of a network architecture for implementing the video frame insertion method of the present application, wherein the dotted portion is enabled only during the training process and is disabled when the network architecture is put into production;
FIG. 8 is a schematic block diagram of a video frame interpolation apparatus of the present application;
fig. 9 is a schematic structural diagram of a computer device used in the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As will be appreciated by those skilled in the art, "client," "terminal," and "terminal device" as used herein include both devices that are wireless signal receivers, which are devices having only wireless signal receivers without transmit capability, and devices that are receive and transmit hardware, which have receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: cellular or other communication devices such as personal computers, tablets, etc. having a single line display or a multi-line display or cellular or other communication devices without a multi-line display; PCS (Personal Communications Service), which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "client," "terminal device" can be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. The "client", "terminal Device" used herein may also be a communication terminal, a web terminal, a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a Mobile phone with music/video playing function, and may also be a smart tv, a set-top box, and the like.
The hardware referred to by the names "server", "client", "service node", etc. is essentially an electronic device with the performance of a personal computer, and is a hardware device having necessary components disclosed by the von neumann principle such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device, an output device, etc., a computer program is stored in the memory, and the central processing unit calls a program stored in an external memory into the internal memory to run, executes instructions in the program, and interacts with the input and output devices, thereby completing a specific function.
It should be noted that the concept of "server" in the present application can be extended to the case of server cluster. According to the network deployment principle understood by those skilled in the art, the servers should be logically divided, and in physical space, the servers may be independent from each other but can be called through an interface, or may be integrated into one physical computer or a set of computer clusters. Those skilled in the art will appreciate this variation and should not be so limited as to restrict the implementation of the network deployment of the present application.
One or more technical features of the present application, unless expressly specified otherwise, may be deployed to a server for implementation by a client remotely invoking an online service interface provided by a capture server for access, or may be deployed directly and run on the client for access.
Unless specified in clear text, the neural network model referred to or possibly referred to in the application can be deployed in a remote server and used for remote call at a client, and can also be deployed in a client with qualified equipment capability for direct call.
Various data referred to in the present application may be stored in a server remotely or in a local terminal device unless specified in the clear text, as long as the data is suitable for being called by the technical solution of the present application.
The person skilled in the art will know this: although the various methods of the present application are described based on the same concept so as to be common to each other, they may be independently performed unless otherwise specified. In the same way, for each embodiment disclosed in the present application, it is proposed based on the same inventive concept, and therefore, concepts of the same expression and concepts of which expressions are different but are appropriately changed only for convenience should be equally understood.
The embodiments to be disclosed herein can be flexibly constructed by cross-linking related technical features of the embodiments unless the mutual exclusion relationship between the related technical features is stated in the clear text, as long as the combination does not depart from the inventive spirit of the present application and can meet the needs of the prior art or solve the deficiencies of the prior art. Those skilled in the art will appreciate variations to this.
The video frame insertion method can be programmed into a computer program product, is deployed in a client or a server to run, and is generally deployed in the server to implement, for example, in a network video live broadcast application scenario of the present application, so that the method can be executed by accessing an open interface after the computer program product runs and performing human-computer interaction with a process of the computer program product through a graphical user interface.
Referring to fig. 1, in an exemplary embodiment of a video frame interpolation method of the present application, the method includes the following steps:
step S1100, acquiring a target video to be frame-interpolated, and extracting two reference frame images which are continuous in a time domain in the target video:
the target video in this embodiment is a video to be processed selected for frame interpolation processing and for raising the video frame rate.
The target video can be a network video sent to the terminal through the server side, and can also be a local video stored locally in the terminal. According to different specific embodiments, in some embodiments, the video frame interpolation method in this embodiment can also be used to process video data uploaded by a terminal, where the target video is a video uploaded by the terminal.
The target video can be obtained by screening in advance, and the screening mode mainly comprises the following steps: and screening through the code rate or the frame rate. Specifically, when the target video is a network transmission video, the terminal reads the code rate of the video data at the network port after receiving the video data sent by the server, and when the code rate is lower than a preset code rate threshold, the video data is determined to be the target video. And when the video is the local video, the terminal reads the frame rate parameter of the video, and when the value represented by the frame rate parameter is smaller than the frame rate threshold value, the video data is determined to be the target video. In some embodiments, when the video frame insertion method is used for processing video data uploaded by a terminal, a server side reads a code rate of the data uploaded by the terminal, and determines that the uploaded video data is a target video when the code rate is lower than a preset code rate threshold.
After the target video is determined, two frame images which are continuous in a time domain in the target video are extracted, the two frame images are defined as a first reference frame image and a second reference frame image, and the first reference frame image and the second reference frame image are continuous in the time axis and are connected in sequence, so that a transition frame image is inserted into the two reference frame images through the implementation of the technical scheme of the application.
In some embodiments, the selection of the first reference frame map and the second reference frame map requires considering the requirement of scene transition. If the two reference frame pictures are different pictures of two scenes before and after the transition, it may not be necessary to insert the transition frame picture for the two reference frame pictures at this time. Therefore, the two reference frame images can be input into a pre-trained transition classification model for judgment, if two images before and after transition are judged, the images can not be subjected to frame interpolation, and if a transition relation between the two reference frame images is judged, the images can be subjected to frame interpolation. It can be understood that the transition classification model is pre-trained to a convergence state, and is suitable for judging whether an image natural transition relation exists between two reference frame images, so as to serve the needs of the application.
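A minimal sketch of this optional scene-transition gate, assuming a binary transition classification model `transition_model` that returns a single cut logit; the model interface, output shape and threshold are illustrative, not taken from the patent:

```python
import torch


def should_interpolate(transition_model, ref0: torch.Tensor, ref1: torch.Tensor,
                       cut_threshold: float = 0.5) -> bool:
    """Return False when the two reference frames straddle a scene cut."""
    with torch.no_grad():
        # Assumed interface: the classifier takes two (1, 3, H, W) tensors and
        # returns one logit; sigmoid turns it into a cut probability.
        p_cut = transition_model(ref0.unsqueeze(0), ref1.unsqueeze(0)).sigmoid().item()
    return p_cut < cut_threshold  # interpolate only for a natural, same-scene transition
```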
Step S1200, calculating, by the pre-trained optical flow prediction model, optical flow prediction vectors of the transition frame image between the two reference frame images relative to the two reference frame images:
the optical flow prediction model is pre-trained to a convergence state and used for calculating optical flow prediction vectors of a transition frame image between two reference frame images relative to the two reference frame images according to the two reference frame images. The optical flow prediction vectors are used to characterize motion vectors between motion pixels in the transition frame map relative to corresponding motion pixels of the two reference frame maps. Thus, a first optical-flow prediction vector and a second optical-flow prediction vector are included, respectively corresponding to the first reference frame image and the second reference frame image, of the transition frame image.
The optical flow prediction model is realized by adopting a neural network model based on a convolutional layer, and specifically, the optical flow prediction model superposes pixels of a first reference frame image and a second reference frame image, when the pixels are superposed, the image sizes of the first reference frame image and the second reference frame image are adjusted to be consistent, two reference frame images are respectively split into three color channels according to RGB colors, the three color channels are respectively a red channel, a green channel and a blue channel, then, the channel colors are taken as categories, images in the categories are weighted and superposed, and after the three channels are respectively superposed, the superposed three channel images are combined to generate a superposed image.
And extracting a motion vector between the transition frame and the first reference frame image relative to the second reference frame image from the superposed image, so that an optical flow prediction vector between the first reference frame image and the second reference frame image is obtained after the superposed image is subjected to feature extraction through a convolution layer of an optical flow prediction model, and the optical flow prediction vector represents a change state between the first reference frame image and the second reference frame image.
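A minimal sketch of the channel-wise weighted superposition described above, with illustrative equal weights (the patent specifies only that images of the same colour channel are weighted and superimposed; tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F


def superimpose(ref0: torch.Tensor, ref1: torch.Tensor,
                w0: float = 0.5, w1: float = 0.5) -> torch.Tensor:
    """Channel-wise weighted superposition of two RGB reference frames of shape (3, H, W)."""
    # Adjust the second frame to the size of the first so the pixels align.
    if ref0.shape[1:] != ref1.shape[1:]:
        ref1 = F.interpolate(ref1.unsqueeze(0), size=ref0.shape[1:],
                             mode="bilinear", align_corners=False).squeeze(0)
    channels = []
    for c in range(3):                        # red, green, blue channels in turn
        channels.append(w0 * ref0[c] + w1 * ref1[c])
    return torch.stack(channels, dim=0)       # merged 3-channel superimposed image
```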
Step S1300, generating, by a pre-trained frame interpolation synthesis model, residual information of the transition frame image from the optical flow prediction vectors and the image feature vectors of the two reference frame images, the residual information comprising residual values and image mapping weights:
A pre-trained frame interpolation synthesis model is used to generate the residual information between the first reference frame image and the second reference frame image, so that the corresponding transition frame image can then be generated from the residual information.
The frame interpolation synthesis model is trained to a convergence state in advance, so that it has learned the ability to generate the residual information of the transition frame image from the optical flow prediction vectors and the image feature vectors of the first and second reference frame images.
The image feature vectors of the first and second reference frame images may be extracted with a convolutional neural network model pre-trained to a convergence state, for example a model based on the ResNet architecture or on the EfficientNet architecture, as long as it can produce a feature representation of the images.
The image feature vectors extracted from the two reference frame images and the optical flow prediction vectors are input into the frame interpolation synthesis model together with the original images of the two reference frames to calculate the residual information. It can therefore be understood that the frame interpolation synthesis model can also be implemented as a neural network model based on residual convolution, for example one based on the ResNet architecture.
The frame interpolation synthesis model calculates the residual information of the transition frame image from the two image feature vectors and their corresponding optical flow prediction vectors; the residual information comprises residual values and image mapping weights. The residual values form a vector representing the image difference, computed from the optical flow prediction vectors, of the transition frame image relative to the two reference frame images. The image mapping weights correspond to a binarized mask map generated by the synthesis model: the model computes, from the image feature vectors of the two reference frame images and the optical flow prediction vectors, the mask map of the optical-flow mapping in the transition frame image, and the binarized pixel values of this mask map serve as the image mapping weights used later to generate the transition frame image.
The frame interpolation synthesis model can be realized with a network architecture based on the image segmentation principle, for example the U-Net family: on the basis of combining the image feature vectors of the two reference frame images with the corresponding optical flow prediction vectors, mask maps at multiple scales are obtained by successive down-sampling and up-sampling, and these are finally fully connected to produce the output mask map that constitutes the image mapping weights. It should be understood that the U-Net architecture likewise applies the residual principle.
Step S1400, generating, by the pre-trained frame interpolation synthesis model with reference to the two reference frame images, the transition frame image from the vectors and the residual information, and inserting the transition frame image between the two reference frame images for playing:
Finally, the frame interpolation synthesis model takes the two reference frame images as the reference for restoring the image content of the transition frame image: it computes the image content of the transition frame image relative to the two reference frame images according to the mask map in the residual information and then superimposes the residual values in the residual information to obtain the transition frame image.
The transition frame image obtained in this way is inserted between the first reference frame image and the second reference frame image for playing. Taking this as the basic procedure, all steps of the method are executed cyclically along the time domain of the target video, interpolating a frame between every two consecutive reference frame images, which improves the image quality of the target video and makes its playback smoother.
In summary, for every two temporally consecutive reference frame images of the target video, the optical flow prediction vectors of a transition frame image relative to the two reference frame images are predicted, the residual information corresponding to the transition frame image is determined from these vectors and the image feature vectors of the two reference frames, the transition frame image is generated with reference to the two reference frame images according to the residual information, and it is inserted between them for playing, completing the frame interpolation. The process is based on an end-to-end mechanism: frame interpolation is achieved simply by feeding in the target video, and because the transition frame image is generated from residual information derived from the optical flow prediction vectors, the computation of the residual information is efficient.
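As one non-binding realisation of step S1400, the image transformation, mask-weighted fusion and residual superposition might be sketched as follows in PyTorch; the backward-warping operator, tensor layouts and the flow channel order are assumptions rather than details fixed by the patent:

```python
import torch
import torch.nn.functional as F


def backward_warp(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp an image (N, C, H, W) by a flow field (N, 2, H, W); channel 0 = dx, 1 = dy (assumed)."""
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float().to(image.device)      # (H, W, 2) pixel grid
    grid = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)                # shift by the predicted flow
    gx = 2.0 * grid[..., 0] / (w - 1) - 1.0                            # normalise to [-1, 1]
    gy = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(image, torch.stack((gx, gy), dim=-1), align_corners=True)


def synthesize_transition(ref0, ref1, flow_t0, flow_t1, mask, residual):
    """Mask-weighted fusion of the two mapping frame images plus the residual values.

    mask is assumed to be (N, 1, H, W) in [0, 1]; residual is assumed to be (N, 3, H, W).
    """
    mapped0 = backward_warp(ref0, flow_t0)
    mapped1 = backward_warp(ref1, flow_t1)
    fused = mask * mapped0 + (1.0 - mask) * mapped1   # smooth synthesis with the mapping weight
    return fused + residual                           # superimpose the residual to get the transition frame
```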
Referring to fig. 2, in a further embodiment, the step S1100 of obtaining a target video to be frame-interpolated, and extracting two reference frame images that are consecutive in a time domain in the target video includes the following steps:
step S1110, acquiring frame rate data of a video to be played:
and when the user terminal plays the video to be played through the instruction, reading the frame rate data of the video to be played. The video to be played in the embodiment includes a network video sent by the server and a local video stored in the local storage space of the user terminal.
Step S1120, comparing the frame rate data with a predetermined frame rate threshold:
the method comprises the steps of comparing the obtained frame rate data with a preset frame rate threshold, wherein the numerical value setting of the frame rate threshold can be set according to the lowest standard of a video playing frame rate and also can be set according to the original video frame rate of a video to be played.
Step S1130, when the frame rate value represented by the frame rate data is smaller than the frame rate threshold, determining that the video to be played is the target video:
and when the frame rate value represented by the frame rate data is less than the frame rate threshold, determining that the video to be played is the target video needing frame interpolation operation. And when the frame rate value represented by the frame rate data is greater than or equal to the frame rate threshold, determining that the video to be played does not need to be subjected to interpolation processing.
In some embodiments, when a pause appears in a playing video, a video in a time period in which the pause video is located is intercepted as a target video, and frame insertion processing is performed on the target video, so that a video pause phenomenon is eliminated.
In some embodiments, the frame interpolation model includes a motion vector network model for extracting motion vectors of the first reference frame picture and the second reference frame picture.
Step S1140, extracting two reference frame images along the time domain of the target video for frame interpolation:
after the target video is determined, the subsequent steps of the method can be executed along the time domain to perform frame interpolation processing. The two-by-two reference frame images refer to two consecutive video frames of the target video in the time domain, so that it can be understood that each video frame is used as a second reference frame image to perform associated frame interpolation with a first reference frame image which is adjacent to the second reference frame image in the time domain in advance, and is also used as a first reference frame image to perform associated frame interpolation with a second reference frame image which is adjacent to the first reference frame image in the time domain in the later.
According to the method and the device, whether frame insertion operation is started for the video to be played is determined by identifying the frame rate data of the video to be played, a decision whether frame insertion is started can be made in a self-adaptive mode according to network transmission conditions and video quality, intelligent image enhancement processing is automatically carried out on the video stream, the phenomenon of low-quality playing such as unsmooth and unsmooth image blocking is effectively eliminated, and therefore user experience is improved.
Referring to fig. 3, in a further embodiment, the step S1200 of calculating an optical flow prediction vector of a transition frame image between two reference frame images relative to the two reference frame images by using a pre-trained optical flow prediction model includes the following steps:
step S1210, generating a superimposed image after superimposing the channel images of the two reference frame images:
and performing pixel superposition on the first reference frame image and the second reference frame image, wherein the image sizes of the first reference frame image and the second reference frame image are adjusted to be consistent when the pixels are superposed, the two reference frame images are respectively split into three color channels according to RGB colors, the three color channels are respectively a red channel, a green channel and a blue channel, then, the channel colors are taken as categories, the images in the same category are subjected to weighted superposition, and after the three channels are respectively superposed, the superposed three channel images are combined to generate a superposed image.
And inputting the superposed images into an optical flow prediction model, wherein the optical flow prediction model is a convolution neural network model which is trained to be convergent in advance and used for extracting motion vectors between the images.
In some embodiments, the optical flow prediction model uses the following models: u-net network model. The U-net network structure includes two symmetric parts: the former part of the network is the same as the common convolution network, and uses convolution and pooling downsampling of 3x3, and can grasp the context information in the image; the rear part of the network is basically symmetrical to the front part, and a 3x3 deconvolution layer and upsampling are used to achieve the purpose of output image segmentation. In addition, feature fusion is used in the network, and features of a down-sampling network at the front part and features of an up-sampling part at the back part are fused to obtain more accurate context information, so that a better segmentation effect is achieved. In some embodiments, the optical flow prediction model can also be a U2-net network model. Typically, the optical flow prediction model can be implemented by using a Flownet, which also applies the U-net architecture and is more suitable for the frame interpolation process.
In some embodiments, the optical flow prediction model can also be (without limitation): a convolutional neural network model, a deep convolutional neural network model, a recurrent neural network model, or a variant of the above neural network models.
Step S1220, performing convolution pooling on the superimposed image by the convolution layer to generate a downsampled feature:
after the superimposed image is input into the optical flow prediction model, the convolution layer in the optical flow prediction model performs convolution and pooling on the superimposed image, and extracts the down-sampling feature in the superimposed image.
Step S1230, performing interpolation processing on the downsampled feature by the deconvolution layer to generate an upsampled feature:
after the feature extraction and the reduction of the superimposed image are carried out through the convolutional layer, the optical flow prediction model carries out interpolation processing on the reduced image through a deconvolution layer which is symmetrical to the convolutional layer, the up-sampling feature of the superimposed image is simultaneously extracted in the process of the interpolation processing, the processing process is up-sampling, and the image feature is extracted in the process of the up-sampling in an interpolation processing mode and the reduced superimposed image is enlarged.
Step S1240, carrying out feature fusion and superposition on the downsampling features and the upsampling features to generate optical flow prediction vectors of the transition frame image relative to the two reference frame images:
after convolution and deconvolution are carried out on the optical flow prediction model, downsampling features and upsampling features of the superposed images are generated, then the downsampling features and the upsampling features are fused and superposed, and the fusion and superposition process is to carry out weighting on corresponding features of the convolution and deconvolution images to obtain a fused motion vector.
Specifically, the optical flow prediction model comprises a first, second and third convolution layer and a first, second and third deconvolution layer, where the first convolution layer and the first deconvolution layer are symmetric to each other, as are the second pair and the third pair. After the first convolution layer extracts features from the superimposed image, it passes them both to the second convolution layer and to the first deconvolution layer; after the second convolution layer extracts features, it passes them to the third convolution layer and to the second deconvolution layer; and so on, so that after the superimposed image has traversed the U-shaped feature extraction path, the third deconvolution layer finally outputs the optical flow prediction vectors. During this process, each deconvolution layer receives both the features passed on by the preceding layer and the features passed across from its corresponding convolution layer, so that the features of the down-sampling path and of the up-sampling path are fused to obtain more accurate context information.
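A PyTorch-style sketch of such a symmetric three-level convolution/deconvolution network with feature fusion is given below; the exact layer indexing and skip wiring, channel counts, kernel sizes and the 4-channel output (two 2-D flow fields, one per reference frame) are assumptions rather than details fixed by the description:

```python
import torch
import torch.nn as nn


class FlowPredictorSketch(nn.Module):
    """Illustrative U-shaped flow predictor: three down-sampling convolution
    levels, three up-sampling deconvolution levels, with skip connections that
    fuse symmetric features."""

    def __init__(self, in_ch: int = 3):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.down3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.up1 = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU())
        self.up2 = nn.Sequential(nn.ConvTranspose2d(64 + 64, 32, 4, stride=2, padding=1), nn.ReLU())
        self.up3 = nn.ConvTranspose2d(32 + 32, 4, 4, stride=2, padding=1)

    def forward(self, superimposed: torch.Tensor) -> torch.Tensor:
        d1 = self.down1(superimposed)                 # down-sampling features, level 1
        d2 = self.down2(d1)                           # down-sampling features, level 2
        d3 = self.down3(d2)                           # bottom of the U-shaped path
        u1 = self.up1(d3)                             # up-sampling features at level-2 scale
        u2 = self.up2(torch.cat([u1, d2], dim=1))     # fuse with the symmetric conv features
        flow = self.up3(torch.cat([u2, d1], dim=1))   # last deconvolution layer outputs the flow
        return flow                                   # (N, 4, H, W): two optical flow prediction vectors
```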
After the optical flow prediction vectors of the first reference frame image and the second reference frame image are obtained by the optical flow prediction model, the optical flow prediction vectors are input into the frame interpolation synthesis model for further processing.
In the embodiment, the pre-trained optical flow prediction model extracts the optical flow prediction vectors of the transition frame relative to the two reference frame images in the process of performing down-sampling and up-sampling on the two reference frame images, thereby laying a foundation for subsequent frame interpolation.
Referring to fig. 4, in a further embodiment, in the step S1300, the step of generating residual information of the transition frame map by a pre-trained frame-interpolation synthesis model according to the optical flow prediction vectors and the image feature vectors of the two reference frame maps includes the following steps:
step S1310, extracting image feature vectors of the two reference frame images by the pre-trained image feature extraction model:
In this embodiment, the image feature vectors corresponding to the two reference frame images may be extracted first, using a pre-trained image feature extraction model. The image feature extraction model is likewise based on a convolutional neural network and is preferably constructed from a model with a residual architecture, as will be appreciated by those skilled in the art.
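As a hedged illustration only, a residual-architecture backbone such as ResNet-18 from torchvision could serve as this image feature extraction model; the specific backbone and truncation point are assumptions, and in production a model pre-trained to a convergence state would be loaded rather than the random weights used here.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Truncate a ResNet-18 before its pooling/classification head so it outputs feature maps.
backbone = models.resnet18(weights=None)  # in practice: load weights pre-trained to convergence
feature_extractor = nn.Sequential(*list(backbone.children())[:-2]).eval()

with torch.no_grad():
    frame0 = torch.rand(1, 3, 256, 256)   # first reference frame image (placeholder data)
    frame1 = torch.rand(1, 3, 256, 256)   # second reference frame image (placeholder data)
    feat0 = feature_extractor(frame0)     # image feature vector (feature map) of frame 0
    feat1 = feature_extractor(frame1)     # image feature vector (feature map) of frame 1
```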
Step S1320, calculating the residual value corresponding to the transition frame image according to the two image feature vectors and the corresponding optical flow prediction vectors by the frame interpolation synthesis model:
On the basis of the image feature vectors of the first and second reference frame images, and with the optical flow prediction vectors of the transition frame image relative to the two reference frame images already obtained, the residual information required to generate the transition frame image can be calculated with a residual convolutional neural network model. It will be appreciated that this network model has been adapted through training to calculate the residual values between the transition frame map and the two reference frame maps.
Step S1330, synthesizing a mask map for representing the image mapping weight by using the frame interpolation synthesis model with reference to the two reference frame maps:
The frame interpolation synthesis model can likewise be realized with a U-net architecture. Accordingly, after the superimposed image of the two reference frame images is fused with the optical flow prediction vector and subjected to multi-scale down-sampling and up-sampling, mask maps at multiple scales are obtained; these are then fully connected to yield the mask map that finally corresponds to the transition frame image, which is used subsequently so that the transition frame image is ultimately generated for frame interpolation.
In this embodiment, the frame interpolation synthesis model performs residual calculation on the outputs of the other models to obtain the residual information required to generate the transition frame map, namely the residual value and a mask map. The mask map is in effect a pixel-wise rendering of the image mapping weights, which are later used as hyper-parameters for smoothing over the two reference frame maps to generate the corresponding transition frame map. Because the residual information is computed with a residual convolutional neural network model, the corresponding network architecture is easier to train and converges faster, and the resulting residual information expresses more accurately the information needed to generate the transition frame image.
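To make the two outputs concrete, the following sketch shows one plausible pair of output heads for the frame interpolation synthesis model: a residual head and a mask head whose output is squashed to the 0-to-1 range. The fused-feature width and the use of a sigmoid are assumptions for illustration, not taken from the description.

```python
import torch
import torch.nn as nn

class SynthesisHeads(nn.Module):
    """Output heads of a U-net-style synthesis model: residual value + mask map of mapping weights."""
    def __init__(self, fused_channels=64):
        super().__init__()
        self.residual_head = nn.Conv2d(fused_channels, 3, kernel_size=3, padding=1)
        self.mask_head = nn.Conv2d(fused_channels, 1, kernel_size=3, padding=1)

    def forward(self, fused_features):
        res = self.residual_head(fused_features)               # residual value of the transition frame
        mask = torch.sigmoid(self.mask_head(fused_features))   # image mapping weights in [0, 1]
        return res, mask

res, mask = SynthesisHeads()(torch.rand(1, 64, 256, 256))
```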
In the embodiment shown in fig. 5, in step S1400, the two reference frame images are referenced by the pre-trained frame interpolation synthesis model, the transition frame image is generated according to the optical flow prediction vector and the residual information, and the transition frame image is inserted between the two reference frame images for playing; this includes the following steps:
step S1410, performing corresponding image transformation on the two reference frame images according to the optical flow prediction vector, to obtain two mapping frame images:
First, the two mapping frame images can be generated by image transformation; please refer to the following formulas:
warp_0 = warp(I_0, F_0)
warp_1 = warp(I_1, F_1)
wherein I_0 and I_1 denote the first and second reference frame images respectively, F_0 and F_1 denote the optical flow prediction vectors of the transition frame map relative to the first and second reference frame images, and warp_0 and warp_1 are the two mapping frame images obtained for the transition frame image relative to the first and second reference frame images.
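A minimal realization of the warp(·) operation above is backward warping with bilinear sampling, sketched below; the use of torch.nn.functional.grid_sample, border padding, and the channel ordering of the flow are implementation assumptions rather than requirements of the description.

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Warp an image (N, C, H, W) with per-pixel displacements flow (N, 2, H, W).

    Channel 0 of flow is assumed to be the horizontal (x) displacement, channel 1 the vertical (y).
    """
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(image.device)   # pixel coordinate grid (2, H, W)
    coords = base.unsqueeze(0) + flow                              # sampling position of each output pixel
    # Normalize to [-1, 1] as required by grid_sample, ordered (x, y) in the last dimension.
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)
    return F.grid_sample(image, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# warp_0 = warp(I_0, F_0); warp_1 = warp(I_1, F_1)   # the two mapping frame images
```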
Step S1420, smoothly synthesizing the two mapping frame images with the image mapping weights as hyper-parameters to obtain a fused frame image:
On the basis of the two mapping frame images, they are smoothly combined using the weighting data corresponding to the mask map, namely the image mapping weights, as a hyper-parameter, thereby generating the fused frame image corresponding to the transition frame image. Please refer to the following formula:
merge = mask × warp_0 + (1 − mask) × warp_1
As can be seen from the formula, merge is the result of smoothly weighting and summing the two mapping frame images warp_0 and warp_1 with the mask map as the hyper-parameter. It can be understood that each weight value corresponding to the image mapping weights in the mask map is normalized in advance to the range 0 to 1, so the mask map can be used directly as a hyper-parameter for smooth synthesis.
Step S1430, overlapping the fusion frame image with the residual value to obtain a transition frame image:
Finally, the transition frame image can be obtained by adding the fused frame image and the residual value res obtained from the frame interpolation synthesis model, with the formula:
I_t = merge + res
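Put together, the smoothing and residual superposition of steps S1420 and S1430 reduce to a few tensor operations, as in the sketch below; the final clamp to a displayable pixel range is an added assumption, not part of the formulas above.

```python
import torch

def synthesize_transition_frame(warp_0, warp_1, mask, res):
    # mask holds the image mapping weights, already normalized to [0, 1].
    merge = mask * warp_0 + (1.0 - mask) * warp_1   # fused frame image (step S1420)
    i_t = merge + res                               # transition frame image (step S1430)
    return i_t.clamp(0.0, 1.0)                      # clamping is an assumption for display
```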
It can therefore be understood that this embodiment uses the residual information obtained from the two reference frame images and their corresponding optical flow prediction vectors, comprising the residual value and the mask map, and uses the pixel values of the mask map as the weights for smoothing over the two reference frame maps, obtaining a fused frame image onto which the residual value is superimposed and finally generating the transition frame image. This gives the complete construction process of the transition frame image. Because the transition frame image is generated on a residual principle, unlike the prior art, the image-based residual calculation improves the accuracy and fineness of the interpolated frame; meanwhile, the interpolated frame image is output directly, which reduces the computation time of network post-processing. Overall, the transition frame image required for frame interpolation can be generated faster and more accurately, so that the target video plays more smoothly and naturally after frame interpolation.
Referring to fig. 6, in an expanded embodiment, a network architecture as shown in fig. 7 is used to train each relevant model of the present application, and the network architecture includes a pre-trained optical flow calculation model for calculating optical flow real vectors of two reference frame images, an optical flow prediction model for predicting optical flow prediction vectors of the two reference frame images, a pre-trained image feature extraction model for extracting image feature vectors of the two reference frame images, and an interpolation synthesis model for performing interpolation according to the output of each model. Wherein the optical flow calculation model functions as a teacher network for guiding the training of the optical flow prediction model.
According to this network architecture, the optical flow prediction model and the frame interpolation synthesis model are jointly trained by creating a training task, and the training process comprises the following steps:
step S2100, performing framing processing on a sample video acquired in advance to generate a sample atlas, where the sample atlas includes: two training frame images and a sample frame image, wherein the sample frame image is located in a time interval corresponding to the two training frame images:
Samples for model training should be prepared first. In this embodiment, the training samples are prepared as follows: a sample video for model training is collected and subjected to framing processing, which splits the sample video into a plurality of frame images distributed along the time axis. The sequence of frame images obtained by framing is then packed into sample sets in groups of four, and each packed group is called a sample atlas. The composition of the sample atlas is not limited to this, however: depending on the application scenario, in some embodiments 3, 5, 6, or more consecutive frame images of the sequence are packed into a sample atlas.
The sample atlas includes: a first training frame image, a second training frame image, and a sample frame image, wherein the sample frame image is randomly selected within the time interval represented by the first and second training frame images. Specifically, the frame images located first and last in the sample atlas are selected as the first and second training frame images, and one frame image is randomly selected from the remaining frame images as the sample frame image. Thus the first training frame image serves as the first reference frame image I_0 and the second training frame image serves as the second reference frame image I_1.
For example, in some embodiments the original frames of a sample video are extracted and stored in video playback order, each extracted image is scaled to a resolution of 256 pixels wide by 256 pixels high, and the sequence images are finally packed in groups of 4 frames (Frame0, Frame1, Frame2, Frame3). In the training process, one of the middle frames (Frame1 or Frame2) can be selected arbitrarily as the sample frame image I_t, with Frame0 and Frame3 serving as the first and second training frame images respectively, yielding the sample atlas (I_0, I_t, I_1).
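The frame extraction, scaling, and packing just described might look like the following sketch, assuming OpenCV is used for decoding; the video path and group size are placeholders.

```python
import cv2

def build_sample_atlases(video_path, size=(256, 256), group=4):
    """Split a sample video into frames, scale to 256x256, and pack consecutive groups of 4 frames."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, size))   # stored in playback order
    cap.release()
    # Each packed group (Frame0, Frame1, Frame2, Frame3) is one sample atlas.
    return [frames[i:i + group] for i in range(0, len(frames) - group + 1, group)]

# atlases = build_sample_atlases("sample_video.mp4")   # hypothetical path
# In each atlas, Frame0/Frame3 become I_0/I_1 and one of Frame1/Frame2 is chosen at random as I_t.
```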
In some embodiments, to enhance the robustness of the frame interpolation model, image enhancement processing is applied to the first and second training frame images; the enhancement processing includes, without limitation, random cropping, random rotation, random noise addition, regularization processing, and the like.
A plurality of sample atlases form a sample library, and K-fold cross-validation can be applied during training to divide the sample atlases in the library into training and test sets at a ratio of 9:1, with the test set rotated among the sample atlases so that each sample atlas has one turn as a test-set member. K-fold cross-validation helps reduce dependence on any particular sample atlas and promotes rapid convergence of each relevant model in the network architecture of the present application.
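The 9:1 rotation can be sketched as below; the split logic is illustrative and assumes the sample atlases are held in a simple list.

```python
def kfold_splits(atlases, k=10):
    """Yield (training set, test set) pairs, rotating the test fold so each atlas serves once."""
    fold = max(len(atlases) // k, 1)
    for i in range(k):
        test = atlases[i * fold:(i + 1) * fold]
        train = atlases[:i * fold] + atlases[(i + 1) * fold:]
        yield train, test
```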
Step S2200, inputting the two training frame images into an optical flow calculation model pre-trained to a convergence state to calculate their optical flow real vector:
When the sample atlas is used for model training, the first and second training frame images are superimposed and input into the optical flow calculation model, where image superposition means that a weighting operation is performed on the corresponding pixel points of the first and second training frame images.
The superimposed first and second training frame images are input into the optical flow calculation model. The optical flow calculation model is pre-trained and suited to calculating the optical flow real vector between the two training frame images; like the optical flow prediction model of the present application, it is a convolutional neural network model for extracting motion vectors between images.
The optical flow calculation model can be realized on a U-net network model, a U2-net network model, a convolutional neural network model, a deep convolutional neural network model, a recurrent neural network model, or a variant of these neural network models; a recommended known implementation is Flownet.
Step S2300, inputting the two training frame images into the trained optical flow prediction model to calculate optical flow prediction vectors of the transition frame image with respect to the two training frame images:
In the same way as the previous step, the first and second training frame images are superimposed and then input into the optical flow prediction model under training for optical flow prediction, obtaining the corresponding optical flow prediction vector. The optical flow prediction model may adopt the same network architecture as the optical flow calculation model.
It can be understood that the optical flow calculation model and the optical flow prediction model each yield an optical flow vector, and the mean square error between the two is taken as the loss difference between them. Subsequently, when the network architecture of the present application back-propagates and performs gradient updates, this loss difference is minimized, so that the optical flow prediction model's predictions continually approach the true value. The optical flow calculation model therefore guides the training of the optical flow prediction model in this process and helps it converge quickly.
Step S2400, inputting the two training frame images into an image feature extraction model pre-trained to a convergence state to obtain two corresponding image feature vectors:
The image feature extraction model may adopt any of several known models pre-trained to a convergence state; please refer to the disclosure of the previous embodiments, which is not repeated here. On this basis, the image feature extraction model is used to extract the image feature vectors of the two training frame images respectively.
Step S2500, inputting the two training frame images, their image feature vectors, and the optical flow prediction vectors into the frame interpolation synthesis model under training to calculate residual information and obtain the corresponding transition frame image:
For the working principle of the frame interpolation synthesis model, reference may be made to the foregoing exemplary embodiments. In this embodiment, the frame interpolation synthesis model under training calculates the corresponding residual information from the image feature vectors of the two training frame images and their corresponding optical flow prediction vectors. On the basis of the residual information, image transformation is performed with reference to the first and second training frame images and their corresponding optical flow prediction vectors to obtain the fused frame image, and the residual value is then superimposed to obtain the transition frame image.
Step S2600, calculating a loss value between the transition frame image and the sample frame image according to a preset loss function, and continuing iterative training when the loss value is greater than a preset loss threshold, where the loss value is a weighted sum of a plurality of difference values, the plurality of difference values including: the loss difference between the optical flow prediction vector and the optical flow real vector, the mean square error of semantic features between the sample frame image and the transition frame image, and the absolute error between the sample frame image and the mapping frame image calculated from the residual information:
After the frame interpolation synthesis model obtains the transition frame map, the loss value can be calculated against the sample frame map in the sample atlas; more specifically, the loss value is the weighted sum of the following components:
loss_l1 = MSE(I_t, gt)
loss_mask = Σ|merge − gt|
loss = α × loss_flow + β × loss_l1 + γ × loss_mask
wherein loss_flow is the loss difference between the optical flow prediction vector and the optical flow real vector, loss_l1 is the mean square error of the semantic features between the sample frame image and the transition frame image, loss_mask is the absolute error between the sample frame image and the mapping frame image calculated from the residual information, gt is the sample frame image, I_t is the transition frame map, and α, β, and γ are the weights of the respective loss terms, which can be determined by one skilled in the art from prior knowledge or actual measurement.
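Read literally, the three terms can be assembled as in the following sketch. Treating loss_flow as the mean square error between the predicted and real optical flow, and computing loss_l1 directly on the images rather than on separately extracted semantic features, are simplifying assumptions; the weights alpha, beta, and gamma are free hyper-parameters.

```python
import torch
import torch.nn.functional as F

def interpolation_loss(flow_pred, flow_real, i_t, gt, merge, alpha=1.0, beta=1.0, gamma=1.0):
    loss_flow = F.mse_loss(flow_pred, flow_real)   # supervision from the optical flow calculation model
    loss_l1 = F.mse_loss(i_t, gt)                  # error between transition frame map I_t and sample frame gt
    loss_mask = torch.sum(torch.abs(merge - gt))   # absolute error between the fused image and gt
    return alpha * loss_flow + beta * loss_l1 + gamma * loss_mask
```

A perceptual variant of loss_l1 would first pass I_t and gt through the image feature extraction model and compare the resulting feature maps; the sketch keeps the pixel-level form shown in the formula.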
In the iterative training process, an SGD optimizer is used with an initial learning rate of 1e-4, and training is stopped when the loss value no longer decreases, after roughly 200 × 15000 iteration steps. The neural network parameters obtained at that point mean that each corresponding model in the network architecture of the present application has been trained to a convergence state, so the models can be put into production.
In this embodiment, a preset optical flow calculation model is used, and the optical flow real vector it calculates is added as a reference value when the data set is created. During network training, this optical flow loss is included in the loss value of the whole network architecture as supervision for the optical flow prediction vector calculated by the optical flow prediction model, improving the accuracy of the network architecture's optical flow prediction; meanwhile, the optical flow calculation model is not used during inference after deployment, which reduces the running time of actual frame interpolation and preserves its running efficiency.
The network architecture is thus trained as a whole. It can be seen that this embodiment adopts a new training approach for building the network architecture and training the relevant models: the relevant models converge faster during training, and the trained models can be used to produce high-quality interpolated frame images, so that the target video plays more smoothly and the user experience improves.
The method is suitable for application scenarios such as live network video and network video playback. A plug-in developed from this technical solution can be integrated into a video transmission component; in real-time audio and video transmission, when the network is unstable and bandwidth drops, the plug-in supplements the frame rate and improves the smoothness of the picture. The plug-in can also be applied to video post-production, video slow-motion enhancement, and the like. Actual measurement shows that when applied to real-time streaming, such as a live broadcast scenario, it can maintain live fluency in a weak network environment; and when the original frame rate of the video stream is low, it can be used to raise the frame rate. It can be understood that the application has particular value in scenes containing large motion.
Referring to fig. 8, a video frame interpolation apparatus provided for one of the purposes of the present application is a functional implementation of the video frame interpolation method of the present application, and the apparatus includes: a reference module 1100, an optical flow prediction module 1200, a residual generation module 1300, and a frame interpolation synthesis module 1400. The reference module 1100 is configured to acquire a target video to be processed by frame interpolation and extract two reference frame images that are continuous in the time domain in the target video; the optical flow prediction module 1200 is configured to calculate, with a pre-trained optical flow prediction model, the optical flow prediction vector of a transition frame image between the two reference frame images relative to the two reference frame images; the residual generation module 1300 is configured to generate residual information of the transition frame image from the optical flow prediction vector and the image feature vectors of the two reference frame images with a pre-trained frame interpolation synthesis model, where the residual information includes a residual value and image mapping weights; the frame interpolation synthesis module 1400 is configured to reference the two reference frame images with the pre-trained frame interpolation synthesis model, generate the transition frame image according to the optical flow prediction vector and the residual information, and insert the transition frame image between the two reference frame images for playing.
In a further embodiment, the reference module 1100 comprises: the frame rate acquisition submodule is used for acquiring frame rate data of a video to be played; the frame rate comparison submodule is used for comparing the frame rate data with a preset frame rate threshold; the video determining submodule is used for determining the video to be played as the target video when the frame rate value represented by the frame rate data is smaller than the frame rate threshold value; and the frame interpolation starting submodule is used for extracting two reference frame images along the time domain of the target video so as to perform frame interpolation processing.
In a further embodiment, the optical flow prediction module 1200 comprises: the channel merging submodule is used for generating a superimposed image after the two reference frame images are subjected to channel image superimposition; the convolution pooling submodule is used for performing convolution pooling processing on the superposed image through a convolution layer to generate down-sampling characteristics; the deconvolution pooling submodule is used for carrying out interpolation processing on the down-sampling features through a deconvolution layer to generate up-sampling features; and the prediction generation sub-module is used for performing feature fusion and superposition on the down-sampling features and the up-sampling features to generate an optical flow prediction vector of the transition frame image relative to the two reference frame images.
In a further embodiment, the residual generating module 1300 comprises: the feature extraction sub-module is used for extracting image feature vectors of the two reference frame images by a pre-trained image feature extraction model; the residual error calculation submodule is used for calculating the residual error value corresponding to the transition frame image according to the two image feature vectors and the corresponding optical flow prediction vectors by the interpolation frame synthesis model; and the information output submodule is used for synthesizing a mask image for representing the image mapping weight by using the two reference frame images as reference through the frame insertion synthesis model.
In a further embodiment, the frame-interpolation synthesis module 1400 includes: the image transformation submodule is used for respectively carrying out corresponding image transformation on the two reference frame images according to the optical flow prediction vector to obtain two mapping frame images; the smooth synthesis sub-module is used for carrying out smooth synthesis on the two mapping frame images by taking the image mapping weight as a hyper-parameter to obtain a fusion frame image; and the fusion generation submodule is used for superposing the fusion frame image on the residual value to obtain a transition frame image.
In an extended embodiment, the optical flow prediction model and the frame interpolation synthesis model are jointly trained, and the training device comprises: an atlas generation module, configured to perform framing processing on a pre-acquired sample video to generate a sample atlas, where the sample atlas includes two training frame images and a sample frame image located within the time interval corresponding to the two training frame images; an optical flow calculation module, configured to input the two training frame images into an optical flow calculation model pre-trained to a convergence state to calculate their optical flow real vector; an optical flow prediction module 1200, configured to input the two training frame images into the optical flow prediction model under training to calculate the optical flow prediction vectors of the transition frame image relative to the two training frame images; a feature extraction module, configured to input the two training frame images into an image feature extraction model pre-trained to a convergence state to obtain the two corresponding image feature vectors; a comprehensive generation module, configured to input the two training frame images, their image feature vectors, and the optical flow prediction vectors into the frame interpolation synthesis model under training to calculate residual information and obtain the corresponding transition frame image; and a gradient updating module, configured to calculate a loss value between the transition frame image and the sample frame image according to a preset loss function and to continue iterative training when the loss value is greater than a preset loss threshold, where the loss value is a weighted sum of a plurality of difference values including: the loss difference between the optical flow prediction vector and the optical flow real vector, the mean square error of semantic features between the sample frame image and the transition frame image, and the absolute error between the sample frame image and the mapping frame image calculated from the residual information.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Fig. 9 shows a schematic diagram of the internal structure of the computer device. The computer device includes a processor, a computer-readable storage medium, a memory, and a network interface connected by a system bus. The computer-readable storage medium of the computer device stores an operating system, a database, and computer-readable instructions; the database can store control information sequences, and the computer-readable instructions, when executed by the processor, can cause the processor to implement a video frame interpolation method. The processor of the computer device provides computing and control capability and supports the operation of the whole computer device. The memory of the computer device may store computer-readable instructions that, when executed by the processor, cause the processor to perform the video frame interpolation method of the present application. The network interface of the computer device is used for connecting and communicating with a terminal. Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In this embodiment, the processor is configured to execute specific functions of each module and its sub-module in fig. 8, and the memory stores program codes and various data required for executing the modules or sub-modules. The network interface is used for data transmission to and from a user terminal or a server. The memory in this embodiment stores program codes and data required for executing all modules/sub-modules in the video framing device of the present application, and the server can call the program codes and data of the server to execute the functions of all sub-modules.
The present application further provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the video framing method of any of the embodiments of the present application.
The present application also provides a computer program product comprising computer programs/instructions which, when executed by one or more processors, implement the steps of the method according to any embodiment of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments of the present application can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when the computer program is executed, the processes of the embodiments of the methods can be included. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
In summary, the present application achieves end-to-end frame interpolation of a target video to improve its video display quality. In the training stage, the corresponding models are gradient-updated by means of the loss value between the optical flow real vector and the optical flow prediction vector determined from the two reference frame images, together with the difference values involving the intermediately generated images, so that the models are easier to train to convergence and the accuracy of their optical flow prediction can be improved, which in turn improves the display quality of the target video with transition frames inserted; the application prospects are broad.
Those of skill in the art will appreciate that the various operations, methods, steps in the processes, acts, or solutions discussed in this application can be interchanged, modified, combined, or eliminated. Further, other steps, measures, or schemes in various operations, methods, or flows that have been discussed in this application can be alternated, altered, rearranged, broken down, combined, or deleted. Further, steps, measures, schemes in the prior art having various operations, methods, procedures disclosed in the present application may also be alternated, modified, rearranged, decomposed, combined, or deleted.
The foregoing is only a partial embodiment of the present application. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present application, and these improvements and refinements should also be regarded as falling within the protection scope of the present application.

Claims (9)

1. A method for video frame interpolation, comprising:
acquiring a target video to be subjected to frame interpolation processing, and extracting two reference frame images which are continuous in a time domain from the target video;
calculating optical flow prediction vectors of a transition frame image between the two reference frame images relative to the two reference frame images by using a pre-trained optical flow prediction model;
generating residual error information of the transition frame images by a pre-trained frame-interpolation synthetic model according to the optical flow prediction vectors and the image feature vectors of the two reference frame images, wherein the residual error information comprises residual error values and image mapping weights;
and referencing the two reference frame images by a pre-trained frame-insertion synthetic model, generating the transition frame image according to the optical flow prediction vector and the residual error information, and inserting the transition frame image between the two reference frame images for playing.
2. The video frame interpolation method of claim 1, wherein the steps of obtaining a target video to be frame interpolated and extracting two reference frame images which are continuous in a time domain in the target video comprise:
acquiring frame rate data of a video to be played;
comparing the frame rate data with a preset frame rate threshold;
when the frame rate value represented by the frame rate data is smaller than the frame rate threshold value, determining that the video to be played is the target video;
and extracting two reference frame images along the time domain of the target video to perform frame interpolation processing.
3. The method of claim 1, wherein the computing of the optical flow prediction vectors of the transition frame image between two reference frame images relative to the two reference frame images from the pre-trained optical flow prediction model comprises the steps of:
generating a superimposed image after superimposing the channel images of the two reference frame images;
performing convolution pooling on the superposed image through a convolution layer to generate down-sampling features;
interpolating the downsampling features through the deconvolution layer to generate upsampling features;
and performing feature fusion and superposition on the downsampling features and the upsampling features to generate optical flow prediction vectors of the transition frame image relative to the two reference frame images.
4. The method of claim 1, wherein generating residual information of the transition frame map from the optical flow prediction vectors and the image feature vectors of the two reference frame maps by a pre-trained frame-insertion synthesis model comprises:
extracting image feature vectors of the two reference frame images by a pre-trained image feature extraction model;
calculating the residual value corresponding to the transition frame image by the interpolation frame synthesis model according to the two image feature vectors and the corresponding optical flow prediction vectors;
and synthesizing a mask map for representing the image mapping weight by using the two reference frame maps as references by the frame insertion synthesis model.
5. The method of claim 1, wherein the two reference frame maps are referred to by a pre-trained frame-insertion synthesis model, the transition frame map is generated according to the optical flow prediction vector and the residual information, and the transition frame map is inserted between the two reference frame maps for playing, comprising the following steps:
respectively carrying out corresponding image transformation on the two reference frame images according to the optical flow prediction vector to obtain two mapping frame images;
smoothly synthesizing the two mapping frame images by taking the image mapping weight as a hyper-parameter to obtain a fusion frame image;
and overlapping the fused frame image with the residual value to obtain a transition frame image.
6. The method of any of claims 1-5, wherein the optical flow prediction model and the frame-interpolation synthesis model are jointly trained, and the training process comprises the following steps:
performing framing processing on a pre-acquired sample video to generate a sample atlas, wherein the sample atlas comprises: two training frame images and a sample frame image, the sample frame image being located within a time interval corresponding to the two training frame images;
inputting the two training frame images into an optical flow calculation model which is pre-trained to a convergence state to calculate an optical flow real vector of the two training frame images;
inputting the two training frame images into the trained optical flow prediction model to calculate optical flow prediction vectors of the transition frame images relative to the two training frame images;
inputting the two training frame images into an image feature extraction model which is pre-trained to a convergence state to obtain two corresponding image feature vectors;
inputting the two training frame images, the image characteristic vectors and the light stream prediction vectors thereof into the trained frame interpolation synthetic model to calculate residual error information so as to obtain corresponding transition frame images;
calculating a loss value between the transition frame image and the sample frame image according to a preset loss function, continuing iterative training when the loss value is greater than a preset loss threshold, wherein the loss value is a weighted sum of a plurality of difference values, and the plurality of difference values comprise: loss difference between optical flow prediction vector and optical flow real vector, mean square error of semantic feature between the sample frame image and the transition frame image, and absolute error between the sample frame image and mapping frame image calculated according to the residual information.
7. A video frame interpolation apparatus, comprising:
the reference module is used for acquiring a target video to be subjected to frame interpolation processing and extracting two reference frame images which are continuous in a time domain in the target video;
the optical flow prediction module is used for calculating optical flow prediction vectors of a transition frame image between the two reference frame images relative to the two reference frame images by using a pre-trained optical flow prediction model;
a residual error generation module, configured to generate residual error information of the transition frame image according to the optical flow prediction vector and the image feature vectors of the two reference frame images by using a pre-trained frame-inserted synthesis model, where the residual error information includes a residual error value and an image mapping weight;
and the frame interpolation synthesis module is used for referencing the two reference frame images by a pre-trained frame interpolation synthesis model, generating the transition frame image according to the optical flow prediction vector and the residual error information, and inserting the transition frame image between the two reference frame images for playing.
8. A computer device comprising a central processor and a memory, characterized in that the central processor is adapted to invoke execution of a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that it stores a computer program implemented according to the method of any one of claims 1 to 6 in the form of computer-readable instructions, which, when invoked by a computer, performs the steps comprised by the corresponding method.
CN202111267436.3A 2021-10-29 2021-10-29 Video frame insertion method and device, equipment, medium and product thereof Active CN114007135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111267436.3A CN114007135B (en) 2021-10-29 2021-10-29 Video frame insertion method and device, equipment, medium and product thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111267436.3A CN114007135B (en) 2021-10-29 2021-10-29 Video frame insertion method and device, equipment, medium and product thereof

Publications (2)

Publication Number Publication Date
CN114007135A CN114007135A (en) 2022-02-01
CN114007135B true CN114007135B (en) 2023-04-18

Family

ID=79924806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111267436.3A Active CN114007135B (en) 2021-10-29 2021-10-29 Video frame insertion method and device, equipment, medium and product thereof

Country Status (1)

Country Link
CN (1) CN114007135B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114640885B (en) * 2022-02-24 2023-12-22 影石创新科技股份有限公司 Video frame inserting method, training device and electronic equipment
CN115457449B (en) * 2022-11-11 2023-03-24 深圳市马博士网络科技有限公司 Early warning system based on AI video analysis and monitoring security protection
CN118573827A (en) * 2023-02-28 2024-08-30 万有引力(宁波)电子科技有限公司 Fusion display method, system and storage medium
CN116886996B (en) * 2023-09-06 2023-12-01 浙江富控创联技术有限公司 Digital village multimedia display screen broadcasting system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110324664A (en) * 2019-07-11 2019-10-11 南开大学 A kind of video neural network based mends the training method of frame method and its model
CN112104830A (en) * 2020-08-13 2020-12-18 北京迈格威科技有限公司 Video frame insertion method, model training method and corresponding device
CN112565653A (en) * 2020-12-01 2021-03-26 咪咕文化科技有限公司 Video frame insertion method, system, electronic equipment and storage medium
CN112804561A (en) * 2020-12-29 2021-05-14 广州华多网络科技有限公司 Video frame insertion method and device, computer equipment and storage medium
CN113365110A (en) * 2021-07-14 2021-09-07 北京百度网讯科技有限公司 Model training method, video frame interpolation method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10776688B2 (en) * 2017-11-06 2020-09-15 Nvidia Corporation Multi-frame video interpolation using optical flow
AU2019451948B2 (en) * 2019-06-18 2023-10-26 Huawei Technologies Co., Ltd. Real-time video ultra resolution

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110324664A (en) * 2019-07-11 2019-10-11 南开大学 A kind of video neural network based mends the training method of frame method and its model
CN112104830A (en) * 2020-08-13 2020-12-18 北京迈格威科技有限公司 Video frame insertion method, model training method and corresponding device
CN112565653A (en) * 2020-12-01 2021-03-26 咪咕文化科技有限公司 Video frame insertion method, system, electronic equipment and storage medium
CN112804561A (en) * 2020-12-29 2021-05-14 广州华多网络科技有限公司 Video frame insertion method and device, computer equipment and storage medium
CN113365110A (en) * 2021-07-14 2021-09-07 北京百度网讯科技有限公司 Model training method, video frame interpolation method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhang Peijian et al. A lightweight video frame interpolation algorithm based on cascaded convolutional neural networks. Microelectronics & Computer. 2021, Vol. 38 (No. 3), pp. 39-45. *
Zi Lingling; Cong Xin. A region-guided frame interpolation algorithm for image sequences. Journal of Chinese Computer Systems. 2015, (09), pp. 2120-2124. *

Also Published As

Publication number Publication date
CN114007135A (en) 2022-02-01

Similar Documents

Publication Publication Date Title
CN114007135B (en) Video frame insertion method and device, equipment, medium and product thereof
US10970600B2 (en) Method and apparatus for training neural network model used for image processing, and storage medium
WO2022141819A1 (en) Video frame insertion method and apparatus, and computer device and storage medium
CN109003282B (en) Image processing method and device and computer storage medium
JP2022500734A (en) Computer realization method using convolutional neural network, device for synthetic image generation and computer program product
CN111062872A (en) Image super-resolution reconstruction method and system based on edge detection
CN111179167A (en) Image super-resolution method based on multi-stage attention enhancement network
CN112652058B (en) Face image replay method and device, computer equipment and storage medium
CN109903315B (en) Method, apparatus, device and readable storage medium for optical flow prediction
CN112633234B (en) Face glasses model training and application method and device, equipment and medium thereof
CN114092774B (en) RGB-T image significance detection system and detection method based on information flow fusion
US20220124257A1 (en) Generating stylized images in real time on mobile devices
CN114339030B (en) Network live video image stabilizing method based on self-adaptive separable convolution
WO2023091249A1 (en) Neural semantic fields for generalizable semantic segmentation of 3d scenes
GB2620467A (en) Applying object-aware style transfer to digital images
CN115222581A (en) Image generation method, model training method, related device and electronic equipment
CN117036436A (en) Monocular depth estimation method and system based on double encoder-decoder
WO2023010981A1 (en) Encoding and decoding methods and apparatus
CN115049559A (en) Model training method, human face image processing method, human face model processing device, electronic equipment and readable storage medium
WO2022117067A1 (en) Content-aware bifurcated upscaling
CN116071478B (en) Training method of image reconstruction model and virtual scene rendering method
CN117635478B (en) Low-light image enhancement method based on spatial channel attention
CN114119698B (en) Unsupervised monocular depth estimation method based on attention mechanism
US12062152B2 (en) Re-noising and neural network based image enhancement
CN114067174A (en) Editing propagation method and system based on depth similarity

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230801

Address after: No. 79 Wanbo Second Road, Nancun Town, Panyu District, Guangzhou City, Guangdong Province, 5114303802 (self declared)

Patentee after: Guangzhou Huanju Mark Network Information Co.,Ltd.

Address before: 511442 24 / F, building B1, Wanda Plaza, Nancun Town, Panyu District, Guangzhou City, Guangdong Province

Patentee before: GUANGZHOU HUADUO NETWORK TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right