
CN116233534A - Video processing method and device, electronic equipment and storage medium - Google Patents

Video processing method and device, electronic equipment and storage medium

Info

Publication number
CN116233534A
Authority
CN
China
Prior art keywords
shot
video
lens
information
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211666037.9A
Other languages
Chinese (zh)
Inventor
陈泽宇
王欣博
曹翔
黄雅勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd
Priority to CN202211666037.9A
Publication of CN116233534A

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/441Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
    • H04N21/4665Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms involving classification methods, e.g. Decision trees

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The disclosure provides a video processing method and apparatus, an electronic device, and a storage medium. The video processing method comprises the following steps: acquiring a video to be processed; performing a shot segmentation operation on the video to be processed to identify a plurality of shots included in the video to be processed; for each shot of the plurality of shots, determining one or more pieces of shot information corresponding to the shot based on one or more classification models, wherein the one or more classification models are pre-trained; and generating a shot script corresponding to the video to be processed based on the one or more pieces of shot information.

Description

Video processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision, and in particular, to a video processing method and apparatus, an electronic device, a computer readable storage medium, and a computer program product.
Background
With the maturation of internet technology and the development of 5G technology, content on the network is increasingly presented as video. Compared with traditional image-text content, video is more complicated to produce: shooting, editing, and the like generally need to be performed according to a shot script in order to obtain the expected video. The shot script is a working document that presents the visual images of a video in text form; it is not only a blueprint for shooting and editing, but also records information such as the creative concept and the shooting and composition approach of the video.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
The present disclosure provides a video processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided a video processing method including: acquiring a video to be processed; performing a shot segmentation operation on the video to be processed to identify a plurality of shots included in the video to be processed; for each shot of the plurality of shots, determining one or more pieces of shot information corresponding to the shot based on one or more classification models, wherein the one or more classification models are pre-trained; and generating a shot script corresponding to the video to be processed based on the one or more pieces of shot information.
According to another aspect of the present disclosure, there is also provided a video processing apparatus including: an acquisition module configured to acquire a video to be processed; a shot segmentation module configured to perform a shot segmentation operation on the video to be processed so as to identify a plurality of shots included in the video to be processed; a shot information determination module configured to determine, for each of the plurality of shots, one or more pieces of shot information corresponding to the shot based on one or more classification models, wherein the one or more classification models are pre-trained; and a shot script generation module configured to generate a shot script corresponding to the video to be processed based on the one or more pieces of shot information.
According to another aspect of the present disclosure, there is also provided an electronic apparatus including: at least one processor; and at least one memory communicatively coupled to the at least one processor, wherein the at least one memory stores a computer program that when executed by the at least one processor implements the video processing method described above.
According to another aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the above-described video processing method.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the video processing method described above.
In accordance with one or more embodiments of the present disclosure, the shot information of each shot in the video to be processed is determined using one or more pre-trained classification models, and a shot script corresponding to the video to be processed is generated based on the shot information. Because the classification models used to generate the shot information are pre-trained, the accuracy of the obtained shot information can be ensured provided the training samples are sufficient, and the time and labor costs of manually splitting the video into shots and analyzing the shot information of each shot can be greatly reduced, so that the efficiency of video creation and learning is improved.
According to one or more further embodiments of the present disclosure, the above video processing method is not limited to single-shot videos, but may be applied to a complex video composed of different shots and different scenes and generate a corresponding shot script for it, thereby improving versatility.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a flow chart of a video processing method according to some embodiments of the present disclosure;
FIG. 2 illustrates a flow chart of a video processing method according to further embodiments of the present disclosure;
FIG. 3 illustrates a flowchart of identifying the scene to which each shot belongs, according to some embodiments of the present disclosure;
FIG. 4 illustrates a flow chart of a video processing method according to further embodiments of the present disclosure;
FIG. 5 illustrates a flow chart of a video processing method according to further embodiments of the present disclosure;
FIG. 6 illustrates a flow chart of generating corresponding summary information for each scene included in a video to be processed, according to some embodiments of the present disclosure;
FIG. 7 illustrates a flow chart of a video processing method according to some embodiments of the present disclosure;
fig. 8 illustrates a block diagram of a video processing apparatus according to some embodiments of the present disclosure;
fig. 9 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
Compared with traditional image-text content, video production is generally more complex, and shooting, editing, and the like need to be performed according to a shot script in order to obtain the expected video. The shot script is a working document that presents the visual images of a video in text form; it is not only a blueprint for shooting and editing, but also records information such as the creative concept and the shooting and composition approach of the video. In some cases, it may be desirable to analyze and learn from the creative concept and shooting/composition approach of high-quality videos, or to quickly find, from a large number of videos, one or more videos that match an intended topic. Therefore, it would be beneficial to be able to quickly generate the shot script corresponding to a video.
The inventors found that in the related art, the shot script corresponding to a video is generally obtained manually: a person carefully watches the video and splits it into shots, analyzes and records the shot information (e.g., shooting method, composition method, camera-movement method, shooting content, subtitle information, etc.) of each shot one by one, and writes the corresponding shot script according to the recorded shot information. However, a video may include more than one shot, and the shot information may differ greatly from shot to shot, so analyzing a video and obtaining the corresponding shot script manually leads to a significant increase in time and labor costs, which limits learning from high-quality videos and thus reduces the efficiency of subsequent video creation.
In view of this, embodiments of the present disclosure provide a video processing method that determines the shot information of each shot in a video to be processed using one or more pre-trained classification models, and automatically generates a shot script corresponding to the video based on the shot information. Therefore, the accuracy of the obtained shot information can be ensured, and the time and labor costs of manually splitting the video into shots, analyzing the shot information of each shot, and writing the shot script can be greatly reduced, so that the efficiency of video creation and learning is improved.
Meanwhile, the method can also perform a shot segmentation operation on the video to be processed so as to identify the plurality of shots included in the video. This allows the method to be applied to complex videos composed of various shot combinations rather than being limited to single-shot videos, thereby improving versatility, and it can also improve the accuracy of the obtained shot information (since shot information is typically shot-specific).
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a flow chart of a video processing method 100 according to some embodiments of the present disclosure. As shown in fig. 1, the method 100 may include: step S110, acquiring a video to be processed; step S120, performing a shot segmentation operation on the video to be processed to identify a plurality of shots included in the video to be processed; step S130, for each shot of the plurality of shots, determining one or more pieces of shot information corresponding to the shot based on one or more classification models, wherein the one or more classification models are pre-trained; and step S140, generating a shot script corresponding to the video to be processed based on the one or more pieces of shot information.
By determining the shot information of each shot in the video to be processed using the one or more pre-trained classification models, a shot script corresponding to the video to be processed can be automatically generated. Therefore, the accuracy of the obtained shot information can be ensured, and the time and labor costs of manually splitting the video into shots, analyzing the shot information of each shot, and writing the shot script can be greatly reduced, so that the efficiency of video creation and learning is improved.
In accordance with some embodiments of the present disclosure, in step S110, the stored or cached video to be processed may be read from an appropriate storage device (local and/or remote). Alternatively, the video to be processed may also be received from an external other device via a wired or wireless communication link. The scope of the presently claimed subject matter is not limited in this respect.
The video to be processed may be captured by a camera. The camera may be a stand-alone device (e.g., camera, video camera, etc.) or may be included in various types of electronic equipment (e.g., mobile phone, computer, personal digital assistant, tablet, wearable device, etc.). The camera may be an infrared camera or a visible light camera. The video to be processed may comprise video in any suitable format.
According to some embodiments of the present disclosure, step S120 of performing a shot segmentation operation on the video to be processed to identify the plurality of shots included in the video to be processed may include: framing the video to be processed to obtain a plurality of video frames; for each video frame of the plurality of video frames, determining a first similarity between the video frame and an adjacent video frame; and determining, based on the first similarity, whether the video frame is a boundary video frame of the video to be processed, the boundary video frame being used to distinguish adjacent shots among the plurality of shots. By calculating the similarity between adjacent video frames to determine whether each video frame is a boundary video frame, the accuracy of the determined boundary video frames can be improved and human error can be avoided.
According to some embodiments of the present disclosure, the plurality of video frames included in the video to be processed may be obtained by decoding the video. For example, the video to be processed may be split into individual video frames using tools such as OpenCV or FFmpeg. It will be appreciated that, alternatively, other suitable decoders may be used for the framing operation, and the scope of the claimed subject matter is not limited in this respect.
According to some embodiments of the present disclosure, the similarity between each video frame and its adjacent video frame may be determined by calculating a frame difference. For example, for a video frame i and an adjacent video frame i+1, each of size 50×50 pixels, the difference between each pixel value in video frame i and the corresponding pixel value in video frame i+1 may be calculated, and the absolute values of the calculated differences may then be summed as the frame difference between video frame i and video frame i+1. If the frame difference is less than or equal to a preset threshold, it may be determined that the similarity between video frame i and video frame i+1 is high, and thus video frame i is not a boundary video frame. If the frame difference exceeds the preset threshold, it may be determined that the similarity between video frame i and video frame i+1 is low, and thus video frame i is a boundary video frame.
It will be appreciated that the preset threshold may be any suitable value, such as 50, 100, 200, etc. It will also be appreciated that the value of the similarity may be associated with the frame difference, for example by predefining and storing a look-up table of similarity values and frame differences, and obtaining the similarity between each video frame and its adjacent video frame by interpolation.
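By way of illustration only (not part of the patent text), a minimal Python sketch of the frame-difference approach described above is given below; the use of OpenCV, the grayscale conversion, and the particular threshold value are assumptions of the sketch rather than requirements of the method.

```python
import cv2
import numpy as np

def detect_shot_boundaries(video_path: str, threshold: float = 30.0) -> list[int]:
    """Return indices of frames judged to be boundary video frames, based on the
    mean absolute pixel difference between adjacent frames."""
    cap = cv2.VideoCapture(video_path)
    boundaries = []
    prev_gray = None
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            # Frame difference between frame i and frame i+1.
            frame_diff = np.mean(cv2.absdiff(prev_gray, gray))
            if frame_diff > threshold:
                # Low similarity -> treat the previous frame as a boundary video frame.
                boundaries.append(index - 1)
        prev_gray = gray
        index += 1
    cap.release()
    return boundaries
```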
According to other embodiments of the present disclosure, the similarity between each video frame and its adjacent video frame may be determined by means of a gray-level histogram. For example, for each video frame, the number of pixels falling into each bin of the gray-level space may be counted as the gray histogram of that video frame. In this case, the gray histogram difference between video frame i and the adjacent video frame i+1 may be calculated as:
D(i, i+1) = Σ_{j=1}^{n} |H_i(j) − H_{i+1}(j)|

where n represents the number of bins of the gray histogram and H_i(j) represents the number of pixels of video frame i falling into the j-th bin. Similar to the above example of calculating the frame difference, if the gray histogram difference is less than or equal to the preset threshold, it may be determined that the similarity between video frame i and video frame i+1 is high, and thus video frame i is not a boundary video frame. If the gray histogram difference exceeds the preset threshold, it may be determined that the similarity between video frame i and video frame i+1 is low, and thus video frame i is a boundary video frame.
It will be appreciated that the preset threshold here may also be any suitable value. It will also be appreciated that the value of the similarity may similarly be associated with the gray histogram difference, for example by predefining and storing a look-up table of similarity values and gray histogram differences, and obtaining the similarity between each video frame and its adjacent video frame by interpolation.
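Again purely as an illustration, the gray-histogram difference described above could be computed as follows; the number of bins and the use of OpenCV's histogram routine are assumptions of the sketch.

```python
import cv2
import numpy as np

def gray_histogram_difference(frame_i, frame_i1, bins: int = 64) -> float:
    """Sum over bins of |H_i(j) - H_{i+1}(j)|, where H_i(j) is the number of
    pixels of frame i falling into the j-th gray-level bin."""
    gray_i = cv2.cvtColor(frame_i, cv2.COLOR_BGR2GRAY)
    gray_i1 = cv2.cvtColor(frame_i1, cv2.COLOR_BGR2GRAY)
    hist_i = cv2.calcHist([gray_i], [0], None, [bins], [0, 256]).ravel()
    hist_i1 = cv2.calcHist([gray_i1], [0], None, [bins], [0, 256]).ravel()
    return float(np.sum(np.abs(hist_i - hist_i1)))
```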
According to further embodiments of the present disclosure, whether each video frame is a shot boundary video frame may be determined using TransNet V2, a neural-network-based sequence labeling model. For example, by fusing and encoding the sequence of image features of a plurality of video frames using a neural network, encoded features containing the target information (e.g., whether a frame is a boundary video frame) may be generated. The generated features are then processed by the linear layer of the neural network to obtain a result (e.g., a 2-class classification result) indicating whether each video frame is a boundary video frame.
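The patent does not disclose the internal structure of TransNet V2. Purely as a schematic of the idea of fusing an image-feature sequence and then classifying each frame with a linear layer, a sequence-labeling model might be sketched in PyTorch as follows; all layer types, sizes, and the feature dimension are assumptions and this is not the actual TransNet V2 architecture.

```python
import torch
import torch.nn as nn

class BoundaryLabeler(nn.Module):
    """Schematic frame-sequence labeller: fuses per-frame image features over time
    and emits a 2-class (boundary / non-boundary) score for every frame."""
    def __init__(self, feature_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        # Temporal fusion of the image-feature sequence (1D convolution over time).
        self.encoder = nn.Sequential(
            nn.Conv1d(feature_dim, hidden_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Linear layer producing the per-frame 2-class result.
        self.classifier = nn.Linear(hidden_dim, 2)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, num_frames, feature_dim)
        x = self.encoder(frame_features.transpose(1, 2)).transpose(1, 2)
        return self.classifier(x)  # (batch, num_frames, 2)
```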
According to some embodiments of the present disclosure, step S130 of determining, for each of the plurality of shots, one or more pieces of shot information corresponding to the shot based on one or more classification models may include: acquiring a first plurality of shot samples and their predetermined pieces of shot information; establishing the one or more classification models with the first plurality of shot samples as training inputs and the predetermined pieces of shot information as training outputs; and classifying the shot based on the one or more classification models, the classification of the shot indicating the one or more pieces of shot information corresponding to the shot.
By acquiring a plurality of shot samples and the associated shot information to build the one or more classification models, automatic classification and analysis at the shot level can be achieved, and thus more accurate shot information can be obtained for each shot. In addition, provided the shot samples are sufficient, obtaining the shot information through pre-trained classification models can further improve the accuracy of the result.
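As one possible, purely illustrative realization of pre-training such a classification model, a generic image classifier could be fine-tuned on a representative frame of each shot sample, with the predetermined shot information as the training target; the backbone, the label set, and the function names below are assumptions, not the architecture prescribed by the patent.

```python
import torch.nn as nn
from torchvision import models

# Hypothetical label set for one kind of shot information (shot scale / "Jing Bie").
SCALE_CLASSES = ["long", "full", "medium", "close", "close_up"]

def build_shot_classifier(num_classes: int = len(SCALE_CLASSES)) -> nn.Module:
    """Image classifier applied to a representative frame of each shot."""
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

def train_step(model, frames, labels, optimizer, loss_fn=nn.CrossEntropyLoss()):
    """One supervised step: shot samples as input, predetermined shot information as target."""
    optimizer.zero_grad()
    logits = model(frames)          # (batch, num_classes)
    loss = loss_fn(logits, labels)  # labels: class index per shot sample
    loss.backward()
    optimizer.step()
    return loss.item()
```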
According to some embodiments of the present disclosure, the one or more classification models may include a shot-scale (Jing Bie) classification model. In this technical field, the shot scale refers to the difference in the extent of the scene presented in the camera's viewfinder caused by the difference in distance between the camera and the subject when the focal length is fixed. Shot-scale categories may include, for example, long shot, full shot, medium shot, close shot, close-up, and so on. For example, a plurality of shot samples may be obtained and, for each shot sample, it may be determined which of the above shot-scale categories it belongs to, so as to form a mapping between the plurality of shot samples and the corresponding shot scales. A shot-scale classification model may be established by pre-training a model based on, for example, a deep neural network with the mapping between the plurality of shot samples and the corresponding shot scales. The shot-scale classification model may then be used to classify each shot in the video to be processed, for example labeling shot j as a long shot, labeling shot j+1 as a close shot, and so on.
According to further embodiments of the present disclosure, the one or more classification models may include a camera-movement classification model. In the art, camera movement refers to the way the camera moves when shooting a moving shot. Camera movements may be categorized as, for example, fixed, push-pull, pan, crane (rise and fall), follow, and so on. For example, a plurality of shot samples may be obtained and, for each shot sample, it may be determined which of the above camera-movement categories it belongs to, so as to form a mapping between the plurality of shot samples and the corresponding camera movements. A camera-movement classification model may be established by pre-training a model based on, for example, a deep neural network with the mapping between the plurality of shot samples and the corresponding camera movements. The camera-movement classification model may then be used to classify the camera movement of each shot in the video to be processed, for example labeling shot j as fixed, labeling shot j+1 as crane, and so on.
According to further embodiments of the present disclosure, the one or more classification models may include a focal-length classification model. In the art, the focal length refers to the distance between the camera lens and its principal point, which determines the size and apparent distance of the subject presented in the camera's viewfinder. Focal lengths may be categorized as, for example, wide-angle, medium, telephoto, and so on. For example, a plurality of shot samples may be obtained and, for each shot sample, it may be determined which of the above focal-length categories it belongs to, so as to form a mapping between the plurality of shot samples and the corresponding focal lengths. A focal-length classification model may be established by pre-training a model based on, for example, a deep neural network with the mapping between the plurality of shot samples and the corresponding focal lengths. The focal-length classification model may then be used to classify each shot in the video to be processed, for example labeling the focal length of shot j as wide-angle, labeling the focal length of shot j+1 as telephoto, and so on.
According to further embodiments of the present disclosure, the one or more classification models may include a shooting-angle classification model. Shooting angles may be categorized as, for example, eye-level (flat) shooting, low-angle shooting, high-angle shooting, and so on. For example, a plurality of shot samples may be obtained and, for each shot sample, it may be determined which of the above shooting-angle categories it belongs to, so as to form a mapping between the plurality of shot samples and the corresponding shooting angles. A shooting-angle classification model may be established by pre-training a model based on, for example, a deep neural network with the mapping between the plurality of shot samples and the corresponding shooting angles. The shooting-angle classification model may then be used to classify the shooting angle of each shot in the video to be processed, for example labeling the shooting angle of shot j as eye-level and the shooting angle of shot j+1 as low-angle.
According to further embodiments of the present disclosure, the one or more classification models may include a scene classification model. A scene may refer to the location where the plot takes place, such as a restaurant, a bedroom, and so on. For example, multiple groups of shot samples may be obtained and, for each group, it may be determined which of the above scene categories it belongs to, so as to form a mapping between the groups of shot samples and the corresponding scenes. A scene classification model may be established by pre-training a model based on, for example, a deep neural network with the mapping between the groups of shot samples and the corresponding scenes. Each shot (or each group of shots) in the video to be processed may then be scene-classified using the scene classification model, e.g., shot 1 to shot j belong to the same scene and this group of shots is labeled with the scene "restaurant", while shot j+1 to shot n belong to another scene and that group of shots is labeled with the scene "bedroom".
It will be appreciated that the classification models described above are shown for illustrative purposes only and not for limitation, and that two or more of these classification models may be used in combination. In this way, more complete and comprehensive shot information can be obtained automatically for each shot, and a shot script that better conforms to industry standards can be generated.
In some embodiments, each of these classification models may be trained separately to obtain more accurate classification results, since each classification model is then dedicated to determining one type of shot information (e.g., a camera-movement classification model, a focal-length classification model, etc.). In other embodiments, the classification models may also be trained jointly so that the trained classification model can determine more than one type of shot information (e.g., camera movement and focal length may be determined simultaneously using the combined classification model); in this case, the amount of training data may be reduced and the training process sped up. The scope of the presently claimed subject matter is not limited in this respect.
Fig. 2 illustrates a flow chart of a video processing method 200 according to further embodiments of the present disclosure. As shown in fig. 2, method 200 may include: steps S210 to S240 similar to steps S110 to S140 in the video processing method 100 described with reference to fig. 1; and step S250, identifying the scene to which each of the plurality of shots belongs.
Because the shots in a shot script are usually arranged in order by scene, identifying each scene in the video to be processed and determining the shots included in each scene makes it possible to generate a scene-based shot script, so that the shot script can reflect the real creative concept of the video and better conforms to industry standards.
It will be appreciated that although steps S210-S250 are depicted in fig. 2 in a particular order, this should not be construed as requiring that the steps be performed in the particular order shown or in sequential order. For example, step S250 may be performed before step S230 or in parallel with step S230.
Fig. 3 illustrates a flowchart of identifying the scene to which each shot belongs, according to some embodiments of the present disclosure. As shown in fig. 3, identifying the scene to which each of the plurality of shots belongs in step S250 may include: step S251, extracting the image features of each shot; step S252, determining a plurality of second similarities between the image features of the shot and the image features of each of the shots; step S253, determining, based on the plurality of second similarities, whether the shot is a scene boundary shot of the video to be processed, the scene boundary shot being used to distinguish adjacent scenes in the video to be processed; and step S254, determining a plurality of scenes included in the video to be processed according to the scene boundary shots, and classifying each shot accordingly.
According to some embodiments, which shots are scene boundary shots may be determined based on a similarity matrix. For example, for a video to be processed consisting of 100 shots, the image features of shots 1-100 are first extracted. Then, for each shot, the similarity between that shot and each of the 100 shots (including itself) is determined by comparing image features; for example, the similarity between shot 1 and each of shots 1-100 is determined by comparing the image features of shot 1 with the image features of shots 1-100, respectively, and by analogy the similarity between shot 2 and each of shots 1-100, ..., and the similarity between shot 100 and each of shots 1-100 are obtained. A similarity matrix of size 100×100 can thus be obtained, in which the similarities on the diagonal are highest, because the diagonal represents the similarity of each shot to itself. Assuming that these 100 shots belong to 10 scenes, it will be appreciated that, since the similarity between shots within the same scene will be significantly higher than the similarity between shots in different scenes, 10 sub-matrix blocks will form along the diagonal of the similarity matrix, and the scene boundary shots can be determined by identifying the demarcation points of these sub-matrices. Each shot in the video to be processed may then be categorized according to the determined scene boundary shots.
Compared with the conventional approach in which the image features of a plurality of shots are input into a specific model as a sequence and each shot is determined to be a scene boundary shot or not by sequence labeling of the image-feature sequence, the similarity-matrix-based method helps eliminate interference from other information in the images (because only the similarity factor is considered), so that scene boundary shots can be determined more accurately and conveniently.
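As an illustrative simplification of the similarity-matrix approach, the sketch below builds the full shot-to-shot similarity matrix from per-shot image features and uses only the similarity between adjacent shots as a proxy for detecting the edges of the diagonal sub-matrix blocks; the cosine similarity, the threshold, and the adjacent-shot simplification are assumptions of the sketch, not the patent's prescribed procedure.

```python
import numpy as np

def scene_boundaries_from_features(shot_features: np.ndarray, threshold: float = 0.5) -> list[int]:
    """shot_features: (num_shots, feature_dim) array of per-shot image features.
    Builds the shot-to-shot cosine-similarity matrix and marks shot i as a scene
    boundary shot when its similarity to shot i+1 drops below a threshold."""
    normed = shot_features / np.linalg.norm(shot_features, axis=1, keepdims=True)
    similarity = normed @ normed.T  # (num_shots, num_shots), diagonal is highest
    boundaries = []
    for i in range(len(similarity) - 1):
        # Simplified proxy for finding the demarcation point of a diagonal sub-matrix block.
        if similarity[i, i + 1] < threshold:
            boundaries.append(i)
    return boundaries
```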
Fig. 4 illustrates a flow chart of a video processing method 400 according to further embodiments of the present disclosure. As shown in fig. 4, method 400 may include: steps S410-S440 similar to steps S110-S140 in the video processing method 100 described with reference to fig. 1 or steps S210-S240 in the video processing method 200 described with reference to fig. 2; and step S450, generating a text description corresponding to each shot of the plurality of shots.
By generating a corresponding text description for each shot, a learner can understand the creative concept of the video to be processed more intuitively and quickly. Meanwhile, in the case where the creative concept does not meet the learner's expectations, the time required for the learner to watch the complete video can be saved (for example, the learner may skip learning from the way this video was made).
It will be appreciated that although steps S410-S450 are depicted in fig. 4 in a particular order, this should not be construed as requiring that the steps be performed in the particular order shown or in sequential order. For example, step S450 may be performed before step S430 or in parallel with step S430.
According to some embodiments of the present disclosure, step S450 of generating, for each of the plurality of shots, a text description corresponding to the shot may include: acquiring a first plurality of shot samples, wherein each shot sample in the first plurality of shot samples has a corresponding shot-sample text description; establishing a shot description generation model with the first plurality of shot samples as training inputs and the corresponding shot-sample text descriptions as training outputs; and for each of the plurality of shots, processing the shot as input to the shot description generation model to generate the text description corresponding to the shot. In this way, a corresponding text description can be automatically generated for each shot, avoiding manual analysis and summarization and thus greatly saving time and labor costs.
According to some embodiments of the present disclosure, the stored or cached shot samples may be read from an appropriate storage device. It will be appreciated that the shot samples in the first plurality of shot samples may all belong to the same video (including the same scene of the same video or different scenes of the same video), or may belong to different videos respectively; the scope of the presently claimed subject matter is not limited in this respect.
According to some embodiments of the present disclosure, a manually written shot script corresponding to each of the first plurality of shot samples may be obtained, in which a text description of the shot sample is recorded. A mapping relationship between each shot sample and the associated recorded text description may then be formed. A shot description generation model may be created by pre-training a model based on, for example, a deep neural network using the mapping relationship between each shot sample and the recorded associated text description. Each shot in the video to be processed can then be processed using the shot description generation model to obtain the text description corresponding to each shot.
According to some embodiments of the present disclosure, the shot description generation model may be an encoder-decoder based model, wherein the shot description generation model includes an encoder that receives each shot as input and a decoder that generates the text description corresponding to each shot as output. An example of such an encoder-decoder may be, for example, a BLIP-based model, which supports multimodal mixtures of encoders and decoders to cover a wider range of downstream tasks. It will be appreciated that any other suitable encoder-decoder framework may also be employed to build the shot description generation model.
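The patent names BLIP only as one example of an encoder-decoder. As an illustration, an off-the-shelf BLIP captioning checkpoint could be applied to a representative keyframe of each shot as follows; the specific library (Hugging Face transformers) and checkpoint name are assumptions and are not mandated by the patent.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def describe_shot(keyframe_path: str) -> str:
    """Generate a text description for a representative keyframe of a shot."""
    image = Image.open(keyframe_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```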
Fig. 5 illustrates a flow chart of a video processing method 500 according to further embodiments of the present disclosure. As shown in fig. 5, method 500 may include: steps S510-S550 similar to steps S210-S250 in the video processing method 200 described with reference to fig. 2; and step S560, generating summary information corresponding to each of a plurality of scenes included in the video to be processed.
By generating corresponding summary information for each scene, a learner can understand the creative theme of the video to be processed more intuitively and quickly. This saves the time required for the learner to watch the complete video in the event that the creative theme does not match the learner's expected or preferred theme (for example, the learner may skip learning from the way this video was made). In addition, by generating summary information corresponding to each scene, the content of the shot script can be further enriched, making the shot script more complete and better able to reflect information such as the creative concept and shooting/composition approach of the video.
It will be appreciated that although steps S510-S560 are depicted in fig. 5 in a particular order, this should not be construed as requiring that the steps be performed in the particular order shown or in sequential order. For example, steps S550 and S560 may be performed before step S530 or in parallel with step S530.
Fig. 6 illustrates a flow chart of generating corresponding summary information for each scene included in a video to be processed according to some embodiments of the present disclosure. As shown in fig. 6, in step S560, generating, for each of the plurality of scenes included in the video to be processed, summary information corresponding to the scene may include: step S561, building a summary generation model; step S562, acquiring all the subtitle information in the shots included in the scene to form a subtitle information set; and step S563, processing the subtitle information set as input to the summary generation model to generate the summary information corresponding to the scene. Because the summary generation model is pre-trained, the summary information of each scene can be obtained quickly and automatically, and the accuracy of the obtained summary information is ensured, thereby improving the efficiency of video creation and learning.
According to some embodiments of the present disclosure, step S561 of building the summary generation model may include: acquiring a second plurality of shot samples belonging to the same scene, wherein each shot sample in the second plurality of shot samples has corresponding subtitle information; acquiring scene-sample summary information corresponding to the same scene to which the second plurality of shot samples belong; and building a pre-trained summary generation model with the corresponding subtitle information as training input and the scene-sample summary information as training output.
According to some embodiments of the present disclosure, the stored or cached shot samples may similarly be read from an appropriate storage device. Advantageously, one or more adjacent shots of the same scene may be selected as the shot samples in the second plurality of shot samples, so that more continuous subtitle information can be obtained, which is beneficial to generating more accurate summary information.
According to some embodiments of the present disclosure, the subtitle information in each of the second plurality of shot samples may further be acquired. In some examples, one or more text boxes in each shot sample may be obtained by a text detection technique, and the text information in the one or more text boxes may then be determined as the subtitle information of each shot sample by a text recognition technique. In other examples, as described in detail below, the subtitle region in each shot sample may be detected using a deep-learning-based character detection model and the subtitle information included therein may be obtained from it. In still other examples, the audio portion of each shot sample may be extracted and the subtitle information in the shot sample acquired by an automatic speech recognition technique.
According to some embodiments of the present disclosure, similarly, a manually written shot script corresponding to each shot sample may be obtained, in which the summary information of the scene to which the shot belongs is recorded. A mapping relationship between the subtitle information of the shot samples and the recorded associated summary information of each scene may then be formed. A summary generation model may be built based on a model such as a deep neural network, pre-trained on this mapping relationship.
According to some embodiments of the present disclosure, the summary generation model may also be an encoder-decoder based model, wherein the summary generation model includes an encoder that receives as input the set of subtitle information of the shots in the same scene and a decoder that generates the corresponding summary information as output. Similarly, an example of such an encoder-decoder may be, for example, a GPT-3 based model, which supports multimodal mixtures of encoders and decoders to cover a wider range of downstream tasks. It will be appreciated that any other suitable encoder-decoder framework may be employed to build the summary generation model.
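Purely for illustration, the sketch below substitutes an off-the-shelf encoder-decoder summarizer for the pre-trained summary generation model described above; the library, checkpoint, and length parameters are assumptions, and the patent itself mentions a GPT-3 based model rather than the stand-in used here.

```python
from transformers import pipeline

# Off-the-shelf encoder-decoder summarizer standing in for the summary generation model.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_scene(subtitle_set: list[str]) -> str:
    """Concatenate the subtitle information of all shots in a scene and summarize it."""
    text = " ".join(subtitle_set)
    result = summarizer(text, max_length=60, min_length=10, do_sample=False)
    return result[0]["summary_text"]
```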
According to some embodiments of the present disclosure, step S562 of acquiring all the subtitle information in the shots included in the scene to form the subtitle information set may include: obtaining a plurality of video frames in the shots included in the scene; for each video frame of the plurality of video frames: determining a subtitle region in the video frame, extracting image features associated with the subtitle region, and decoding the image features to obtain a text recognition result for the subtitle region; and forming the subtitle information set based on the text recognition result of each video frame.
As described above, a plurality of video frames can be obtained by performing a framing operation on each shot. For example, each shot may be split into individual video frames using tools such as OpenCV or FFmpeg. It will be appreciated that, alternatively, other suitable decoders may be used for the framing operation, and the scope of the claimed subject matter is not limited in this respect.
According to some embodiments of the present disclosure, the subtitle region in each video frame may be detected using a deep-learning-based character detection model. Examples of character detection models include, but are not limited to, character detection models based on a convolutional neural network (CNN). By using a deep-learning-based character detection model, a more accurate subtitle-region recognition result can be obtained even if the video frame has a complex background.
In some examples, video frames meeting the input requirements of the character detection model may be obtained by preprocessing each video frame, and the preprocessed video frames may then be input into the character detection model to obtain the position of the subtitle in each video frame. Examples of preprocessing include, but are not limited to, cropping each video frame to meet the character detection model's size requirements for input video frames, improving sharpness, and so on. The processing speed of the character detection model can thus be increased. In the case of cropping the video frame, since the size of the video frame is reduced, the amount of data processed by the character detection model can be reduced and its processing time further shortened.
According to other embodiments of the present disclosure, the caption area in each video frame may also be detected using conventional optical character recognition techniques, and thus the caption area may be more easily recognized.
According to some embodiments of the present disclosure, after the subtitle region is detected, feature extraction may be performed on the image in the subtitle region, for example using a network structure from a convolutional neural network, and the extracted image features may be decoded using a decoder similar to that described above to obtain the text recognition result. Alternatively or additionally, the preprocessed video frame may be further cropped before feature extraction so that the cropped image includes only the detected subtitle region. In this way, the amount of data processed in feature extraction and in decoding the extracted features can be significantly reduced, so that the text recognition result for the subtitle region can be obtained quickly.
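As an illustrative sketch of the "detect subtitle region, then recognize its text" step, the code below crops a detected subtitle box and runs an off-the-shelf OCR engine on it; the bounding-box format and the use of pytesseract are assumptions, and in the embodiments above the recognition is instead performed by a trained feature extractor and decoder.

```python
import cv2
import pytesseract

def subtitle_text_from_frame(frame, subtitle_box) -> str:
    """Crop the detected subtitle region and run OCR on it.
    subtitle_box = (x, y, w, h) as produced by a separate text-detection step."""
    x, y, w, h = subtitle_box
    region = frame[y:y + h, x:x + w]
    gray = cv2.cvtColor(region, cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(gray).strip()
```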
According to other embodiments of the present disclosure, step S562 of acquiring all the subtitle information in the shots included in the scene to form the subtitle information set may include: extracting the audio portion of the video to be processed; and performing speech recognition on the audio sub-portion associated with the scene to obtain all the subtitle information in the shots included in the scene.
In some scenarios, the background in the video frames may be too complex, or the subtitle fonts and effects may be elaborate. In such cases, by extracting the audio portion associated with a scene from the video and applying speech recognition, the speed of subtitle recognition can be increased, and the problems of low recognition efficiency and low accuracy that text recognition might cause can be avoided.
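As an illustrative sketch of this speech-recognition alternative, the code below transcribes the audio sub-portion associated with a scene with an off-the-shelf ASR model; the choice of openai-whisper and the model size are assumptions of the sketch and are not prescribed by the patent.

```python
import whisper  # openai-whisper; one possible ASR backend, not mandated by the patent

def subtitles_from_audio(audio_path: str) -> list[str]:
    """Transcribe the audio sub-portion associated with a scene into subtitle lines."""
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    return [segment["text"].strip() for segment in result["segments"]]
```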
Fig. 7 illustrates a flow chart of a video processing method according to some embodiments of the present disclosure. As shown in fig. 7, after the video to be processed is acquired, a shot segmentation operation 710 is performed on the video to be processed to identify the plurality of shots included in it. Scene boundary detection 720 (e.g., extracting the image features of each shot and determining the similarity between shots as described above) may then be performed on these shots to determine which of them are scene boundary shots, and scene classification 725 may be performed on each shot based on the scene boundary shots to obtain scene information. In parallel, shot-scale classification 730, camera-movement classification 740, focal-length classification 750, and shooting-angle classification 760 may be performed on these shots based on the shot-scale classification model, camera-movement classification model, focal-length classification model, and shooting-angle classification model to obtain the shot-scale information, camera-movement information, focal-length information, and shooting-angle information of each shot, respectively. In parallel, a corresponding text description 770 may be generated for each of the shots to obtain text description information. In parallel, text recognition 780 may also be performed on each of these shots, and all the recognized subtitle information in the shots is input as a subtitle information set to the summary generation model to perform summary generation 785, finally obtaining the summary information. Finally, based on the scene information, shot-scale information, camera-movement information, focal-length information, shooting-angle information, text description information, and summary information, shot-level analysis of the video to be processed is realized, so that a video shot script meeting industry standards can be generated.
It should be appreciated that while operations 710-780 are shown in fig. 7 as being performed on the video to be processed, this is merely a preferred embodiment of the present disclosure, and in actual practice only one or some of these operations may be performed. In addition, for purposes of illustration, operations 720-780 are shown as parallel operations, but this should not be understood as requiring that these operations be performed in parallel; they may also be performed in other orders. For example, these operations may be performed in the order of scene boundary detection 720, shot-scale classification 730, camera-movement classification 740, focal-length classification 750, shooting-angle classification 760, generation of text descriptions 770, and text recognition 780.
It should also be appreciated that operations 710-780 shown in FIG. 7 may be similar to steps S110-S140, S210-S250, S410-S450, and S510-S560 described with reference to FIGS. 1, 2, 4, and 5. Thus, certain operations, features, and advantages are not described in detail herein for the sake of brevity.
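Purely as an illustration of how the outputs of Fig. 7 could be assembled into a shot script, the sketch below groups shot-level information by scene and attaches the scene summary; all field names and the output format are assumptions, since the patent does not prescribe a concrete data structure.

```python
from dataclasses import dataclass, field

@dataclass
class ShotEntry:
    """One row of the generated shot script (field names are illustrative only)."""
    scene: str
    shot_scale: str
    camera_movement: str
    focal_length: str
    shooting_angle: str
    description: str
    subtitles: list[str] = field(default_factory=list)

def build_shot_script(shot_entries: list[ShotEntry], scene_summaries: dict[str, str]) -> list[dict]:
    """Attach the per-scene summary to each shot's information, mirroring the pipeline of Fig. 7."""
    script = []
    for entry in shot_entries:
        script.append({
            "scene": entry.scene,
            "summary": scene_summaries.get(entry.scene, ""),
            "shot_scale": entry.shot_scale,
            "camera_movement": entry.camera_movement,
            "focal_length": entry.focal_length,
            "shooting_angle": entry.shooting_angle,
            "description": entry.description,
            "subtitles": entry.subtitles,
        })
    return script
```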
Fig. 8 shows a block diagram of a video processing apparatus 800 according to an embodiment of the present disclosure. As shown in fig. 8, the apparatus 800 may include: an acquisition module 810 configured to acquire a video to be processed; a shot segmentation module 820 configured to perform a shot segmentation operation on the video to be processed to identify a plurality of shots included in the video to be processed; a shot information determination module 830 configured to determine, for each of the plurality of shots, one or more pieces of shot information corresponding to the shot based on one or more classification models, wherein the one or more classification models are pre-trained; and a shot script generation module 840 configured to generate a shot script corresponding to the video to be processed based on the one or more pieces of shot information.
According to some embodiments of the present disclosure, the shot segmentation module 820 may include: a module configured to frame the video to be processed to obtain a plurality of video frames; a module configured to determine, for each video frame of the plurality of video frames, a first similarity between the video frame and an adjacent video frame; and a module configured to determine, based on the first similarity, whether the video frame is a boundary video frame of the video to be processed, the boundary video frame being used to distinguish adjacent shots among the plurality of shots.
According to some embodiments of the present disclosure, the video processing apparatus 800 may further include: a scene recognition module configured to identify the scene to which each of the plurality of shots belongs.
According to some embodiments of the present disclosure, the scene recognition module may include: a module configured to extract, for each shot, the image features of the shot; a module configured to determine a plurality of second similarities between the image features of the shot and the image features of each of the shots; a module configured to determine, based on the plurality of second similarities, whether the shot is a scene boundary shot of the video to be processed, the scene boundary shot being used to distinguish adjacent scenes in the video to be processed; and a module configured to determine a plurality of scenes included in the video to be processed according to the scene boundary shots and classify each shot accordingly.
According to some embodiments of the present disclosure, the determine-split-mirror information module 830 may include: a module configured to obtain a first plurality of shot samples and their predetermined pieces of split-mirror information; a module configured to establish the one or more classification models with the first plurality of shot samples as training inputs and the predetermined pieces of split-mirror information as training outputs; and a module configured to categorize each shot based on the one or more classification models, the categorization of the shot indicating the one or more pieces of split-mirror information corresponding to the shot.
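By way of illustration, one very simple realization of such a classification model is a linear classifier trained on per-shot features. The use of scikit-learn's LogisticRegression and the example label names are assumptions for the sketch, not the specific model of this disclosure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_shot_classifier(sample_features: np.ndarray,
                          sample_labels: list[str]) -> LogisticRegression:
    """Train one classification model from the first plurality of shot samples:
    features are the training input, predetermined shot labels the training output."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(sample_features, sample_labels)
    return clf

# One model per kind of split-mirror information, e.g. shot scale, camera movement,
# focal length, shooting angle, scene type (model names here are illustrative).
def classify_shot(models: dict, shot_feature: np.ndarray) -> dict:
    return {name: m.predict(shot_feature.reshape(1, -1))[0] for name, m in models.items()}
```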
According to some embodiments of the present disclosure, the one or more classification models include one or more of: a shot scale classification model, a camera movement classification model, a focal length classification model, a shooting angle classification model, and a scene classification model.
According to some embodiments of the present disclosure, the video processing apparatus 800 may further include a textual description generation module configured to generate, for each of the plurality of shots, a textual description corresponding to the shot.
According to some embodiments of the present disclosure, the textual description generation module may include: a module configured to obtain a second plurality of shot samples, each shot sample of the second plurality of shot samples having a corresponding shot sample textual description; a module configured to build a shot description generation model with the second plurality of shot samples as training inputs and the corresponding shot sample textual descriptions as training outputs; and a module configured to process, for each of the plurality of shots, the shot as input to the shot description generation model to generate the textual description corresponding to the shot, wherein the shot description generation model includes an encoder that receives each shot as input and a decoder that generates the textual description corresponding to each shot as output.
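A minimal encoder-decoder skeleton of a shot description generation model might look like the PyTorch sketch below. The GRU-based architecture, dimensions, and vocabulary size are illustrative assumptions, not the specific model of this disclosure.

```python
import torch.nn as nn

class ShotDescriptionModel(nn.Module):
    """Minimal encoder-decoder sketch: a shot is given as a sequence of frame
    features; the decoder emits a token sequence describing the shot."""
    def __init__(self, feat_dim: int = 512, hidden: int = 256, vocab_size: int = 10000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, caption_tokens):
        # frame_feats: (B, num_frames, feat_dim); caption_tokens: (B, T)
        _, h = self.encoder(frame_feats)          # encode the shot
        emb = self.embed(caption_tokens)
        dec_out, _ = self.decoder(emb, h)         # decode conditioned on the shot
        return self.out(dec_out)                  # (B, T, vocab_size) token logits

# Training would minimize cross-entropy between the predicted tokens and the
# shot-sample textual descriptions used as training outputs.
```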
According to some embodiments of the present disclosure, the video processing apparatus 800 may further include a summary generation module configured to generate, for each of a plurality of scenes included in the video to be processed, summary information corresponding to the scene.
According to some embodiments of the present disclosure, the summary generation module may include: a building module configured to build a summary generation model; a subtitle information acquisition module configured to acquire all subtitle information in the shots included in the scene to form a subtitle information set; and a summary generation sub-module configured to process the subtitle information set as input to the summary generation model to generate the summary information corresponding to the scene.
According to some embodiments of the present disclosure, the building module may include: a module configured to obtain a third plurality of shot samples belonging to the same scene, each of the third plurality of shot samples having corresponding subtitle information; a module configured to obtain scene sample summary information corresponding to the same scene to which the third plurality of shot samples belong; and a module configured to build the pre-trained summary generation model with the corresponding subtitle information as training input and the scene sample summary information as training output, wherein the summary generation model includes an encoder that receives the subtitle information set as input and a decoder that generates the summary information as output.
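In practice, the summary generation model could be any text encoder-decoder. The sketch below uses a Hugging Face summarization pipeline as an illustrative stand-in; the checkpoint name is an assumption and is not part of this disclosure.

```python
from transformers import pipeline

# Illustrative only: any encoder-decoder summarizer could play the role of the
# summary generation model; the checkpoint name below is an assumption.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def summarize_scene(subtitle_set: list[str], max_length: int = 60) -> str:
    """Concatenate all subtitle information of the scene and generate its summary."""
    text = " ".join(subtitle_set)
    result = summarizer(text, max_length=max_length, min_length=10, do_sample=False)
    return result[0]["summary_text"]
```

Fine-tuning such a model on the third plurality of shot samples (subtitle sets as inputs, scene sample summaries as targets) would correspond to the training step described above.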
According to some embodiments of the present disclosure, the subtitle information acquisition module may include: a module configured to obtain a plurality of video frames in the shots included in the scene; a module configured to determine, for each video frame of the plurality of video frames, a caption area in the video frame, extract image features associated with the caption area, and decode the image features to obtain a text recognition result for the caption area; and a module configured to form the subtitle information set based on the text recognition results of the video frames.
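An illustrative sketch of this OCR-based subtitle extraction follows. Treating a fixed bottom strip of each frame as the caption area and using pytesseract for recognition are assumptions for the sketch; the disclosure leaves the caption-area determination and decoding method open.

```python
import cv2
import pytesseract

def extract_subtitles(frames: list, bottom_ratio: float = 0.2) -> list[str]:
    """For each video frame, crop an assumed caption area (bottom strip),
    run OCR on it, and collect the recognized text."""
    texts = []
    for frame in frames:
        h = frame.shape[0]
        caption_area = frame[int(h * (1 - bottom_ratio)):, :]        # bottom strip
        gray = cv2.cvtColor(caption_area, cv2.COLOR_BGR2GRAY)
        text = pytesseract.image_to_string(gray).strip()
        if text:
            texts.append(text)
    # drop consecutive duplicates before forming the subtitle information set
    subtitle_set = [t for i, t in enumerate(texts) if i == 0 or t != texts[i - 1]]
    return subtitle_set
```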
According to some embodiments of the present disclosure, the subtitle information acquisition module may include: a module configured to extract an audio portion from the video to be processed; and a module configured to perform speech recognition on the audio sub-portion of the audio portion that is associated with the scene, so as to obtain all subtitle information in the shots included in the scene.
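This speech-recognition variant could be sketched as follows. The use of ffmpeg for audio extraction and of the openai-whisper model as the recognizer are illustrative assumptions, not requirements of this disclosure.

```python
import subprocess
import whisper   # openai-whisper, used here only as an illustrative ASR backend

def transcribe_scene_audio(video_path: str, start_s: float, end_s: float) -> list[str]:
    """Extract the audio sub-portion associated with the scene and run speech
    recognition on it to obtain the scene's subtitle information."""
    wav_path = "scene_audio.wav"
    # cut [start_s, end_s) from the video's audio track with ffmpeg
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-ss", str(start_s),
                    "-to", str(end_s), "-vn", "-ac", "1", "-ar", "16000", wav_path],
                   check=True)
    model = whisper.load_model("base")
    result = model.transcribe(wav_path)
    return [seg["text"].strip() for seg in result["segments"]]
```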
It should be appreciated that the various modules 810-840 of the apparatus 800 shown in fig. 8 may correspond to the various steps S110-S140 in the method 100 described with reference to fig. 1. Thus, the operations, features, and advantages described above with respect to method 100 apply equally to apparatus 800 and the modules comprised thereby. For brevity, certain operations, features and advantages are not described in detail herein.
It should also be appreciated that various techniques may be described herein in the general context of software and hardware elements or program modules. The various modules described above with respect to Fig. 8 may be implemented in hardware or in hardware combined with software and/or firmware. For example, the modules may be implemented as computer program code/instructions configured to be executed by one or more processors and stored in a computer-readable storage medium. Alternatively, these modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the acquisition module 810, the split mirror operation module 820, the determine-split-mirror information module 830, and the generate shot script module 840 may be implemented together in a System on Chip (SoC). The SoC may include an integrated circuit chip comprising one or more components of a processor (e.g., a central processing unit (CPU), microcontroller, microprocessor, digital signal processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry, and may optionally execute received program code and/or include embedded firmware to perform functions.
According to another aspect of the present disclosure, there is also provided an electronic apparatus including: at least one processor; and at least one memory communicatively coupled to the at least one processor; wherein the at least one memory stores a computer program which, when executed by the at least one processor, implements the video processing method described above.
According to another aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the above-described video processing method.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the video processing method described above.
Referring to Fig. 9, a block diagram of an electronic device 900, which may serve as a server of the present disclosure and is an example of a hardware device that can be applied to aspects of the present disclosure, will now be described. Electronic devices may be various types of computer devices, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. An electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the electronic device 900 may include at least one processor 910, a working memory 920, an input unit 940, a display unit 950, a speaker 960, a storage unit 970, a communication unit 980, and other output units 990 that can communicate with each other through a system bus 930.
Processor 910 may be a single processing unit or multiple processing units, all of which may include a single or multiple computing units or multiple cores. The processor 910 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. The processor 910 may be configured to obtain and execute computer readable instructions stored in the working memory 920, the storage unit 970, or other computer readable media, such as program code of the operating system 920a, program code of the application program 920b, and the like.
Working memory 920 and storage unit 970 are examples of computer-readable storage media for storing instructions that are executed by processor 910 to implement the various functions described previously. Working memory 920 may include both volatile memory and nonvolatile memory (e.g., RAM, ROM, etc.). In addition, storage unit 970 may include hard disk drives, solid state drives, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CD, DVD), storage arrays, network attached storage, storage area networks, and the like. The working memory 920 and the storage unit 970 may both be referred to herein collectively as memory or computer-readable storage medium, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code that may be executed by the processor 910 as a particular machine configured to implement the operations and functions described in the examples herein.
The input unit 940 may be any type of device capable of inputting information to the electronic device 900. The input unit 940 may receive input digital or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output units may be any type of device capable of presenting information and may include, but are not limited to, the display unit 950, the speaker 960, and the other output units 990, which may include, but are not limited to, video/audio output terminals, vibrators, and/or printers. The communication unit 980 allows the electronic device 900 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and/or chipsets, such as Bluetooth™ devices, 802.11 devices, Wi-Fi devices, WiMAX devices, cellular communication devices, and/or the like.
The application program 920b in the working memory 920 may be loaded to perform the various methods and processes described above, e.g., steps S110-S140 in Fig. 1, steps S210-S250 in Fig. 2, steps S410-S450 in Fig. 4, and steps S510-S560 in Fig. 5. For example, in some embodiments, the various methods described above may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 970. In some embodiments, some or all of the computer program may be loaded and/or installed onto the electronic device 900 via the storage unit 970 and/or the communication unit 980. One or more of the steps of the methods 100, 200, 400, 500 described above may be performed when the computer program is loaded and executed by the processor 910. Alternatively, in other embodiments, the processor 910 may be configured to perform the methods 100, 200, 400, 500 in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that steps may be reordered, added, or deleted in the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present disclosure is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements. Furthermore, the steps may be performed in an order different from that described in the present disclosure. Further, the various elements of the embodiments or examples may be combined in various ways. It should be noted that, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (17)

1. A video processing method, comprising:
acquiring a video to be processed;
performing a mirror splitting operation on the video to be processed to identify a plurality of shots included in the video to be processed;
for each shot of the plurality of shots, determining one or more pieces of lens information corresponding to the shot based on one or more classification models, the one or more classification models being pre-trained; and
and generating a shot script corresponding to the video to be processed based on the one or more pieces of shot information.
2. The method of claim 1, wherein performing the mirror splitting operation on the video to be processed to identify the plurality of shots included in the video to be processed comprises:
framing the video to be processed to obtain a plurality of video frames;
for each video frame of the plurality of video frames, determining a first similarity between the video frame and an adjacent video frame; and
determining whether the video frame is a boundary video frame of the video to be processed based on the first similarity, wherein the boundary video frame is used for distinguishing adjacent shots in the plurality of shots.
3. The method of claim 1, further comprising: a scene to which each of the plurality of shots belongs is identified.
4. The method of claim 3, wherein identifying a scene to which each of the plurality of shots belongs comprises:
for each shot of the plurality of shots, extracting image features of the shot;
determining a plurality of second similarities between the image features of the shot and the image features of each of the other shots;
determining whether the shot is a scene boundary shot of the video to be processed based on the plurality of second similarities, wherein the scene boundary shot is used for distinguishing adjacent scenes in the video to be processed; and
determining, according to the scene boundary shots, a plurality of scenes included in the video to be processed, and classifying each shot.
5. The method of any of claims 1-4, wherein, for each of the plurality of shots, determining one or more pieces of mirror information corresponding to the shot based on one or more classification models comprises:
acquiring a first plurality of lens samples and a plurality of pieces of predetermined lens dividing information;
establishing the one or more classification models with the first plurality of shot samples as training inputs and the predetermined plurality of sub-mirror information as training outputs; and
classifying the shot based on the one or more classification models, wherein the classification of the shot indicates the one or more pieces of lens information corresponding to the shot.
6. The method of claim 5, wherein the one or more classification models comprise one or more of:
a shot scale classification model, a camera movement classification model, a focal length classification model, a shooting angle classification model, and a scene classification model.
7. The method of any of claims 1-4, further comprising:
For each of the plurality of shots, a textual description corresponding to the shot is generated.
8. The method of claim 7, wherein for each of the plurality of shots, generating a textual description corresponding to the shot comprises:
acquiring a second plurality of lens samples, each lens sample in the second plurality of lens samples having a corresponding lens sample text description;
establishing a shot description generation model by taking the second plurality of shot samples as training inputs and the corresponding shot sample word descriptions as training outputs; and
for each shot of the plurality of shots, processing the shot as input to the shot description generation model to generate the textual description corresponding to the shot,
the lens description generation model comprises an encoder for receiving each lens as input and a decoder for generating a text description corresponding to each lens as output.
9. The method of claim 3 or 4, further comprising:
generating, for each scene of a plurality of scenes included in the video to be processed, summary information corresponding to the scene.
10. The method of claim 9, wherein, for each of a plurality of scenes included in the video to be processed, generating summary information corresponding to the scene comprises:
building a summary generation model;
acquiring all subtitle information in the shots included in the scene to form a subtitle information set; and
processing the subtitle information set as input to the summary generation model to generate summary information corresponding to the scene.
11. The method of claim 10, wherein building a summary generation model comprises:
acquiring a third plurality of shot samples in the same scene, wherein each shot sample in the third plurality of shot samples has corresponding subtitle information;
acquiring scene sample summary information corresponding to the same scene to which the third plurality of shot samples belong; and
building the pre-trained summary generation model with the corresponding subtitle information as training input and the scene sample summary information as training output,
wherein the summary generation model comprises an encoder receiving the set of caption information as input and a decoder generating the summary information as output.
12. The method of claim 10 or 11, wherein acquiring all subtitle information in a shot included in the scene to form a set of subtitle information comprises:
obtaining a plurality of video frames in a shot included in the scene;
for each video frame of the plurality of video frames:
determining a caption area in the video frame;
extracting image features associated with the caption area; and
decoding the image features to obtain a text recognition result for the caption area; and
forming the subtitle information set based on the text recognition results of the video frames.
13. The method of claim 10 or 11, wherein acquiring all subtitle information in a shot included in the scene to form a set of subtitle information comprises:
extracting an audio part in the video to be processed; and
performing speech recognition on the audio sub-part of the audio part that is associated with the scene, to obtain all subtitle information in the shots included in the scene.
14. A video processing apparatus comprising:
the acquisition module is configured to acquire a video to be processed;
the mirror splitting operation module is configured to perform mirror splitting operation on the video to be processed so as to identify a plurality of lenses included in the video to be processed;
a determine-split-mirror information module configured to determine, for each of the plurality of shots, one or more pieces of split-mirror information corresponding to the shot based on one or more classification models, the one or more classification models being pre-trained; and
and the shot script generation module is configured to generate a shot script corresponding to the video to be processed based on the one or more pieces of shot information.
15. An electronic device, comprising:
at least one processor; and
at least one memory communicatively coupled to the at least one processor,
wherein the at least one memory stores a computer program that, when executed by the at least one processor, implements the method of any of claims 1-13.
16. A non-transitory computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the method of any of claims 1-13.
17. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-13.

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117156078A (en) * 2023-11-01 2023-12-01 腾讯科技(深圳)有限公司 Video data processing method and device, electronic equipment and storage medium
CN117156078B (en) * 2023-11-01 2024-02-02 腾讯科技(深圳)有限公司 Video data processing method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination