CN117579859A - Video processing method, device, equipment and readable storage medium - Google Patents
- Publication number
- CN117579859A (application number CN202311523011.3A)
- Authority
- CN
- China
- Prior art keywords
- moving object
- boundary
- key frame
- frame
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/23418—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/23424—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/2343—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
- H04N21/234345—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements the reformatting operation being performed only on part of the stream, e.g. a region of the image or a time segment
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/431—Generation of visual interfaces for content selection or interaction; Content or additional data rendering
- H04N21/4312—Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44016—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/440245—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display the reformatting operation being performed only on part of the stream, e.g. a region of the image or a time segment
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/478—Supplemental services, e.g. displaying phone caller identification, shopping application
- H04N21/4788—Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The application discloses a video processing method, device, equipment and readable storage medium, belonging to the technical field of multimedia. The method comprises the following steps: acquiring an initial video and processing the initial video to obtain a first key frame and a second key frame, wherein the first key frame and the second key frame contain a moving object; detecting the moving object in the first key frame and the second key frame by using a target detection model to obtain a target moving object and a first bounding box and a second bounding box of the target moving object; acquiring bullet screen information, and determining an adjustment area based on the first bounding box, the second bounding box and the bullet screen information; performing anti-occlusion adjustment on the bullet screen information in the adjustment area to obtain updated bullet screen information; and synthesizing the updated bullet screen information with the initial video to obtain a composite video. With the method and device, a more accurate adjustment area can be obtained, occlusion of the target moving object by bullet screens in the composite video is effectively reduced, and the viewing experience of users is improved.
Description
Technical Field
The embodiment of the application relates to the technical field of multimedia, in particular to a method, a device, equipment and a readable storage medium for video processing.
Background
With the development of multimedia technology and the popularization of online video platforms, the bullet screen (danmaku) has become increasingly popular with users as a novel mode of video interaction. While watching a video, users can interact in real time by sending bullet screens, which makes the video more engaging and interactive.
Disclosure of Invention
Embodiments of the present application provide a method, apparatus, device, and readable storage medium for video processing, which can be used to solve the problems in the related art. The technical scheme is as follows:
In one aspect, an embodiment of the present application provides a method of video processing, where the method includes: acquiring an initial video, and processing the initial video to obtain a first key frame and a second key frame, wherein the first key frame and the second key frame contain a moving object; detecting the moving object in the first key frame and the second key frame by using a target detection model to obtain a target moving object and a first bounding box and a second bounding box of the target moving object, wherein the first bounding box is the bounding box of the target moving object in the first key frame, and the second bounding box is the bounding box of the target moving object in the second key frame; acquiring bullet screen information, and determining an adjustment area based on the first bounding box, the second bounding box and the bullet screen information; performing anti-occlusion adjustment on the bullet screen information in the adjustment area to obtain updated bullet screen information; and synthesizing the updated bullet screen information with the initial video to obtain a composite video.
In another aspect, an embodiment of the present application provides an apparatus for video processing, including: a decoding module, configured to acquire an initial video and process the initial video to obtain a first key frame and a second key frame, wherein the first key frame and the second key frame contain a moving object; a detection module, configured to detect the moving object in the first key frame and the second key frame by using a target detection model to obtain a target moving object and a first bounding box and a second bounding box of the target moving object, wherein the first bounding box is the bounding box of the target moving object in the first key frame, and the second bounding box is the bounding box of the target moving object in the second key frame; a determining module, configured to acquire bullet screen information and determine an adjustment area based on the first bounding box, the second bounding box and the bullet screen information; an updating module, configured to perform anti-occlusion adjustment on the bullet screen information in the adjustment area to obtain updated bullet screen information; and a synthesis module, configured to synthesize the updated bullet screen information with the initial video to obtain a composite video.
In a possible implementation manner, the decoding module is configured to sample the initial video based on sampling parameters of a decoder to obtain a preprocessed video; acquire motion vector information of the moving object based on the initial video; decode the preprocessed video with the decoder based on the motion vector information to obtain a plurality of key frames and time information corresponding to the key frames; and detect the key frames based on the time information to obtain the first key frame and the second key frame, wherein the first key frame is the key frame in which the moving object is detected for the first time, and the second key frames are the key frames, other than the first key frame, that contain the moving object.
In a possible implementation manner, the detection module is configured to perform feature extraction on the first key frame and the second key frame based on the target detection model to obtain a first feature map and a second feature map; identify the moving object in the first feature map and the second feature map to obtain the target moving object; and determine the first bounding box of the target moving object in the first feature map and the second bounding box of the target moving object in the second feature map.
In a possible implementation manner, the detection module is configured to perform a first convolution process on the first key frame and the second key frame based on a first convolution kernel of the target detection model, so as to obtain a first sub-feature map and a second sub-feature map; and carrying out second convolution processing on the first sub-feature map and the second sub-feature map based on a second convolution kernel of the target detection model to obtain the first feature map and the second feature map.
In a possible implementation manner, the detection module is configured to identify the moving object in the first feature map and the second feature map, and obtain a first confidence coefficient of the moving object, where the first confidence coefficient is used to represent a probability that the moving object is the target moving object; and comparing the first confidence coefficient with a reference confidence coefficient, and determining the moving object as the target moving object if the first confidence coefficient is larger than or equal to the reference confidence coefficient.
In a possible implementation manner, the detection module is configured to obtain coordinates of the target moving object and generate a bounding box prediction result of the target moving object based on the coordinates, where the bounding box prediction result includes a plurality of prediction bounding boxes of the target moving object and second confidences corresponding to the respective prediction bounding boxes, and a second confidence is used to represent the probability that the target moving object is contained in the corresponding prediction bounding box; determine a reference bounding box based on the second confidences, where the reference bounding box is the prediction bounding box corresponding to the largest of the second confidences of the plurality of prediction bounding boxes; and calculate an intersection ratio of each prediction bounding box with the reference bounding box, and determine the first bounding box and the second bounding box of the target moving object from the prediction bounding boxes of the first feature map and the second feature map using the intersection ratio.
In a possible implementation manner, the determining module is configured to calculate a first overlapping area of the bullet screen information and the first bounding box, and a second overlapping area of the bullet screen information and the second bounding box; the adjustment region is determined based on the first and second overlap regions, the adjustment region including the first and second overlap regions.
In one possible implementation manner, the updating module is configured to generate bullet screen adjustment information based on the adjustment area and the bullet screen information, where the bullet screen adjustment information includes at least one of the initial coordinates of the bullet screen, the bullet screen movement speed, the bullet screen movement direction, the bullet screen transparency and the bullet screen character size; and perform anti-occlusion adjustment on the bullet screen in the adjustment area using the bullet screen adjustment information to obtain updated bullet screen information.
In another aspect, embodiments of the present application provide a computer device, where the computer device includes a processor and a memory, where the memory stores at least one program code, and the at least one program code is loaded and executed by the processor, so that the computer device implements a method of video processing as described in any one of the above.
In another aspect, there is provided a computer readable storage medium having stored therein at least one program code loaded and executed by a processor to cause a computer to implement a method of video processing as described in any of the above.
In another aspect, a computer program or computer program product is provided, in which at least one computer instruction is stored, the at least one computer instruction being loaded and executed by a processor, to cause the computer to implement a method of any of the above.
The technical scheme provided by the embodiment of the application at least brings the following beneficial effects:
In the embodiments of the present application, the first bounding box and the second bounding box of the target moving object are determined in the first key frame and the second key frame, and the adjustment area of the bullet screen information is determined based on the first bounding box and the second bounding box. This improves, to a certain extent, the accuracy of the bullet screen adjustment area while the target moving object is moving, effectively reduces occlusion of the target moving object by bullet screens in the composite video, and improves the viewing experience of users.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an implementation environment of a video processing method according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of video processing provided by an embodiment of the present application;
FIG. 3 is a flow chart of obtaining a first key frame and a second key frame provided in an embodiment of the present application;
FIG. 4 is a flow chart of obtaining a first bounding box and a second bounding box based on an object detection model provided by an embodiment of the present application;
FIG. 5 is a flowchart of identifying a target mobile object according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an object recognition result based on a feature map according to an embodiment of the present application;
FIG. 7 is a flowchart of a bounding box for obtaining a target mobile object provided by an embodiment of the present application;
FIG. 8 is a flow chart for determining an adjustment region provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a terminal device provided in an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like herein are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
Fig. 1 is a schematic diagram of an implementation environment of a video processing method according to an embodiment of the present application. As shown in fig. 1, the implementation environment includes a computer device 101, and the computer device 101 is configured to perform the video processing method provided by the embodiments of the present application.
Alternatively, the computer device 101 may be a terminal device, and the terminal device may be any electronic device product that can perform man-machine interaction with a user through one or more manners of a keyboard, a touchpad, a remote controller, a voice interaction or a handwriting device. For example, a PC (Personal Computer ), a mobile phone, a smart phone, a PDA (Personal Digital Assistant ), a wearable device, a PPC (Pocket PC), a tablet computer, or the like.
The terminal device may refer broadly to one of a plurality of terminal devices, and the present embodiment is illustrated by way of example only. Those skilled in the art will appreciate that the number of terminal devices described above may be greater or lesser. For example, the number of the terminal devices may be only one, or the number of the terminal devices may be tens or hundreds, or more, and the number and the device types of the terminal devices are not limited in the embodiment of the present application.
Alternatively, the computer device 101 may be a server, where the server may be a single server, a server cluster formed by multiple servers, a cloud computing platform, or a virtualization center, which is not limited in the embodiments of the present application. The server has a data receiving function, a data processing function, and a data transmitting function. Of course, the server may also have other functions, which are not limited in the embodiments of the present application.
It will be appreciated by those skilled in the art that the above terminal devices and servers are merely illustrative, and that other terminal devices or servers, whether existing now or emerging in the future, that are applicable to the present application are also intended to fall within the protection scope of the present application and are incorporated herein by reference.
Artificial intelligence (Artificial Intelligence, AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, pre-trained model technology, operation/interaction systems, and mechatronics. A pre-trained model, also called a large model or foundation model, can be widely applied, after fine-tuning, to downstream tasks in all major directions of artificial intelligence. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is the science of studying how to make a machine "see"; more specifically, it replaces human eyes with cameras and computers to perform machine vision tasks such as recognition and measurement on a target, and further performs graphic processing so that the result is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Large-model technology has brought important changes to the development of computer vision; pre-trained models in the vision field, such as the Swin Transformer (shifted-window transformer), ViT (Vision Transformer), V-MoE (Vision Mixture-of-Experts) and MAE (Masked Autoencoder), can be quickly and widely applied to specific downstream tasks through fine-tuning. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D (three-dimensional) technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition technologies such as face recognition and fingerprint recognition.
The embodiment of the present application provides a method for video processing, which may be applied to the implementation environment shown in fig. 1, taking the flowchart of the method for video processing provided in the embodiment of the present application shown in fig. 2 as an example, where the method may be executed by the computer device 101 in fig. 1. As shown in fig. 2, the method includes the following steps 110 to 150.
In step 110, an initial video is acquired, and the initial video is processed to obtain a first key frame and a second key frame, where the first key frame and the second key frame include a moving object.
In the exemplary embodiment of the present application, the initial video may be a video acquired from a video player or from the video playing software of a browser, or may be a directly shot video. The initial video is an encoded video file, and may include the video file to be processed and may further include information related to the initial video, such as a video topic, video tags and a video summary. Illustratively, the initial video may be a sports event video, a live-streaming video, a film or television drama video, a variety show video, an advertising video, or the like. It should be noted that the source and content of the initial video are only exemplary here, and the application is not limited thereto.
After the initial video is acquired, the initial video may be decoded using a video processing tool to obtain key frames, which may include the first key frame and the second key frame. A key frame is an I frame (intra-coded frame) in the initial video and contains complete image information; it can be decoded without depending on other frames during decoding, and can be used as a reference point for fast seeking and playback when the video is played.
Fig. 3 is a flowchart of obtaining a first key frame and a second key frame according to an embodiment of the present application. As shown in fig. 3, obtaining the first key frame and the second key frame may include steps 111 through 114.
In step 111, the initial video is sampled based on the sampling parameters of the decoder, resulting in a preprocessed video.
Illustratively, before decoding the initial video, sampling parameters of the decoder may be set based on the size of the initial video file and the playback duration of the initial video, and the sampling parameters may include a downsampling ratio. For example, the downsampling ratio of the decoder may be set to 0.5, and the initial video is processed with this downsampling ratio to obtain a preprocessed video whose resolution is lower than that of the initial video; when the downsampling ratio is 0.5, the resolution of the preprocessed video is half the resolution of the initial video.
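For illustration only, this preprocessing step can be sketched as follows, assuming OpenCV is used as the decoder front end; the function name, parameters and the 0.5 ratio are assumptions mirroring the example above, not the patented implementation.

```python
import cv2

def preprocess_video(path: str, downsample_ratio: float = 0.5) -> list:
    """Decode the initial video and return downscaled frames (the preprocessed video)."""
    capture = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        # Reduce resolution by the downsampling ratio (0.5 -> half the width and height).
        frame = cv2.resize(frame, None, fx=downsample_ratio, fy=downsample_ratio,
                           interpolation=cv2.INTER_AREA)
        frames.append(frame)
    capture.release()
    return frames
```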
In step 112, motion vector information of the moving object is acquired based on the initial video.
For example, the motion vector information of the moving object generated when the initial video was encoded may be acquired, where the motion vector information is mainly used to predict the position of the moving object in subsequent key frames. For example, the motion vector information is the displacement between a pixel block in a key frame following the current key frame and the corresponding pixel block in the current key frame. The motion vector information of the moving object can be obtained as follows: first, a key frame is divided into a plurality of pixel blocks, and the pixel blocks where the moving object is located in the current key frame, together with their position information, are determined; then the pixel blocks are matched across different key frames based on a matching algorithm, the corresponding pixel blocks of the moving object in the other key frames and their position information are determined, and the motion vector information is determined from the position information of the pixel blocks. The motion vector information may be represented by the horizontal component (dx) and the vertical component (dy) of the movement trajectory of the moving object, or by the direction and distance of the movement trajectory of the moving object.
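As a hedged illustration of the block-matching idea described above (not the patent's own implementation), the following sketch estimates a motion vector (dx, dy) for one pixel block between two key frames; the block size, search range and function name are assumptions.

```python
import numpy as np

def match_block(cur_frame: np.ndarray, next_frame: np.ndarray,
                x: int, y: int, block: int = 16, search: int = 8) -> tuple:
    """Return the (dx, dy) that best matches cur_frame's block at (x, y) in next_frame."""
    ref = cur_frame[y:y + block, x:x + block].astype(np.int32)
    best_cost, best_vector = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + block > next_frame.shape[0] or xx + block > next_frame.shape[1]:
                continue
            candidate = next_frame[yy:yy + block, xx:xx + block].astype(np.int32)
            cost = np.abs(ref - candidate).sum()  # sum of absolute differences
            if best_cost is None or cost < best_cost:
                best_cost, best_vector = cost, (dx, dy)
    return best_vector
```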
In step 113, based on the motion vector information, the pre-processed video is decoded by a decoder to obtain a plurality of key frames and time information corresponding to the key frames.
The present application is described by taking, as an example, motion vector information represented by the horizontal component (dx) and the vertical component (dy) of the moving object. During decoding, the size of the pixel block, that is, the pixel block size used in the motion vector information, is first determined. Different video coding standards support different pixel block sizes; taking the highly compressed digital video codec standard H.264 as an example, pixel block sizes such as 4×4, 8×8 and 16×16 are supported.
After determining the size of the pixel block, the moving object is motion estimated and motion compensated using the horizontal component information (dx) and the vertical component information (dy) to determine the position of the moving object in the decoder. And then decoding the preprocessed video based on the position of the moving object in the decoder to obtain a plurality of key frames and time information corresponding to the key frames. The time information corresponding to the key frames can be obtained based on the video playing sequence.
For example, after the motion vector information of the pixel blocks is obtained, the pixel blocks may be preliminarily screened based on the motion vector information, to remove pixel blocks that contain unimportant information in the key frame or whose motion is too small to be noticed by the naked eye. For example, a filtering threshold for the motion vector information may be set, the motion vector information of each pixel block is compared with the filtering threshold, and pixel blocks whose motion vector information is less than the filtering threshold are removed. Preliminarily screening the pixel blocks can, to a certain extent, reduce the amount of data processed in subsequent steps and improve data processing efficiency.
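A minimal sketch of this preliminary screening, assuming the motion magnitude is compared against a filtering threshold; the threshold value and data layout are illustrative assumptions.

```python
import math

def screen_blocks(blocks_with_mv: list, threshold: float = 2.0) -> list:
    """blocks_with_mv: list of ((x, y), (dx, dy)) tuples; keep blocks with noticeable motion."""
    kept = []
    for position, (dx, dy) in blocks_with_mv:
        magnitude = math.hypot(dx, dy)  # length of the motion vector
        if magnitude >= threshold:
            kept.append((position, (dx, dy)))
    return kept
```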
In step 114, the key frames are detected based on the time information, so as to obtain a first key frame and a second key frame, wherein the first key frame is the key frame which is detected for the first time and contains the moving object, and the second key frame is the key frame which contains the moving object except the first key frame.
For example, after the key frames and their corresponding time information are obtained, the key frames may be detected based on the time information to determine whether they contain the moving object. The key frame in which the moving object is detected for the first time is taken as the first key frame; the remaining key frames are detected based on the motion vector information, and the key frames, other than the first key frame, that are detected to contain the moving object are taken as second key frames.
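The classification of decoded key frames into the first key frame and second key frames can be sketched as follows; the detection callback and data layout are hypothetical.

```python
def split_key_frames(key_frames: list, contains_moving_object) -> tuple:
    """key_frames: list of (timestamp, frame); contains_moving_object: frame -> bool."""
    first_key_frame, second_key_frames = None, []
    for timestamp, frame in sorted(key_frames, key=lambda item: item[0]):
        if not contains_moving_object(frame):
            continue
        if first_key_frame is None:
            first_key_frame = (timestamp, frame)   # first frame found to contain the moving object
        else:
            second_key_frames.append((timestamp, frame))
    return first_key_frame, second_key_frames
```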
According to the method and the device for detecting the moving object, the key frames are obtained through decoding processing of the initial video, and in the decoding process, the position of the moving object in the second key frames is predicted by using the first key frames and the motion vector information, so that the calculated amount of target detection can be reduced to a certain extent.
In step 120, the moving object is detected in the first key frame and the second key frame by using the target detection model, so as to obtain a target moving object and a first bounding box and a second bounding box of the target moving object, where the first bounding box is the bounding box of the target moving object in the first key frame, and the second bounding box is the bounding box of the target moving object in the second key frame.
Fig. 4 is a flowchart of obtaining a first bounding box and a second bounding box based on an object detection model according to an embodiment of the present application. As shown in fig. 4, obtaining the first bounding box and the second bounding box based on the object detection model may include steps 121 through 123.
In step 121, feature extraction is performed on the first key frame and the second key frame based on the target detection model, so as to obtain a first feature map and a second feature map.
In an exemplary embodiment of the present application, the target detection model may be a lightweight convolutional neural network model that has completed training, such as a MobileNet model. Before the first key frame and the second key frame are processed by using the target detection model, image preprocessing can be performed on the first key frame and the second key frame, and preprocessing feature maps corresponding to the first key frame and the second key frame are obtained. The image preprocessing may include resizing, formatting, normalizing pixel values, and the like, for the first key frame and the second key frame. And inputting the preprocessing feature maps corresponding to the first key frame and the second key frame into the target detection model.
Illustratively, based on a first convolution kernel of the target detection model, a first convolution process is performed on the first key frame and the second key frame to obtain a first sub-feature map and a second sub-feature map. Optionally, the first convolution kernel has a depth of 1. For example, the preprocessed feature maps of the first key frame and the second key frame have a size of H×W×C, where H, W and C respectively represent the height, width and number of channels of the preprocessed feature maps, and H, W and C are positive integers. Then, a convolution operation is performed on each channel of the first key frame and the second key frame with the first convolution kernel to obtain the first sub-feature map and the second sub-feature map. The first convolution kernel may be k×k×1, i.e., the width and height of the first convolution kernel are k and its depth is 1, where 0 < k < H, 0 < k < W, and k is an integer. The computational cost of the first convolution is H×W×C×k×k. By performing the first convolution processing on the preprocessed feature maps of the first key frame and the second key frame, the spatial information of each channel in the preprocessed feature maps can be extracted, without mixing information across different channels.
Then, based on a second convolution kernel of the target detection model, a second convolution process is performed on the first sub-feature map and the second sub-feature map to obtain the first feature map and the second feature map. Optionally, the width and height of the second convolution kernel are 1. The first sub-feature map is subjected to the second convolution to obtain the first feature map, and the second sub-feature map is subjected to the second convolution to obtain the second feature map. For example, a second convolution kernel of size 1×1×C is used to perform the second convolution operation on the first sub-feature map and the second sub-feature map, where the width and height of the second convolution kernel are 1, the number of input channels of the second convolution is C and the number of output channels is C', and the computational cost of the second convolution is H×W×C×C'. By performing the second convolution processing on the first sub-feature map and the second sub-feature map, the information contained in different channels of the first sub-feature map and the second sub-feature map can be mixed, while the spatial information is preserved.
According to the exemplary embodiment of the present application, a first convolution process and a second convolution process are performed on the first key frame and the second key frame. Optionally, the depth of the first convolution kernel is 1 and the width and height of the second convolution kernel are 1, so that the computational cost of the convolutions can be effectively reduced while feature extraction from the first key frame and the second key frame is still ensured, improving feature-extraction efficiency. Moreover, performing feature extraction on the first key frame and the second key frame based on a lightweight convolution model can reduce the memory requirement of the feature-extraction process.
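The depthwise-separable structure described above (a MobileNet-style building block) can be sketched in PyTorch as follows; this is an illustrative sketch rather than the patent's exact model, and the channel counts are assumed.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, k: int = 3):
        super().__init__()
        # First convolution: a k x k x 1 kernel per channel (groups=in_channels), cost ~ H*W*C*k*k.
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=k,
                                   padding=k // 2, groups=in_channels, bias=False)
        # Second convolution: 1 x 1 across channels, cost ~ H*W*C*C'.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# Example: a feature map with C = 32 channels mapped to C' = 64 channels.
feature_map = DepthwiseSeparableConv(32, 64)(torch.randn(1, 32, 224, 224))
```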
In step 122, the moving object in the first feature map and the second feature map is identified, so as to obtain a target moving object.
For example, after the first feature map and the second feature map are obtained, the moving objects in the first feature map and the second feature map may be identified. The target moving object is the object that should not be occluded by bullet screens in the processed video. Taking a football match as an example, the moving objects may include players, the football, referees, spectators, photographers and the like; when watching a football match video in a video player, users mostly focus on the players, the football and the referees. Therefore, the players, the football and the referees can be regarded as target moving objects, and when determining the target moving objects, the players, the football and the referees need to be identified in the first feature map and the second feature map.
Fig. 5 is a flowchart of identifying a target mobile object according to an embodiment of the present application. As shown in fig. 5, identifying the target mobile object may include step 1221 and step 1222.
In step 1221, the moving object in the first feature map and the second feature map is identified, so as to obtain a first confidence coefficient of the moving object, where the first confidence coefficient is used to represent a probability that the moving object is a target moving object.
Taking a football match as an example, the target moving objects are set as players, referees and the football, and all moving objects in the first feature map and the second feature map are identified using the target detection model to obtain an identification result, where the identification result may include identification information of each moving object and a first confidence corresponding to the identification information.
Fig. 6 is a schematic diagram of an object recognition result based on a feature map according to an embodiment of the present application. As shown in fig. 6, the moving object in the feature map is identified by using the object detection model, so as to obtain an identification result of the moving object. For example, the identification information of the moving object 1 is a player, and the corresponding first confidence is 87%; the identification information of the moving object 2 is a player, and the corresponding first confidence coefficient is 90%; the identification information of the moving object 3 is a player, and the corresponding first confidence coefficient is 95%; the identification information of the moving object 4 is a player, and the corresponding first confidence is 67%; the identification information of the moving object 5 is a player, and the corresponding first confidence coefficient is 97%; the identification information of the moving object 8 is a player, and the corresponding first confidence coefficient is 83%; the identification information of the moving object 6 is football, and the corresponding first confidence coefficient is 99%; the identification information of the mobile object 7 comprises three types, wherein the first type of identification information is a player, the corresponding first confidence coefficient is 35%, the second type of identification information is a referee, the corresponding first confidence coefficient is 40%, the third type of identification information is a spectator, and the corresponding first confidence coefficient is 25%.
In step 1222, the first confidence level is compared to the reference confidence level, and if the first confidence level is greater than or equal to the reference confidence level, the mobile object is determined to be the target mobile object.
For example, a reference confidence corresponding to the target moving object may be set in the target detection model in advance; the first confidences corresponding to all the identification information of each moving object in the first feature map and the second feature map are respectively compared with the reference confidence, and if a first confidence corresponding to the identification information of a moving object is greater than or equal to the reference confidence, the moving object is determined to be a target moving object. The reference confidence may be set based on experience or the implementation scenario. Taking fig. 6 as an example, the reference confidence is set to 80%; the first confidences corresponding to the identification information "player" or "football" of moving object 1, moving object 2, moving object 3, moving object 5, moving object 6 and moving object 8 are greater than the reference confidence, so these moving objects are determined to be target moving objects. The first confidences of moving object 4 and moving object 7 are smaller than the reference confidence, so moving object 4 and moving object 7 are determined not to be target moving objects.
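A minimal sketch of this confidence comparison, with a 0.8 threshold mirroring the 80% reference confidence in the example; the data layout and names are assumptions.

```python
def select_target_objects(detections: list, reference_confidence: float = 0.8) -> list:
    """detections: list of dicts such as {"object_id": 3, "label": "player", "confidence": 0.95}."""
    targets = []
    for detection in detections:
        # Keep a moving object only if its first confidence reaches the reference confidence.
        if detection["confidence"] >= reference_confidence:
            targets.append(detection)
    return targets
```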
In step 123, a first bounding box of the target mobile object in the first feature map and a second bounding box of the target mobile object in the second feature map are determined.
For example, an additional convolution layer may be provided in the target detection model, with which a bounding box is generated for the target moving object. Fig. 7 is a flowchart of obtaining a bounding box of a target moving object according to an embodiment of the present application. As shown in fig. 7, obtaining the bounding box of the target moving object may include steps 1231 to 1233.
In step 1231, coordinates of the target mobile object are obtained, and a boundary box prediction result of the target mobile object is generated based on the coordinates, wherein the boundary box prediction result includes a plurality of prediction boundary boxes of the target mobile object and a second confidence corresponding to the prediction boundary boxes, and the second confidence is used for representing a probability that the target mobile object is included in the prediction boundary boxes.
For example, after determining the target moving object, coordinates of the target moving object may be acquired. The coordinates of the target moving object may be relative coordinates or absolute coordinates. If the coordinates of the target moving object are relative coordinates, the relative coordinates of the target moving object can be converted into absolute coordinates in the subsequent decoding process. The present application describes an example in which the coordinates of the target object are absolute coordinates, and the absolute coordinates of the target moving object may be the coordinates of the target moving object with respect to the decoder.
After the coordinates of the target moving object are determined, a bounding box prediction result may be generated based on the coordinates of the target moving object using a target detection algorithm, where the bounding box prediction result includes a plurality of prediction bounding boxes of the target moving object and the second confidences corresponding to the prediction bounding boxes; the plurality of prediction bounding boxes may have different aspect ratios. Illustratively, the target detection algorithm may include one or more of a KCF (Kernelized Correlation Filters) algorithm and a MIL (Multiple Instance Learning) algorithm.
In step 1232, a reference bounding box is determined based on the second confidence levels, the reference bounding box being a prediction bounding box corresponding to a second confidence level that is the largest of the second confidence levels corresponding to the plurality of prediction bounding boxes.
For example, among the second confidences corresponding to the plurality of prediction bounding boxes of the target moving object, the prediction bounding box corresponding to the largest second confidence is selected as the reference bounding box. Selecting the prediction bounding box with the largest second confidence as the reference bounding box ensures, to a certain extent, the degree of matching between the reference bounding box and the remaining prediction bounding boxes, so that a more accurate detection result can be obtained in the subsequent target detection process.
In step 1233, the intersection ratio of each prediction bounding box with the reference bounding box is calculated, and the first bounding box and the second bounding box of the target moving object are determined from the prediction bounding boxes of the first feature map and the second feature map using the intersection ratio.
Taking the first feature map as an example, the intersection ratio (intersection over union) of each prediction bounding box and the reference bounding box is calculated, i.e., the ratio of the area of their intersection to the area of their union; the prediction bounding boxes whose intersection ratio is larger than the bounding box threshold are deleted, and the above steps are repeated until the final first bounding box is obtained.
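An illustrative sketch of this intersection-ratio suppression step, with boxes given as (x1, y1, x2, y2) tuples; the 0.5 threshold and helper names are assumptions.

```python
def intersection_ratio(box_a: tuple, box_b: tuple) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def suppress(prediction_boxes: list, reference_box: tuple, threshold: float = 0.5) -> list:
    """Delete prediction boxes whose intersection ratio with the reference box exceeds the threshold."""
    return [box for box in prediction_boxes
            if intersection_ratio(box, reference_box) <= threshold]
```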
When the second bounding box is obtained in the second feature map, the position of the target moving object in the second feature map may be located with the aid of the pixel block where the target moving object lies and the motion vector information corresponding to that pixel block, and the second bounding box of the target moving object is then determined based on that position. The step of determining the second bounding box of the target moving object in the second feature map is similar to the step of acquiring the first bounding box in the first feature map and will not be described in detail here. During target detection, using the motion vector information to assist in locating the target moving object in each feature map can improve the efficiency of target detection.
It should be noted that, the decoding process and the target detection of the initial video may be performed simultaneously. For example, in the video decoding process, a first key frame and a second key frame may be determined from key frames already decoded, and the first key frame and the second key frame are processed by using the object detection model to obtain a first bounding box and a second bounding box. The video decoding step and the target detection step can be performed simultaneously, so that the video processing efficiency is improved to a certain extent, and the video processing time is saved.
In step 130, bullet screen information is acquired and an adjustment area is determined based on the first bounding box, the second bounding box, and the bullet screen information.
In an exemplary embodiment of the present application, after the first bounding box and the second bounding box are obtained, coordinate parameters of the first bounding box and the second bounding box of each target moving object may also be obtained, where the coordinate parameters may include upper left corner coordinates and lower right corner coordinates of the first bounding box and the second bounding box. Meanwhile, bullet screen information is acquired, wherein the bullet screen information can comprise one or more of initial coordinates of a bullet screen, movement speed of the bullet screen, movement direction of the bullet screen, content of the bullet screen and character size of the bullet screen.
Based on the bullet screen information (the initial coordinates of the bullet screen, its movement speed, its movement direction, its content, its character size, and the like), it is calculated whether the bullet screen intersects the first bounding box and/or the second bounding box of the target moving object during its movement. If they intersect, the adjustment area of the bullet screen needs to be determined and the bullet screen needs to be adjusted within that adjustment area.
Fig. 8 is a flowchart for determining an adjustment area according to an embodiment of the present application. As shown in fig. 8, determining the adjustment region may include step 131 and step 132.
In step 131, a first overlapping area of the bullet screen information and the first bounding box and a second overlapping area of the bullet screen information and the second bounding box are calculated.
In an exemplary embodiment of the present application, the moving track of each bullet screen may be drawn based on the bullet screen information; for example, the moving track of the bullet screen may be drawn as a moving rectangle. A first overlap region between the moving track of the bullet screen and the first bounding box, and a second overlap region between the moving track and the second bounding box, are then calculated. When calculating the second overlap region, if the bullet screen content is long or the bullet screen moves slowly, the moving track of the bullet screen may span at least one second key frame; the number of second key frames to be processed therefore needs to be confirmed from the moving track, and the second overlap region between the moving track and the second bounding box in each such second key frame needs to be determined.
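A minimal sketch of the overlap computation, assuming both the bullet screen's swept track and the bounding box are axis-aligned rectangles in pixel coordinates and that the bullet screen moves horizontally from right to left; the rectangle construction and the helper names are illustrative simplifications, not this application's exact formulas.

```python
def track_rectangle(x0, y0, speed, text_width, text_height, duration):
    """Swept rectangle covered by a bullet screen that moves horizontally from
    right to left at `speed` pixels per second over `duration` seconds."""
    x_end = x0 - speed * duration
    return (min(x0, x_end), y0, max(x0, x_end) + text_width, y0 + text_height)

def overlap_region(track_rect, bounding_box):
    """Intersection rectangle of the swept track and a bounding box, or None if disjoint."""
    x1, y1 = max(track_rect[0], bounding_box[0]), max(track_rect[1], bounding_box[1])
    x2, y2 = min(track_rect[2], bounding_box[2]), min(track_rect[3], bounding_box[3])
    if x2 <= x1 or y2 <= y1:
        return None
    return (x1, y1, x2, y2)
```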
In another exemplary embodiment of the present application, the bullet screen setting information of the user may also be acquired and used to determine the bullet screen information, where the bullet screen information includes a bullet screen display area. Illustratively, the user may set, in the video playing software, the proportion of the screen occupied by bullet screens, which determines the bullet screen display area. The first overlap region may then be the intersection of the first bounding box and the bullet screen display area, and the second overlap region may be the intersection of the second bounding box and the bullet screen display area.
In step 132, a target adjustment region is determined based on the first and second overlap regions, the target adjustment region including the first and second overlap regions.
Illustratively, after the first overlapping region and the second overlapping region are obtained, a union region of the first overlapping region and the second overlapping region is taken as a target adjustment region, where the target adjustment region may be located in at least one key frame.
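Continuing the rectangle assumptions above, the sketch below forms the target adjustment region from the collected overlap regions. Representing the union as the smallest enclosing rectangle is one possible reading of "union region" and is an assumption made here for simplicity.

```python
def target_adjustment_region(overlap_rects):
    """Smallest axis-aligned rectangle containing every overlap region
    (overlap regions may come from the first key frame and one or more second key frames)."""
    rects = [r for r in overlap_rects if r is not None]
    if not rects:
        return None
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))
```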
In step 140, the bullet screen information in the adjustment area is subjected to preventive shielding adjustment to obtain updated bullet screen information.
In an exemplary embodiment of the present application, when the number of bullet screens is large, related information of the initial video may also be obtained, where the related information includes the video theme, video tags, a video profile, and the like. The relevance and the heat of each barrage are determined using the related information of the initial video and the barrage information, and the barrages are screened based on their relevance and/or heat to obtain target barrages, where a target barrage is a barrage whose relevance is greater than or equal to a relevance threshold and/or whose heat is greater than a heat threshold. When there are too many barrages, the target barrages may be displayed preferentially. The relevance threshold and the heat threshold may be set flexibly based on experience or the application scenario, which is not specifically limited in this application.
Taking a football match as an example, the related information of the initial video may include the name of the match, the team names, the player names, the score, and the like. Correlation detection is performed between the content of each barrage and this related information to obtain the relevance of each barrage; the heat of each barrage can be determined from its number of likes, and the target barrages are then determined from the relevance and/or the heat.
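A minimal sketch of this screening step, assuming relevance is scored as the number of related-information keywords appearing in a barrage's content and heat is its like count; the scoring rule, the data layout, and the thresholds are illustrative assumptions, and the "and/or" combination is simplified here to an "or".

```python
def screen_barrages(barrages, keywords, relevance_threshold=1, heat_threshold=100):
    """Keep barrages whose relevance and/or heat pass the thresholds.

    `barrages` is a list of dicts with 'content' and 'likes' keys; relevance is
    counted as the number of related-information keywords found in the content."""
    targets = []
    for barrage in barrages:
        relevance = sum(1 for kw in keywords if kw in barrage["content"])
        heat = barrage["likes"]
        if relevance >= relevance_threshold or heat > heat_threshold:
            targets.append(barrage)
    return targets
```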
In an exemplary embodiment of the present application, bullet screen adjustment information may be generated based on the target adjustment area and the bullet screen information, where the bullet screen adjustment information may include at least one of: the initial coordinates of the bullet screen, the movement speed of the bullet screen, the movement direction of the bullet screen, the transparency of the bullet screen, and the character size of the bullet screen. Illustratively, the bullet screen adjustment information may be applied through a bullet screen adjustment function, which adjusts the bullet screens in the target adjustment area to obtain the updated bullet screen information. For example, the preventive shielding adjustment of the bullet screen may be realized by creating a bullet screen floating layer or a mask based on the adjustment area. The setting parameters of the floating layer or mask, such as its size and position, can be determined from the bullet screen adjustment information, and the corresponding floating layer or mask is generated based on those parameters. The bullet screen adjustment information and the preventive shielding adjustment described here are only examples; this application does not limit the content of the bullet screen adjustment information or the specific method of preventive shielding adjustment.
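As a sketch of the mask-based variant, the following code builds a binary mask over the adjustment region so that bullet screen pixels falling inside it can be hidden before the bullet screen layer is composited onto the frame. The use of NumPy, the layer format, and the per-pixel masking approach are illustrative assumptions rather than the mechanism prescribed by this application.

```python
import numpy as np

def build_mask(frame_height, frame_width, adjustment_rect):
    """Binary mask that is 0 inside the adjustment region and 1 elsewhere."""
    mask = np.ones((frame_height, frame_width), dtype=np.uint8)
    x1, y1, x2, y2 = (int(v) for v in adjustment_rect)
    mask[max(y1, 0):min(y2, frame_height), max(x1, 0):min(x2, frame_width)] = 0
    return mask

def apply_mask(barrage_layer, mask):
    """Zero out the bullet screen layer (H x W x C array) inside the adjustment
    region before it is composited onto the video frame."""
    return barrage_layer * mask[..., np.newaxis]
```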
In step 150, the updated bullet screen information and the initial video are synthesized to obtain a synthesized video.
In an exemplary embodiment of the present application, after the updated barrage information is obtained, it may be synthesized with the initial video to obtain the synthesized video. The synthesized video can then be sent to a corresponding player for playback, where the player may be a browser-based player or other dedicated player software.
According to the method provided by the embodiments of the present application, the initial video is processed to obtain the first key frame and the second key frame; the target detection model is used to determine the first bounding box and the second bounding box of the target moving object in the first key frame and the second key frame; the adjustment area is determined based on the first bounding box and the second bounding box; and the bullet screen is subjected to preventive shielding adjustment based on the adjustment area, so that the synthesized video is obtained. Determining the adjustment area of the bullet screen information from both the first key frame and the second key frame improves, to a certain extent, the accuracy of the adjustment area while the target moving object moves, effectively reduces occlusion of the target moving object by bullet screens in the synthesized video, and improves the viewing experience of the user.
Fig. 9 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application. As shown in fig. 9, the apparatus includes:
the decoding module 210 is configured to obtain an initial video, and process the initial video to obtain a first key frame and a second key frame, where the first key frame and the second key frame include moving objects.
The detection module 220 is configured to detect the moving object in the first key frame and the second key frame by using the target detection model, so as to obtain the target moving object and a first bounding box and a second bounding box of the target moving object, where the first bounding box is the bounding box of the target moving object in the first key frame, and the second bounding box is the bounding box of the target moving object in the second key frame.
The determining module 230 is configured to obtain bullet screen information, and determine an adjustment area based on the first bounding box, the second bounding box, and the bullet screen information.
And the updating module 240 is configured to perform preventive shielding adjustment on the barrage information in the adjustment area, so as to obtain updated barrage information.
And the synthesizing module 250 is used for synthesizing the updated barrage information and the initial video to obtain a synthesized video.
In an exemplary embodiment, the decoding module 210 is configured to: sample the initial video based on the sampling parameters of the decoder to obtain a preprocessed video; acquire motion vector information of the moving object based on the initial video; decode the preprocessed video with the decoder based on the motion vector information to obtain a plurality of key frames and the time information corresponding to the key frames; and detect the key frames based on the time information to obtain the first key frame and the second key frame, where the first key frame is the key frame in which the moving object is detected for the first time, and the second key frame is a key frame other than the first key frame that contains the moving object.
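A minimal sketch of decoding key frames together with their timestamps, assuming the PyAV library is available; the sampling step and the motion vector extraction described above are omitted, and the frame format and function name are illustrative choices.

```python
import av  # PyAV, assumed available

def extract_key_frames(path):
    """Decode the video and collect its key frames with their timestamps (in seconds)."""
    key_frames = []
    with av.open(path) as container:
        stream = container.streams.video[0]
        for frame in container.decode(stream):
            if frame.key_frame:
                key_frames.append((frame.time, frame.to_ndarray(format="bgr24")))
    return key_frames
```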
In an exemplary embodiment, the detection module 220 is configured to: perform feature extraction on the first key frame and the second key frame based on the target detection model to obtain a first feature map and a second feature map; identify the moving object in the first feature map and the second feature map to obtain the target moving object; and determine a first bounding box of the target moving object in the first feature map and a second bounding box of the target moving object in the second feature map.
In an exemplary embodiment, the detection module 220 is configured to: perform first convolution processing on the first key frame and the second key frame based on a first convolution kernel of the target detection model to obtain a first sub-feature map and a second sub-feature map; and perform second convolution processing on the first sub-feature map and the second sub-feature map based on a second convolution kernel of the target detection model to obtain the first feature map and the second feature map.
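A minimal sketch of the two-stage convolution described above, assuming a PyTorch implementation; the channel counts, kernel sizes, and activation are arbitrary illustrative choices, since the actual architecture of the target detection model is not specified at this level of detail.

```python
import torch
import torch.nn as nn

class TwoStageFeatureExtractor(nn.Module):
    """First convolution produces the sub-feature map; the second produces the feature map."""

    def __init__(self, in_channels=3, mid_channels=16, out_channels=32):
        super().__init__()
        self.first_conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.second_conv = nn.Conv2d(mid_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, key_frame):
        sub_feature_map = torch.relu(self.first_conv(key_frame))     # first convolution processing
        feature_map = torch.relu(self.second_conv(sub_feature_map))  # second convolution processing
        return feature_map
```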
In an exemplary embodiment, the detection module 220 is configured to identify a moving object in the first feature map and the second feature map, to obtain a first confidence coefficient of the moving object, where the first confidence coefficient is used to represent a probability that the moving object is a target moving object; and comparing the first confidence coefficient with the reference confidence coefficient, and determining the moving object as a target moving object if the first confidence coefficient is greater than or equal to the reference confidence coefficient.
In an exemplary embodiment, the detection module 220 is configured to: obtain the coordinates of the target moving object and generate a bounding box prediction result of the target moving object based on the coordinates, where the bounding box prediction result includes a plurality of prediction bounding boxes of the target moving object and a second confidence level corresponding to each prediction bounding box, and the second confidence level represents the probability that the target moving object is contained in the prediction bounding box; determine a reference bounding box based on the second confidence levels, the reference bounding box being the prediction bounding box with the largest second confidence level among the plurality of prediction bounding boxes; and calculate the intersection ratio of each prediction bounding box with the reference bounding box and determine, using the intersection ratio, the first bounding box and the second bounding box of the target moving object from the prediction bounding boxes of the first feature map and the second feature map.
In an exemplary embodiment, the determining module 230 is configured to calculate a first overlapping region of the bullet screen information and the first bounding box, and a second overlapping region of the bullet screen information and the second bounding box; an adjustment region is determined based on the first overlap region and the second overlap region, the adjustment region including the first overlap region and the second overlap region.
In an exemplary embodiment, the updating module 240 is configured to generate the bullet screen adjustment information based on the adjustment area and the bullet screen information, where the bullet screen adjustment information includes: at least one of initial coordinates of the barrage, barrage moving speed, barrage moving direction, barrage transparency and barrage character size; and performing preventive shielding adjustment on the barrage of the adjustment area by using the barrage adjustment information to obtain updated barrage information.
In this way, the first key frame and the second key frame are obtained by processing the initial video; the target detection model is used to determine the first bounding box and the second bounding box of the target moving object in the first key frame and the second key frame; the adjustment area is determined based on the first bounding box and the second bounding box; and the bullet screen is subjected to preventive shielding adjustment based on the adjustment area, so that the synthesized video is obtained. Determining the adjustment area of the bullet screen information from both the first key frame and the second key frame improves, to a certain extent, the accuracy of the adjustment area while the target moving object moves, effectively reduces occlusion of the target moving object by bullet screens in the synthesized video, and improves the viewing experience of the user.
It should be understood that, in implementing the functions of the apparatus provided above, only the division of the above functional modules is illustrated, and in practical application, the above functional allocation may be implemented by different functional modules, that is, the internal structure of the device is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.
Fig. 10 is a block diagram of a structure of a terminal device according to an embodiment of the present application. The terminal device 1100 may be a portable mobile terminal such as: smart phones, tablet computers, players, notebook computers or desktop computers. Terminal device 1100 may also be referred to by other names of user devices, portable terminals, laptop terminals, desktop terminals, and the like.
In general, the terminal apparatus 1100 includes: a processor 1101 and a memory 1102.
The processor 1101 may include one or more processing cores, for example a 4-core or 8-core processor. The processor 1101 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 1101 may also include a main processor and a coprocessor; the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 1101 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1101 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1102 is used to store at least one instruction for execution by processor 1101 to implement the method of video processing provided by the method embodiment shown in fig. 2 in the present application.
In some embodiments, the terminal device 1100 may further optionally include: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102, and peripheral interface 1103 may be connected by a bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 1103 by buses, signal lines or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1104, a display screen 1105, a camera assembly 1106, audio circuitry 1107, and a power supply 1109.
The peripheral interface 1103 may be used to connect at least one I/O (Input/Output)-related peripheral device to the processor 1101 and the memory 1102. In some embodiments, the processor 1101, the memory 1102, and the peripheral interface 1103 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1101, the memory 1102, and the peripheral interface 1103 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 1104 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1104 communicates with a communication network and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission, or converting a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1104 includes: an antenna system, an RF transceiver, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. The radio frequency circuit 1104 may communicate with other terminal devices via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the world wide web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1104 may also include NFC (Near Field Communication) related circuitry, which is not limited in this application.
The display screen 1105 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 1105 is a touch display, the display 1105 also has the ability to collect touch signals at or above the surface of the display 1105. The touch signal may be input to the processor 1101 as a control signal for processing. At this time, the display screen 1105 may also be used to provide virtual buttons and/or virtual keyboards, also referred to as soft buttons and/or soft keyboards. In some embodiments, the display 1105 may be one and disposed on the front panel of the terminal device 1100; in other embodiments, the display 1105 may be at least two, and disposed on different surfaces of the terminal device 1100 or in a folded design; in other embodiments, the display 1105 may be a flexible display disposed on a curved surface or a folded surface of the terminal device 1100. Even more, the display 1105 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The display 1105 may be made of LCD (Liquid Crystal Display ), OLED (Organic Light-Emitting Diode) or other materials.
The camera assembly 1106 is used to capture images or video. Optionally, the camera assembly 1106 includes a front camera and a rear camera. In general, the front camera is provided on the front panel of the terminal device 1100, and the rear camera is provided on the rear surface of the terminal device 1100. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, Virtual Reality (VR) shooting, or other fused shooting functions. In some embodiments, the camera assembly 1106 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash; a dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.
The audio circuit 1107 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 1101 for processing, or inputting the electric signals to the radio frequency circuit 1104 for voice communication. For the purpose of stereo acquisition or noise reduction, a plurality of microphones may be provided at different portions of the terminal device 1100, respectively. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 1101 or the radio frequency circuit 1104 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 1107 may also include a headphone jack.
The power supply 1109 is used to supply power to the respective components in the terminal device 1100. The power source 1109 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power source 1109 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal device 1100 also includes one or more sensors 1110. The one or more sensors 1110 include, but are not limited to: acceleration sensor 1111, gyroscope sensor 1112, pressure sensor 1113, optical sensor 1115, and proximity sensor 1116.
The acceleration sensor 1111 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established in the terminal apparatus 1100. For example, the acceleration sensor 1111 may be configured to detect components of gravitational acceleration in three coordinate axes. The processor 1101 may control the display screen 1105 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 1111. Acceleration sensor 1111 may also be used for the acquisition of motion data of a game or a user.
The gyro sensor 1112 may detect a body direction and a rotation angle of the terminal device 1100, and the gyro sensor 1112 may collect a 3D motion of the user on the terminal device 1100 in cooperation with the acceleration sensor 1111. The processor 1101 may implement the following functions based on the data collected by the gyro sensor 1112: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor 1113 may be disposed at a side frame of the terminal device 1100 and/or at a lower layer of the display screen 1105. When the pressure sensor 1113 is provided at a side frame of the terminal apparatus 1100, a grip signal of the terminal apparatus 1100 by a user can be detected, and the processor 1101 performs left-right hand recognition or quick operation based on the grip signal collected by the pressure sensor 1113. When the pressure sensor 1113 is disposed at the lower layer of the display screen 1105, the processor 1101 realizes control of the operability control on the UI interface according to the pressure operation of the user on the display screen 1105. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 1115 is used to collect the ambient light intensity. In one embodiment, the processor 1101 may control the display brightness of the display screen 1105 based on the intensity of ambient light collected by the optical sensor 1115. Specifically, when the intensity of the ambient light is high, the display luminance of the display screen 1105 is turned up; when the ambient light intensity is low, the display luminance of the display screen 1105 is turned down. In another embodiment, the processor 1101 may also dynamically adjust the shooting parameters of the camera assembly 1106 based on the intensity of ambient light collected by the optical sensor 1115.
A proximity sensor 1116, also referred to as a distance sensor, is typically provided on the front panel of the terminal device 1100. The proximity sensor 1116 is used to collect a distance between the user and the front surface of the terminal device 1100. In one embodiment, when the proximity sensor 1116 detects that the distance between the user and the front face of the terminal device 1100 gradually decreases, the processor 1101 controls the display 1105 to switch from the bright screen state to the off screen state; when the proximity sensor 1116 detects that the distance between the user and the front surface of the terminal apparatus 1100 gradually increases, the processor 1101 controls the display screen 1105 to switch from the off-screen state to the on-screen state.
It will be appreciated by those skilled in the art that the structure shown in fig. 10 is not limiting and that terminal device 1100 may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
Fig. 11 is a schematic structural diagram of a server provided by an embodiment of the present application. The server 1200 may vary considerably in configuration or performance and may include one or more processors (Central Processing Units, CPUs) 1201 and one or more memories 1202, where at least one program code is stored in the one or more memories 1202 and is loaded and executed by the one or more processors 1201 to implement the method of video processing provided by the method embodiment shown in fig. 2. Of course, the server 1200 may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for input and output, and the server 1200 may also include other components for implementing device functions, which are not described in detail herein.
In an exemplary embodiment, there is also provided a computer-readable storage medium having stored therein at least one program code loaded and executed by a processor to cause a computer to implement the method of video processing provided by the method embodiment shown in fig. 2, described above.
Alternatively, the above-mentioned computer readable storage medium may be a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a Read-Only optical disk (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program or a computer program product having at least one computer instruction stored therein is also provided, the at least one computer instruction being loaded and executed by a processor to cause the computer to implement the method of video processing provided by the method embodiment shown in fig. 2, described above.
It should be noted that, information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions. For example, the initial video and bullet screen information referred to in this application are acquired with sufficient authorization.
It should be understood that references herein to "a plurality" are to two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The foregoing description is merely an exemplary embodiment of the present application and is not intended to limit the present application; any modification, equivalent replacement, or improvement made within the principles of the present application shall fall within the protection scope of the present application.
Claims (12)
1. A method of video processing, the method comprising:
acquiring an initial video, and processing the initial video to obtain a first key frame and a second key frame, wherein the first key frame and the second key frame contain moving objects;
detecting the moving object in the first key frame and the second key frame by using a target detection model to obtain a target moving object and a first bounding box and a second bounding box of the target moving object, wherein the first bounding box is a bounding box of the target moving object in the first key frame, and the second bounding box is a bounding box of the target moving object in the second key frame;
acquiring bullet screen information, and determining an adjustment area based on the first bounding box, the second bounding box and the bullet screen information;
performing preventive shielding adjustment on the barrage information in the adjustment area to obtain updated barrage information;
and synthesizing the updated barrage information and the initial video to obtain a synthesized video.
2. The method of video processing according to claim 1, wherein said processing the initial video to obtain a first key frame and a second key frame comprises:
sampling the initial video based on sampling parameters of a decoder to obtain a preprocessed video;
acquiring motion vector information of the moving object based on the initial video;
based on the motion vector information, decoding the preprocessed video by using a decoder to obtain a plurality of key frames and time information corresponding to the key frames;
and detecting the key frames based on the time information to obtain the first key frame and the second key frame, wherein the first key frame is the key frame in which the moving object is detected for the first time, and the second key frame is a key frame, other than the first key frame, that contains the moving object.
3. The method of video processing according to claim 1, wherein detecting the moving object in the first key frame and the second key frame using a target detection model to obtain a target moving object and a first bounding box and a second bounding box of the target moving object comprises:
performing feature extraction on the first key frame and the second key frame respectively based on the target detection model to obtain a first feature map and a second feature map;
identifying the moving object in the first feature map and the second feature map to obtain the target moving object;
determining the first bounding box of the target moving object in the first feature map and the second bounding box of the target moving object in the second feature map.
4. The method of video processing according to claim 3, wherein said performing feature extraction on the first key frame and the second key frame respectively based on the target detection model to obtain the first feature map and the second feature map comprises:
performing first convolution processing on the first key frame and the second key frame based on a first convolution kernel of the target detection model to obtain a first sub-feature map and a second sub-feature map;
and performing second convolution processing on the first sub-feature map and the second sub-feature map based on a second convolution kernel of the target detection model to obtain the first feature map and the second feature map.
5. A method of video processing according to claim 3, wherein said identifying the moving object in the first and second feature maps to obtain the target moving object comprises:
identifying the moving object in the first feature map and the second feature map to obtain a first confidence coefficient of the moving object, wherein the first confidence coefficient is used for representing the probability that the moving object is the target moving object;
and comparing the first confidence coefficient with a reference confidence coefficient, and determining the moving object as the target moving object if the first confidence coefficient is larger than or equal to the reference confidence coefficient.
6. A method of video processing according to claim 3, wherein said determining the first bounding box of the target moving object in the first feature map and the second bounding box of the target moving object in the second feature map comprises:
acquiring coordinates of the target moving object, and generating a bounding box prediction result of the target moving object based on the coordinates, wherein the bounding box prediction result comprises a plurality of prediction bounding boxes of the target moving object and second confidence degrees corresponding to the prediction bounding boxes, and the second confidence degrees are used for representing the probability that the target moving object is contained in the prediction bounding boxes;
determining a reference bounding box based on the second confidence degrees, wherein the reference bounding box is the prediction bounding box corresponding to the largest of the second confidence degrees corresponding to the plurality of prediction bounding boxes;
calculating an intersection ratio of each prediction bounding box and the reference bounding box, and determining the first bounding box and the second bounding box of the target moving object from among the prediction bounding boxes of the first feature map and the second feature map using the intersection ratio.
7. The method of video processing according to any one of claims 1-6, wherein the determining an adjustment area based on the first bounding box, the second bounding box, and the bullet screen information comprises:
calculating a first overlap region of the bullet screen information and the first bounding box and a second overlap region of the bullet screen information and the second bounding box;
determining the adjustment area based on the first overlap region and the second overlap region, the adjustment area including the first overlap region and the second overlap region.
8. The method of video processing according to any one of claims 1 to 6, wherein said performing a preventive shielding adjustment on the bullet screen information in the adjustment area to obtain updated bullet screen information includes:
generating bullet screen adjustment information based on the adjustment region and the bullet screen information, the bullet screen adjustment information including: at least one of initial coordinates of the barrage, barrage moving speed, barrage moving direction, barrage transparency and barrage character size;
and utilizing the bullet screen adjusting information to carry out preventive shielding adjustment on the bullet screen in the adjusting area, and obtaining updated bullet screen information.
9. An apparatus for video processing, the apparatus comprising:
the decoding module is used for acquiring an initial video, processing the initial video to obtain a first key frame and a second key frame, wherein the first key frame and the second key frame contain moving objects;
the detection module is used for detecting the moving object in the first key frame and the second key frame by utilizing a target detection model to obtain a target moving object and a first bounding box and a second bounding box of the target moving object, wherein the first bounding box is the bounding box of the target moving object in the first key frame, and the second bounding box is the bounding box of the target moving object in the second key frame;
the determining module is used for acquiring bullet screen information and determining an adjustment area based on the first bounding box, the second bounding box and the bullet screen information;
the updating module is used for carrying out preventive shielding adjustment on the barrage information in the adjustment area to obtain updated barrage information;
and the synthesis module is used for synthesizing the updated barrage information and the initial video to obtain a synthesized video.
10. A computer device comprising a processor and a memory, the memory having stored therein at least one program code, the at least one program code being loaded and executed by the processor to cause the computer device to implement the method of video processing as claimed in any one of claims 1 to 8.
11. A computer readable storage medium having stored therein at least one program code, the at least one program code being loaded and executed by a processor to cause a computer to implement the method of video processing as claimed in any one of claims 1 to 8.
12. A computer program product, characterized in that at least one computer instruction is stored in the computer program product, which is loaded and executed by a processor to cause the computer to implement the method of video processing according to any of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311523011.3A CN117579859A (en) | 2023-11-14 | 2023-11-14 | Video processing method, device, equipment and readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311523011.3A CN117579859A (en) | 2023-11-14 | 2023-11-14 | Video processing method, device, equipment and readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117579859A true CN117579859A (en) | 2024-02-20 |
Family
ID=89894755
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311523011.3A Pending CN117579859A (en) | 2023-11-14 | 2023-11-14 | Video processing method, device, equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117579859A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118115631A (en) * | 2024-04-25 | 2024-05-31 | 数梦万维(杭州)人工智能科技有限公司 | Image generation method, device, electronic equipment and computer readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||