WO2024082971A1 - Video processing method and related apparatus
- Publication number
- WO2024082971A1 (application PCT/CN2023/123349)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video frame
- computing power
- image
- video
- frame sequence
Classifications
- All classifications below fall under H (Electricity) > H04 (Electric communication technique) > H04N (Pictorial communication, e.g. television) > H04N19/00 (Methods or arrangements for coding, decoding, compressing or decompressing digital video signals).
- H04N19/14: Coding unit complexity, e.g. amount of activity or edge presence estimation
- H04N19/105: Selection of the reference unit for prediction within a chosen coding or prediction mode, e.g. adaptive choice of position and number of pixels used for prediction
- H04N19/119: Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks
- H04N19/124: Quantisation
- H04N19/137: Motion inside a coding unit, e.g. average field, frame or block difference
- H04N19/142: Detection of scene cut or scene change
- H04N19/172: Adaptive coding characterised by the coding unit, the unit being an image region, the region being a picture, frame or field
- H04N19/176: Adaptive coding characterised by the coding unit, the unit being an image region, the region being a block, e.g. a macroblock
- H04N19/51: Motion estimation or motion compensation
- H04N19/523: Motion estimation or motion compensation with sub-pixel accuracy
- H04N19/85: Pre-processing or post-processing specially adapted for video compression
- H04N19/91: Entropy coding, e.g. variable length coding [VLC] or arithmetic coding
Definitions
- The present application relates to the field of data processing technology, and in particular to video processing technology.
- Video encoding is the basis of video processing, and excellent encoding capabilities can give products a high-definition, smooth playback experience.
- In current practice, the video encoding kernel is configured with fixed encoding parameters and uses the same encoding parameters to encode every input video source.
- The encoding parameters affect the frame output stability of the video encoding kernel. The more numerous and complex the encoding parameters, the higher the frame output stability, but the more computing power the video encoding kernel requires.
- Scenarios such as live broadcast, real-time video communication, cloud rendering, and cloud desktop place high demands on the frame output stability of the video encoding kernel. Static pictures with small motion and texture changes consume relatively little encoding computing power, while pictures with complex motion textures and frequent scene switching consume relatively more.
- When the same encoding parameters are used for the whole video, numerous and complex parameters lead to higher server deployment costs for static pictures with small motion and texture changes, whereas fewer and simpler parameters leave insufficient computing power for compressing pictures with complex motion textures and frequent scene switching, resulting in poor frame output stability of the video encoding kernel.
- The embodiments of the present application provide a video processing method and related devices, which can adaptively adjust the encoding parameters of the video so that the adjusted encoding parameters meet the corresponding encoding requirements, improve frame output stability, and reduce server deployment costs.
- One aspect of the present application provides a video processing method, which is executed by a computer device, comprising:
- each video frame sequence includes at least one video frame image, and N is an integer greater than 1;
- A video processing device, comprising:
- a video frame sequence acquisition module, used to acquire N video frame sequences of an input video, wherein each video frame sequence includes at least one video frame image, and N is an integer greater than 1;
- a video frame sequence extraction module, used to obtain the i-th video frame sequence and the adjacent (i-1)-th video frame sequence from the N video frame sequences, where i is an integer greater than 1;
- a video frame image acquisition module, used to acquire a first video frame image from the i-th video frame sequence and a second video frame image from the (i-1)-th video frame sequence, wherein the first video frame image corresponds to a first image attribute, and the second video frame image corresponds to a second image attribute;
- a computing power acquisition module, used to acquire a first computing power corresponding to the (i-1)-th video frame sequence, wherein the first computing power represents the computing power consumed when encoding and/or decoding the (i-1)-th video frame sequence;
- a video encoding parameter determination module, used to determine the encoding parameters of the i-th video frame sequence according to the first computing power, the first image attribute and the second image attribute.
- Another aspect of the present application provides a computer device, comprising:
- the memory is used to store programs;
- the processor is used to execute the program in the memory, including executing the above-mentioned methods;
- the bus system is used to connect the memory and the processor so that the memory and the processor can communicate with each other.
- Another aspect of the present application provides a computer-readable storage medium in which instructions are stored.
- When the instructions are run on a computer, the computer is caused to execute the above-mentioned methods.
- Another aspect of the present application provides a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium.
- A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device executes the methods provided by the above aspects.
- the present application provides a video processing method and a related device, which includes: first, obtaining N video frame sequences of an input video, wherein each video frame sequence includes at least one video frame image; second, obtaining the i-th video frame sequence and the i-1-th video frame sequence adjacent to it from the N video frame sequences; third, obtaining a first video frame image from the i-th video frame sequence, and obtaining a second video frame image from the i-1-th video frame sequence, wherein the first video frame image corresponds to a first image attribute, and the second video frame image corresponds to a second image attribute; then, obtaining a first computing power corresponding to the i-1-th video frame sequence, wherein the first computing power is used to characterize the computing power consumed when encoding and/or decoding the i-1-th video frame sequence; then, according to at least one of the first computing power, the first image attribute and the second image attribute, determining the encoding parameters of the i-th video frame sequence, and the encoding parameters are used
- The video processing method provided in the embodiment of the present application decomposes the encoding task of the input video into separately encoding the N video frame sequences contained in the input video.
- When encoding the current video frame sequence (i.e., the i-th video frame sequence), the encoding parameters are adaptively determined based on at least one of: the computing power consumed when encoding and/or decoding the previous video frame sequence (i.e., the (i-1)-th video frame sequence), the first image attribute of the first video frame image in the current sequence, and the second image attribute of the second video frame image in the previous sequence.
- The encoding parameters determined in this way can meet the encoding requirements of the current video frame sequence, improve frame output stability, and reduce server deployment costs.
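The adaptive step summarized above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method claimed in the application: the `SequenceStats` fields, the three preset names, the `budget` and `tolerance` values, and the rule of scaling the previous cost by the ratio of image attributes are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SequenceStats:
    """Hypothetical measurements for the previous (i-1)-th sequence."""
    compute_cost: float     # computing power consumed when coding that sequence
    image_attribute: float  # assumed scalar complexity score of its sampled frame

# Hypothetical presets, ordered from cheapest to most expensive.
PRESETS = ["fast", "medium", "slow"]

def choose_preset(prev, curr_attribute, current_level, budget, tolerance=0.1):
    """Pick the preset index for sequence i from the cost of sequence i-1
    and the image attributes of both sequences (assumed heuristic)."""
    # Scale the previous cost by how much more complex the new frame looks.
    ratio = curr_attribute / prev.image_attribute if prev.image_attribute > 0 else 1.0
    predicted_cost = prev.compute_cost * ratio
    if predicted_cost > budget * (1 + tolerance):
        return max(current_level - 1, 0)                 # over budget: simplify
    if predicted_cost < budget * (1 - tolerance):
        return min(current_level + 1, len(PRESETS) - 1)  # headroom: spend more
    return current_level                                 # within tolerance: keep

prev = SequenceStats(compute_cost=120.0, image_attribute=0.5)
level = choose_preset(prev, curr_attribute=0.6, current_level=1, budget=100.0)
print(PRESETS[level])  # prints: fast
```

The point of the sketch is the feedback loop: each sequence is encoded with parameters chosen from what the previous sequence actually cost, rather than from one fixed configuration.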
- FIG1 is a schematic diagram of an architecture of a video processing system provided by an embodiment of the present application.
- FIG2a is a flow chart of a video processing method provided by an embodiment of the present application.
- FIG2b is a schematic diagram of a video processing method according to an embodiment of the present application.
- FIG3a is a flow chart of a video processing method provided by another embodiment of the present application.
- FIG3b is a schematic diagram of a video processing method according to another embodiment of the present application.
- FIG4 is a schematic diagram of depth division of coding units provided in an embodiment of the present application.
- FIG5a is a flow chart of a video processing method provided by another embodiment of the present application.
- FIG5b is a schematic diagram of a video processing method according to another embodiment of the present application.
- FIG6 is a schematic diagram of depth division of prediction units provided by an embodiment of the present application.
- FIG7a is a flow chart of a video processing method provided by another embodiment of the present application.
- FIG7b is a schematic diagram of a video processing method according to another embodiment of the present application.
- FIG8 is a schematic diagram of motion estimation provided by an embodiment of the present application.
- FIG9 is a schematic diagram of motion compensation provided by an embodiment of the present application.
- FIG10a is a flow chart of a video processing method provided by another embodiment of the present application.
- FIG10b is a schematic diagram of a video processing method according to another embodiment of the present application.
- FIG11 is a schematic diagram of depth division of a transform unit provided by an embodiment of the present application.
- FIG12 is a flow chart of a video processing method provided by another embodiment of the present application.
- FIG13 is a schematic diagram of encoding a target video frame image provided by an embodiment of the present application.
- FIG14 is a flow chart of a video processing method provided by another embodiment of the present application.
- FIG15 is a schematic diagram of a coding framework provided in an embodiment of the present application.
- FIG16 is a flow chart of a video processing method provided by another embodiment of the present application.
- FIG17 is a flow chart of a video processing method provided by another embodiment of the present application.
- FIG18 is a flow chart of a video processing method provided by another embodiment of the present application.
- FIG19 is a flow chart of a video processing method provided by another embodiment of the present application.
- FIG21 is a schematic diagram of the structure of a video processing device provided by another embodiment of the present application.
- FIG22 is a schematic diagram of the structure of a video processing device provided by another embodiment of the present application.
- FIG23 is a schematic diagram of the structure of a video processing device provided by another embodiment of the present application.
- FIG24 is a schematic diagram of the structure of a video processing device provided by another embodiment of the present application.
- FIG25 is a schematic diagram of the structure of a video processing device provided by another embodiment of the present application.
- FIG26 is a schematic diagram of a server structure provided by an embodiment of the present application.
- Video encoding: Converts a file in the original video format into a file in another video format by means of compression technology.
- The most important codec standards in video streaming are the ITU's H.261, H.263, and H.264.
- H.264 is a new-generation coding standard, known for high compression, high quality, and support for streaming media transmission over multiple networks.
- The H.264 protocol defines three types of frames: a fully encoded frame is an I frame; a frame that refers to the previous I frame and encodes only the difference is a P frame; and a frame that refers to both the previous and the next frames is a B frame.
- The core algorithms used by H.264 are intra-frame compression and inter-frame compression. Intra-frame compression is the algorithm that generates I frames, and inter-frame compression is the algorithm that generates B frames and P frames.
- H.264 images are organized in sequences.
- A sequence is the data stream produced by encoding a segment of images, starting with an I frame and ending at the next I frame.
- The first image in a sequence is called an IDR image (instantaneous decoding refresh image), and IDR images are all I frame images.
- H.264 introduces IDR images for decoding resynchronization.
- When the decoder decodes an IDR image, it immediately clears the reference frame queue, outputs or discards all decoded data, re-searches the parameter set, and starts a new sequence. In this way, if a major error occurred in the previous sequence, the IDR image provides an opportunity for resynchronization.
- Images after the IDR image are never decoded using data from images before the IDR image.
- A sequence is the data stream generated by encoding a segment of images whose content differs relatively little.
- When motion changes are small, a sequence can be very long: the picture content changes very little, so a single I frame can be encoded, followed by P frames and B frames throughout.
- When motion changes are large, a sequence may be relatively short, for example one I frame followed by 3 or 4 P frames.
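To make the sequence structure concrete, here is a minimal sketch that groups a stream of frame-type labels into sequences, opening a new sequence at every IDR frame. The string labels and the function name are assumptions made for illustration; a real bitstream parser would read NAL unit types instead.

```python
def split_into_sequences(frame_types):
    """Group frame-type labels into sequences that each begin at an IDR frame."""
    sequences = []
    for ftype in frame_types:
        # An IDR frame opens a new sequence; frames seen before the first
        # IDR (if any) are collected into an initial group.
        if ftype == "IDR" or not sequences:
            sequences.append([])
        sequences[-1].append(ftype)
    return sequences

stream = ["IDR", "P", "B", "P", "IDR", "P", "P", "P"]
print(split_into_sequences(stream))
# [['IDR', 'P', 'B', 'P'], ['IDR', 'P', 'P', 'P']]
```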
- IDR frame: In video encoding (H.264/H.265/H.266/AV1, etc.), images are organized in sequences. The first image in a sequence is an IDR image (instantaneous decoding refresh image), and IDR images are all I frame images.
- I frame: An intra-coded frame, also called a key frame. It can be understood as a complete record of the picture; decoding requires only the data of this frame (because it contains the complete picture).
- An IDR frame is always an I frame, but an I frame is not necessarily an IDR frame. A sequence can contain many I frames, and images after a non-IDR I frame may use images before that I frame as motion references.
- P frame: A forward predictive coded frame.
- A P frame represents the difference between this frame and the previous key frame (or P frame).
- When decoding, the previously cached picture is superimposed with the difference defined by this frame to generate the final picture.
- A P frame uses an I frame as its reference frame. The encoder finds, in the I frame, the prediction value and motion vector for a given point of the P frame, and transmits the prediction difference together with the motion vector. The receiving end uses the motion vector to locate the prediction value of that point in the I frame and adds the prediction difference to obtain the sample value of the point, thereby reconstructing the complete P frame.
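The receiving-end step described above (locate the prediction with the motion vector, then add the prediction difference) can be illustrated with a small sketch. It assumes square blocks, integer-pixel motion vectors given as `(dy, dx)`, and frames stored as nested lists; none of these details come from the application.

```python
def reconstruct_block(reference, top, left, motion_vector, residual):
    """Rebuild one block of a P frame: fetch the motion-shifted block from the
    reference frame and add the transmitted prediction difference (residual)."""
    dy, dx = motion_vector
    size = len(residual)
    return [
        [reference[top + dy + r][left + dx + c] + residual[r][c] for c in range(size)]
        for r in range(size)
    ]

reference = [
    [10, 11, 12, 13],
    [20, 21, 22, 23],
    [30, 31, 32, 33],
    [40, 41, 42, 43],
]
residual = [[1, -1], [0, 2]]
block = reconstruct_block(reference, top=0, left=0, motion_vector=(1, 2), residual=residual)
print(block)  # [[23, 22], [32, 35]]
```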
- B frame: A bidirectional predictive interpolated coded frame.
- A B frame is a bidirectional difference frame; that is, it records the differences between the current frame and both the previous and the next frames.
- A B frame may or may not be used as a reference frame for other B frames.
- To decode a B frame, the decoder needs not only the previously cached picture but also the decoded subsequent picture; the final picture is obtained by superimposing the previous and next pictures with the current frame data.
- B frames have a high compression rate, but decoding them consumes considerable CPU (central processing unit) resources.
- A B frame uses the previous I frame or P frame and the following P frame as reference frames. The encoder finds the prediction value of a given point in the B frame together with two motion vectors, and transmits the prediction difference and the motion vectors.
- The receiving end finds (calculates) the prediction value in the two reference frames from the motion vectors and adds the difference to obtain the sample value of that point in the B frame, thereby reconstructing the complete B frame.
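A minimal sketch of the bidirectional case follows. The text does not say how the two reference predictions are combined; a rounded average is assumed here because it is a common choice, and the two input blocks are taken to be already motion-compensated.

```python
def bi_predict(block_prev, block_next, residual):
    """Combine two motion-compensated predictions (assumed: rounded average)
    and add the transmitted difference to recover the B-frame block."""
    size = len(residual)
    return [
        [(block_prev[r][c] + block_next[r][c] + 1) // 2 + residual[r][c]
         for c in range(size)]
        for r in range(size)
    ]

block_prev = [[10, 20], [30, 40]]   # prediction from the earlier reference
block_next = [[12, 21], [30, 44]]   # prediction from the later reference
residual = [[0, 1], [-1, 0]]
print(bi_predict(block_prev, block_next, residual))  # [[11, 22], [29, 42]]
```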
- Macroblock: The basic unit of coding. An image to be coded is divided into multiple blocks for processing.
- Intra-frame prediction: The predicted block is formed from already coded and reconstructed blocks and the current block.
- Intra-frame compression is also called spatial compression. When compressing a frame, only the data of that frame is considered, without regard to redundant information between adjacent frames, which is similar to static image compression. Intra-frame compression generally uses a lossy compression algorithm. Because it encodes a complete image, the result can be decoded and displayed independently. Intra-frame compression generally does not achieve a very high compression rate.
- Inter-frame prediction: Mainly includes motion estimation (motion search method, motion estimation criteria, sub-pixel interpolation and motion vector estimation) and motion compensation. It performs reference, prediction, and interpolation compensation over time at GOP (group of pictures) granularity.
- The principle of inter-frame compression is that the data of adjacent frames is highly correlated; in other words, the content changes little from one frame to the next, so consecutive or adjacent video frames contain redundant information. Compressing the redundancy between adjacent frames can further increase the amount of compression, that is, reduce the ratio of compressed size to original size.
- Inter-frame compression is also called temporal compression; it compresses data by comparing frames along the timeline.
- Inter-frame compression is generally lossless.
- The frame differencing algorithm is a typical temporal compression method. It compares the current frame with its adjacent frames and records only the differences, which can greatly reduce the amount of data.
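As a toy illustration of frame differencing, the sketch below records only the pixels that changed relative to the previous frame and rebuilds the current frame from them. The dictionary representation and the function names are assumptions for illustration; practical codecs work on blocks and residuals rather than individual pixels.

```python
def frame_diff(prev, curr):
    """Record only the pixels of curr that differ from prev, as {(row, col): value}."""
    return {
        (r, c): curr[r][c]
        for r in range(len(curr))
        for c in range(len(curr[r]))
        if curr[r][c] != prev[r][c]
    }

def apply_diff(prev, diff):
    """Rebuild the current frame from the previous frame plus the recorded differences."""
    frame = [row[:] for row in prev]
    for (r, c), value in diff.items():
        frame[r][c] = value
    return frame

prev = [[5, 5, 5], [5, 5, 5]]
curr = [[5, 5, 5], [5, 9, 5]]
diff = frame_diff(prev, curr)
print(diff)                        # {(1, 1): 9}
print(apply_diff(prev, diff) == curr)  # True
```

Only one of six pixels is stored here, and the round trip is exact, which is the sense in which this kind of differencing is lossless.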
- SATD: Sum of Absolute Transformed Differences.
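For readers unfamiliar with SATD, a minimal 4x4 sketch is given below: the difference between two blocks is transformed with a Hadamard matrix and the absolute values of the coefficients are summed. The block size and the lack of normalization are assumptions; encoders often scale the result.

```python
# 4x4 Hadamard matrix (symmetric, so it equals its own transpose).
H4 = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]

def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def satd4x4(block_a, block_b):
    """Unnormalized SATD: sum of |H4 * (A - B) * H4| over all 16 coefficients."""
    diff = [[block_a[r][c] - block_b[r][c] for c in range(4)] for r in range(4)]
    transformed = matmul4(matmul4(H4, diff), H4)
    return sum(abs(v) for row in transformed for v in row)

ones = [[1] * 4 for _ in range(4)]
zeros = [[0] * 4 for _ in range(4)]
print(satd4x4(ones, zeros))  # 16
print(satd4x4(ones, ones))   # 0
```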
- MC: Motion Compensation.
- ME: Motion Estimation.
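Motion estimation can be illustrated with the simplest possible search: an exhaustive (full) search that minimizes the sum of absolute differences (SAD) over a small window. This is a generic textbook sketch, not the search method used in the application; the function names, the SAD criterion, and integer-pixel accuracy are all assumptions.

```python
def sad(frame, ref, top, left, dy, dx, size):
    """Sum of absolute differences between a block and its shifted reference block."""
    return sum(
        abs(frame[top + r][left + c] - ref[top + dy + r][left + dx + c])
        for r in range(size)
        for c in range(size)
    )

def full_search(frame, ref, top, left, size, search_range):
    """Try every integer offset in the window and return (best_mv, best_cost)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            cost = sad(frame, ref, top, left, dy, dx, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost

# Reference frame with distinct values; the current block copies ref at (3, 1).
ref = [[r * 10 + c for c in range(6)] for r in range(6)]
cur = [[0] * 6 for _ in range(6)]
for r in range(2):
    for c in range(2):
        cur[2 + r][2 + c] = ref[3 + r][1 + c]

print(full_search(cur, ref, top=2, left=2, size=2, search_range=2))  # ((1, -1), 0)
```

Motion compensation is then the inverse use of the result: the decoder fetches the block at the offset given by the motion vector and adds the transmitted difference.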
- Lookahead: Estimates the encoding cost of frames that the main encoder module has not yet analyzed. It caches a configured number of encoded reconstructed frames before the frame currently being evaluated and uses them to evaluate inter-frame prediction references for the current frame.
- BD-rate: One of the main metrics for evaluating the performance of video encoding algorithms. It indicates the change in bit rate and peak signal-to-noise ratio (PSNR) of video encoded by a new algorithm compared with the original algorithm.
- GOP (group of pictures): The interval between two I frames.
- Minigop: Within a GOP, a certain number of B frames lie between two P frames. The interval between two P frames is a minigop.
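The minigop definition can be made concrete with a short sketch that splits one GOP into minigops, each closed by a P frame. It assumes the GOP is given in display order as a list of one-letter labels starting with its I frame; the function name is hypothetical.

```python
def split_minigops(gop):
    """Split one GOP (display order, starting with its I frame) into minigops,
    where each minigop is a run of B frames closed by the next P frame."""
    minigops, current = [], []
    for ftype in gop[1:]:          # skip the leading I frame
        current.append(ftype)
        if ftype == "P":           # a P frame closes the current minigop
            minigops.append(current)
            current = []
    if current:                    # trailing B frames with no closing P
        minigops.append(current)
    return minigops

gop = list("IBBPBBP")
print(split_minigops(gop))  # [['B', 'B', 'P'], ['B', 'B', 'P']]
```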
- Video encoding is the basis of video processing. Excellent encoding capabilities can provide products with high-definition and smooth playback experience, which plays an important role in improving the quality of experience (QoE) and quality of service (QoS).
- QoE quality of experience
- QoS quality of service
- Scenarios such as live broadcast, real-time communication (RTC), cloud rendering, and cloud desktop have relatively high requirements for the frame stability of the video encoding kernel.
- The computing power consumed by the video encoding kernel is related to the complexity of the video image. Static images, and images with small motion and texture changes, are easier to compress and consume relatively little encoding computing power. For images with complex motion textures, however, the computing power consumed by compression encoding is relatively large. If the texture of the video image being compressed is relatively complex and scene switching is relatively frequent, the computing power consumed for video encoding compression will be uneven. Large fluctuations in computing power lead to large fluctuations in the CPU consumption of the encoding processing server.
- The existing video encoding kernel sets the relevant encoding parameters before encoding (such as encoding complexity, bit rate, number of lookahead reference frames, KEY GOP size, whether to enable B frames, rate control mode, ME- and MC-related algorithms, and whether to enable related algorithms in preprocessing).
- Once these encoding parameters are set, some encoding-related processing algorithms and configurations are fixed when the video source is input for encoding, such as computationally intensive coding unit division, MC, ME, transformation, preprocessing, and lookahead.
- The same encoding parameters are used for every frame in the video.
- If more, and more complex, encoding parameters are set, server deployment costs will be unnecessarily high for static images with small motion and texture changes. If fewer and simpler encoding parameters are set, the computing power for video encoding compression will be insufficient for images with complex motion textures and frequent scene switching, resulting in poor frame output stability of the video encoding kernel.
- The video processing method provided in the embodiment of the present application decomposes the encoding task of the input video into separately encoding the N video frame sequences contained in the input video. When encoding the current video frame sequence (i.e., the i-th video frame sequence), the encoding parameters of the current video frame sequence are adaptively determined according to at least one of: the computing power consumed when encoding and/or decoding the previous video frame sequence (i.e., the i-1th video frame sequence), the first image attribute of the first video frame image in the current video frame sequence, and the second image attribute of the second video frame image in the previous video frame sequence.
- In this way, the determined encoding parameters can meet the encoding requirements of the current video frame sequence, improve frame output stability, and reduce server deployment costs.
- FIG. 1 is an application environment diagram of the video processing method provided in the embodiment of the present application.
- the video processing method provided in the embodiment of the present application is applied to a video processing system.
- the video processing system includes: a server and a terminal device; wherein the server can be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
- CDN Content Delivery Network
- the terminal can be a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited to this.
- the terminal and the server can be directly or indirectly connected by wired or wireless communication, and the embodiment of the present application is not limited here.
- The server first obtains N video frame sequences of the input video, wherein each video frame sequence includes at least one video frame image. Second, the server obtains the i-th video frame sequence and the adjacent i-1-th video frame sequence from the N video frame sequences. Third, the server obtains the first video frame image from the i-th video frame sequence and the second video frame image from the i-1-th video frame sequence, wherein the first video frame image corresponds to a first image attribute and the second video frame image corresponds to a second image attribute. Then, the server obtains the first computing power corresponding to the i-1-th video frame sequence, wherein the first computing power is used to characterize the computing power consumed when encoding the i-1-th video frame sequence. Next, the server determines the encoding parameters of the i-th video frame sequence according to at least one of the first computing power, the first image attribute, and the second image attribute, so as to encode the i-th video frame sequence.
- the video processing method in the present application is introduced below.
- the execution subject of the video processing method is a computer device, for example, a server.
- Figure 2a is a flow chart of the video processing method provided in the embodiment of the present application.
- Figure 2b is a schematic diagram of the implementation architecture of the video processing method provided in the embodiment of the present application.
- the video processing method provided in the embodiment of the present application includes: Step S110 to Step S160. Specifically:
- Each video frame sequence includes at least one video frame image, and N is an integer greater than 1.
- the input video is a video to be processed or a video to be encoded.
- the computer device used to execute the method provided in the embodiment of the present application may first obtain the input video, and then segment the input video to obtain the N video frame sequences in the input video.
- other devices may perform the segmentation process on the input video to obtain the N video frame sequences in the input video, and then transmit the N video frame sequences in the input video to the computer device used to execute the method provided in the embodiment of the present application.
- the computer device of the method is configured to obtain N video frame sequences in the input video.
- the embodiment of the present application does not limit the method for obtaining the N video frame sequences in the input video.
- the input video may be segmented by a scene recognition model, by setting a fixed time or frame number, or by manually segmenting the input video, which is not limited in the embodiments of the present application.
- Segmenting the input video through a scene recognition model means that N scenes in the input video are identified through a trained scene recognition model, and the input video is segmented using scene switching as the video segmentation point to obtain N sub-video segments, and the sub-video segments are represented in the form of video frame sequences, each of which includes all frame images in the sub-video segment.
- the scene recognition model can be a feature extraction and classification model based on deep learning and neural networks. Commonly used algorithms include CNN (Convolutional Neural Network), decision tree, random forest, etc. Taking deep learning as an example, deep learning CNN, CNN+RNN (Recurrent Neural Network) and other algorithm frameworks are used to generate images from video frames, and the images generated from video frames are used as training samples for deep learning related models.
- Segmenting the input video by setting a fixed time or number of frames means that the input video is segmented by the set value by presetting the video segmentation time or the video segmentation frame number, thereby obtaining a number of segmented sub-video segments, and representing the sub-video segments in the form of video frame sequences, each video frame sequence including all frame images in the sub-video segment.
- Manually segmenting the input video means that the input video is segmented by manually taking scene switching in the video as video segmentation points to obtain N sub-video segments, and the sub-video segments are represented in the form of video frame sequences, each of which includes all frame images in the sub-video segment.
- the input video contains three scenes: a meeting scene, a theater scene, and a swimming scene, wherein the scene complexity of the meeting scene is simple, the scene complexity of the theater scene is medium, and the scene complexity of the swimming scene is complex.
- the method of segmenting the input video by the scene recognition model is: the input video is used as the input of the trained scene recognition model, the scene recognition model recognizes that the input video contains three scenes, outputs three sub-video segments corresponding to the three scenes, and each sub-video segment is represented in the form of a video frame sequence.
- the method of segmenting the input video by setting a fixed time or number of frames is: set every 15 seconds as the video segmentation interval, segment the input video, and obtain several sub-video segments, so that the time of each sub-video segment is 15 seconds or less (such as the duration of the last sub-video segment of the input video can be less than 15 seconds).
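The fixed-interval segmentation described above can be sketched as follows. This is an illustration only: the function name `split_fixed_interval`, the half-open frame-index ranges, and the 25 fps frame rate are assumptions that do not appear in the application.

```python
def split_fixed_interval(total_frames, fps, interval_seconds=15):
    """Split a video into half-open frame ranges [start, end) of at most interval_seconds."""
    frames_per_segment = int(fps * interval_seconds)
    if frames_per_segment <= 0:
        raise ValueError("fps * interval_seconds must be at least one frame")
    return [
        (start, min(start + frames_per_segment, total_frames))
        for start in range(0, total_frames, frames_per_segment)
    ]


# A 40-second video at 25 fps (1000 frames) cut every 15 seconds.
segments = split_fixed_interval(total_frames=1000, fps=25)
```

With these inputs the last segment covers frames 750 to 999, that is, 10 seconds, which matches the remark that the final sub-video segment may be shorter than the configured interval.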
- Segmenting the input video manually means that the input video is segmented by manually using the scene switching in the video as the video segmentation point, and the video segmentation is performed at each scene switching, and an input video is segmented into three sub-video segments, and each sub-video segment is represented in the form of a video frame sequence.
- The i-th video frame sequence and the i-1-th video frame sequence are adjacent to each other in the target video, and i is an integer greater than 1.
- the target video is divided into 5 video frame sequences, and the first video frame sequence and the second video frame sequence are obtained, or the second video frame sequence and the third video frame sequence are obtained, or the third video frame sequence and the fourth video frame sequence are obtained, or the fourth video frame sequence and the fifth video frame sequence are obtained.
- the first video frame image corresponds to a first image attribute
- the second video frame image corresponds to a second image attribute
- the first image attribute and the second image attribute are respectively used to characterize the texture complexity information and/or scene complexity information of the corresponding video frame image.
- the first video frame image can be an IDR frame in the i-th video frame sequence
- the second video frame image can be an IDR frame in the i-1-th video frame sequence.
- Each video frame image corresponds to an image attribute, and the image attribute is used to characterize the texture complexity information and scene complexity information of the video frame image.
- the basic image attributes corresponding to the video frame image may include pixels, resolution, size, color, bit depth, hue, saturation, brightness, color channels, image levels, etc.; in an embodiment of the present application, the first image attribute and the second image attribute can be at least one of the above-mentioned basic image attributes, or a combination of multiple ones, or other forms of attribute representation information determined based on at least one or more basic image attributes, which can characterize the texture complexity information and/or scene complexity information of the corresponding video frame image.
- the image attributes corresponding to the video frame image can also be represented by the computing resources consumed by decoding the video frame image. In some cases, the current input video to be encoded needs to be obtained through decoding processing.
- the computing resources consumed by decoding each video frame image can be recorded, and then the image attributes corresponding to the video frame image can be determined accordingly. It should be understood that the more computing resources consumed in decoding a certain video frame image, the higher the texture complexity and/or the higher the scene complexity corresponding to the video frame image. That is, as long as the first image attribute and the second image attribute in the embodiment of the present application can represent the texture complexity information and/or scene complexity information of the corresponding video frame image, the embodiment of the present application does not impose any limitation on their expression form.
- the above-mentioned first image attribute can be obtained by identifying or extracting the first video frame image after acquiring the first video frame image; or it can also be predetermined and stored information. After acquiring the first video frame image, the first image attribute corresponding to the first video frame image can be retrieved from the stored information.
- the above-mentioned second image attribute can be obtained by identifying or extracting the second video frame image after acquiring the second video frame image; or it can also be predetermined and stored information. After acquiring the second video frame image, the second image attribute corresponding to the second video frame image can be retrieved from the stored information.
- the embodiment of the present application does not impose any limitation on the method of acquiring the first image attribute and the second image attribute.
- Texture complexity information includes simple texture, general texture, medium texture and complex texture.
- When encoding video frame images with simple textures, fewer and simpler encoding parameters can be used, and less computing power is consumed.
- When encoding video frame images with complex textures, more, and more complex, encoding parameters need to be used in order to ensure the encoding quality (higher frame output stability), and more computing power is consumed.
- Texture complexity information analysis methods include: Euclidean distance, statistical histogram, LBP (Local Binary Pattern) detection algorithm, CNN feature extraction and classification algorithms, and image complexity estimation methods based on edge features.
- Commonly used edge detection operators include Canny, Sobel, and Roberts, among others. This application does not impose any restriction here.
- LBP refers to the local binary pattern, which is an operator used to describe the local features of an image.
- the LBP feature has significant advantages such as grayscale invariance and rotation invariance.
- The original LBP operator is defined on a 3×3 window. With the center pixel of the window as the threshold, the grayscale values of the 8 adjacent pixels are compared with it. If a surrounding pixel value is greater than the center pixel value, the position of that pixel is marked as 1; otherwise it is marked as 0.
- In this way, comparing the 8 pixels in the 3×3 neighborhood generates an 8-bit binary number (usually converted to a decimal number, i.e., the LBP code, with 256 possible values). This is the LBP value of the central pixel of the window, and it is used to reflect the texture information of the area.
- the detection area can be automatically adjusted according to the video screen resolution and computing power, or the screen resolution can be adjusted by downsampling.
- The LBP values of the detected areas across the whole picture are aggregated. For example, depending on where 80% of the LBP values are concentrated: below 50 indicates a simple texture, between 50 and 100 a general texture, between 100 and 150 a medium texture, and above 150 a complex texture.
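The LBP computation and the threshold-based texture grading described above can be sketched as follows. Several points are assumptions made for illustration and are not stated in the application: the clockwise bit order starting at the top-left neighbor, the reading of "80% of the LBP values are concentrated" as the value below which 80% of the codes fall, the handling of values that land exactly on 50, 100 or 150, and the names `lbp_code`, `lbp_codes` and `texture_class`.

```python
import math

# Neighbor offsets in clockwise order starting at the top-left pixel (assumed bit order).
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, row, col):
    """Return the 8-bit LBP code of the pixel at (row, col) in a 2D grayscale image."""
    center = image[row][col]
    code = 0
    for d_row, d_col in NEIGHBOR_OFFSETS:
        # A neighbor strictly greater than the center contributes a 1 bit, otherwise 0.
        bit = 1 if image[row + d_row][col + d_col] > center else 0
        code = (code << 1) | bit
    return code


def lbp_codes(image):
    """Return the LBP codes of all interior pixels (border pixels lack a full 3x3 window)."""
    return [
        lbp_code(image, row, col)
        for row in range(1, len(image) - 1)
        for col in range(1, len(image[0]) - 1)
    ]


def texture_class(codes, fraction=0.8):
    """Grade texture by the value below which `fraction` of the LBP codes fall (assumed reading)."""
    ordered = sorted(codes)
    pivot = ordered[max(0, math.ceil(fraction * len(ordered)) - 1)]
    if pivot < 50:
        return "simple"
    if pivot < 100:
        return "general"
    if pivot <= 150:
        return "medium"
    return "complex"


image = [
    [5, 9, 1],
    [4, 6, 7],
    [2, 3, 8],
]
center_code = lbp_code(image, 1, 1)   # neighbors 9, 7 and 8 exceed the center value 6
label = texture_class(lbp_codes(image))
```

In practice the codes would be gathered over the detection area mentioned above (possibly after downsampling) rather than over a single 3×3 window.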
- Scene complexity information includes simple scenes, general scenes, medium scenes and complex scenes.
- simple scenes include desktop scenes, meeting scenes, etc.
- general scenes include show scenes, TV drama scenes, etc.
- medium scenes include animation scenes, outdoor scenes, etc.
- complex scenes include game scenes, swimming scenes, etc.
- the scene complexity information can be classified into four categories: simple scenes, general scenes, medium scenes and complex scenes using a feature extraction and classification network based on deep learning and neural networks.
- Commonly used algorithms include CNN, decision tree, random forest, etc.
- deep learning CNN, CNN+RNN and other algorithm frameworks are used to recognize video frame images, and video frame images are used as training samples for deep learning related models.
- the accuracy of scene recognition through a pure CNN network (convolutional layer, filtering, pooling layer, etc.) model with obvious picture features can reach more than 99%; for scenes with relatively scattered picture features (such as TV drama scenes, outdoor sports scenes, food scenes, travel scenes, etc.), the accuracy of scene recognition through CNN combined with RNN+LSTM for time domain + frequency domain analysis can reach about 90%.
- the first computing power is used to characterize the computing power consumed when encoding and/or decoding the i-1th video frame sequence.
- it can be the server computing power consumed when performing video encoding, audio encoding, video decoding, audio decoding, etc. on the i-1th video frame sequence.
- the first computing power correspondingly refers to the computing power consumed by the device, and the embodiments of the present application do not impose any limitations on this.
- the purpose of the embodiment of the present application is to perform video encoding on i video frame sequences.
- the task of performing video encoding on the i-1th video frame sequence has been completed, and after completing the task of performing video encoding on the i-1th video frame sequence, the server consumption value for performing video encoding on the i-1th video frame sequence (the server computing amount occupied during video encoding) is calculated, and the calculated server consumption value is used as the first computing power corresponding to the i-1th video frame sequence.
- The server consumption value for performing video decoding on the i-1th video frame sequence (the server computing amount occupied during video decoding) can also be calculated, and the calculated server consumption value is used as the first computing power corresponding to the i-1th video frame sequence.
- When the first computing power is the computing power consumed when decoding the i-1th video frame sequence, it can represent the texture complexity and/or scene complexity of the video frame images in the i-1th video frame sequence; that is, the larger the first computing power, the higher the texture complexity and/or scene complexity of the video frame images in the i-1th video frame sequence.
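One way to record a consumption value for a completed encoding (or decoding) task is to measure the CPU time it used. The sketch below is an assumption-laden illustration: the application does not say how the server consumption value is measured, `measure_computing_power` is a hypothetical helper, and `toy_encoder` is a stand-in run-length encoder rather than a video codec.

```python
import time


def measure_computing_power(process_fn, frame_sequence):
    """Run process_fn on a frame sequence and return (result, CPU seconds consumed)."""
    start = time.process_time()
    result = process_fn(frame_sequence)
    consumed = time.process_time() - start
    return result, consumed


def toy_encoder(frames):
    """Stand-in encoder (hypothetical): run-length encodes a list of frame labels."""
    encoded = []
    for frame in frames:
        if encoded and encoded[-1][0] == frame:
            encoded[-1] = (frame, encoded[-1][1] + 1)
        else:
            encoded.append((frame, 1))
    return encoded


# The measured value for sequence i-1 would serve as the "first computing power".
encoded, first_power = measure_computing_power(toy_encoder, ["a", "a", "b", "b", "b", "c"])
```

CPU time is only one possible proxy; a deployment could equally use CPU utilization samples or another unit, as long as the thresholds are expressed in the same unit.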
- S150 Determine an encoding parameter of the i-th video frame sequence according to at least one of the first computing power, the first image attribute, and the second image attribute.
- The encoding parameters of the i-th video frame sequence are adjusted and determined on the basis of the encoding parameters of the i-1th video frame sequence.
- the first computing power threshold refers to a computing power threshold provided by the server for encoding.
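- For example, the first computing power threshold provided by the server for encoding may be 1000. If the first computing power corresponding to the i-1-th video frame sequence exceeds 1000, the computing power needs to be reduced when encoding the i-th video frame sequence.
- That is, the encoding parameters of the i-th video frame sequence need to be lowered.
- If the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is greater than the attribute level of the second image attribute, the encoding parameters of the i-th video frame sequence are lowered, wherein the first computing power threshold is greater than the second computing power threshold.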
- the first computing power threshold and the second computing power threshold refer to two different computing power thresholds provided by the server for encoding.
- the computing power range provided by the server for encoding is 800 to 1000, that is, the first computing power threshold can be 1000, the second computing power threshold can be 800, and the first computing power corresponding to the i-1th video frame sequence is 900, and the first computing power is greater than the second computing power threshold and less than the first computing power threshold.
- the attribute level of the first image attribute is greater than the attribute level of the second image attribute, which means that the computing power required for encoding the i-th video frame sequence will be greater than the computing power for encoding the i-1-th video frame sequence, and the computing power for encoding the i-th video frame sequence needs to be reduced.
- the encoding parameters of the i-th video frame sequence need to be lowered.
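- If the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is equal to the attribute level of the second image attribute, the encoding parameters of the i-th video frame sequence are kept the same as the encoding parameters of the i-1th video frame sequence.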
- the first computing power threshold and the second computing power threshold refer to two different computing power thresholds provided by the server for encoding.
- the computing power interval provided by the server for encoding is 800 to 1000, that is, the first computing power threshold can be 1000, the second computing power threshold can be 800, the first computing power corresponding to the i-1th video frame sequence is 900, and the first computing power is greater than the second computing power threshold and less than the first computing power threshold.
- the attribute level of the first image attribute is equal to the attribute level of the second image attribute, indicating that the computing power required to encode the i-th video frame sequence will be equal to or close to the computing power when encoding the i-1th video frame sequence, without adjusting the encoding parameters of the i-th video frame sequence.
- the first computing power threshold and the second computing power threshold refer to two different computing power thresholds provided by the server for encoding.
- the computing power range provided by the server for encoding is 800 to 1000, that is, the first computing power threshold can be 1000, the second computing power threshold can be 800, and the first computing power for encoding the i-1-th video frame sequence is 900.
- the first computing power is greater than the second computing power threshold and less than the first computing power threshold.
- The attribute level of the first image attribute is less than the attribute level of the second image attribute, which means that the computing power required to encode the i-th video frame sequence will be less than the computing power when encoding the i-1th video frame sequence.
- The encoding parameters of the i-th video frame sequence can therefore be increased.
- the second computing power threshold refers to a computing power threshold provided by the server for encoding.
- the second computing power threshold provided by the server for encoding may be 800, and the first computing power for encoding the i-1th video frame sequence is 700, and the first computing power is less than the second computing power threshold.
- the encoding parameters of the i-th video frame sequence may be adjusted upward.
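The cases described above can be gathered into a single decision function. This is a sketch under stated assumptions rather than the claimed implementation: the thresholds 1000 and 800 are the example values from the text, attribute levels are assumed to be comparable integers, values exactly equal to a threshold are treated as lying inside the band (the text leaves this open), and the names `Adjustment` and `decide_adjustment` are hypothetical.

```python
from enum import Enum


class Adjustment(Enum):
    LOWER = "lower"
    KEEP = "keep"
    RAISE = "raise"


def decide_adjustment(first_power, level_current, level_previous,
                      first_threshold=1000, second_threshold=800):
    """Decide how to adjust the encoding parameters of sequence i.

    first_power     computing power consumed for sequence i-1
    level_current   attribute level of the first image attribute (sequence i)
    level_previous  attribute level of the second image attribute (sequence i-1)
    """
    if first_power > first_threshold:
        return Adjustment.LOWER      # previous sequence exceeded the upper threshold
    if first_power < second_threshold:
        return Adjustment.RAISE      # previous sequence left computing power unused
    # Inside the band: compare the complexity levels of the two sequences.
    if level_current > level_previous:
        return Adjustment.LOWER
    if level_current < level_previous:
        return Adjustment.RAISE
    return Adjustment.KEEP


decisions = [
    decide_adjustment(1100, 2, 2),   # above the first threshold
    decide_adjustment(900, 3, 2),    # in band, current sequence more complex
    decide_adjustment(900, 2, 2),    # in band, same complexity
    decide_adjustment(900, 1, 2),    # in band, current sequence simpler
    decide_adjustment(700, 3, 1),    # below the second threshold
]
```

The returned decision would then be applied to concrete parameters such as the division depths and motion estimation settings listed in the text.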
- the above-mentioned coding parameters may specifically be video coding parameters, audio coding parameters, etc.
- the coding parameters may include but are not limited to coding unit division depth, prediction unit division depth, motion estimation parameters, motion compensation parameters, transform unit division depth, etc.
- the embodiment of the present application does not impose any limitation on the specific contents of the coding parameters.
- encoding the i-th video frame sequence according to the encoding parameters of the i-th video frame sequence specifically means encoding all video frame images in the i-th video frame sequence according to the encoding parameters of the i-th video frame sequence to obtain an encoded image of each video frame image, and all the encoded images constitute the i-th encoded video segment.
- step S160 is an optional step.
- the above steps S110 to S150 may be performed by a certain computer device to determine the encoding parameters of the i-th video frame sequence, and then, the encoding parameters may be transmitted to other video encoding devices, so that the video encoding device encodes the i-th video frame sequence in the input video according to the encoding parameters to obtain the i-th encoded video fragment.
- the above steps S110 to S160 may be performed by the same computer device, that is, the same computer device determines the encoding parameters of the i-th video frame sequence, and encodes the i-th video frame sequence in the input video accordingly to obtain the i-th encoded video fragment.
- the embodiment of the present application does not impose any limitation on the execution subject of the encoding process of the i-th video frame sequence.
- the video data before encoding is referred to as a video frame sequence
- the video data after encoding is referred to as an encoded video segment.
- the video frame sequence includes video frame images that are independent of each other
- the encoded video segment includes encoding data corresponding to each video frame image
- the encoding data can represent the content of the video frame image itself, and can also represent the difference between the video frame image and other adjacent video frame images.
- the video frame sequence and the encoded video segment have different formats, and the encoded video segment can be understood as a compressed video frame sequence, and the file size of the encoded video segment is smaller than the video frame sequence.
- In the video processing method provided in the embodiment of the present application, when encoding the input video, if the scene in the input video is unique and fixed, the encoding parameters are matched according to the scene, and the input video is encoded with the matched encoding parameters, so that the encoding parameters meet the encoding requirements of the current video frame sequence, improve frame output stability, and reduce server deployment costs. If the input video includes multiple scenes, the input video is segmented according to the scene changes to obtain N video frame sequences, and the encoding task of the input video is decomposed into separately encoding the N video frame sequences contained in the input video.
- When encoding the current video frame sequence (i.e., the i-th video frame sequence), the encoding parameters of the current video frame sequence are adaptively determined according to at least one of: the computing power consumed when encoding and/or decoding the previous video frame sequence (i.e., the i-1th video frame sequence), the first image attribute of the first video frame image in the current video frame sequence, and the second image attribute of the second video frame image in the previous video frame sequence.
- the encoding parameters of the current video frame sequence are set so that the determined encoding parameters can meet the encoding requirements of the current video frame sequence, improve frame output stability, and reduce server deployment costs.
- the encoding parameters include the coding unit division depth
- FIG. 3a is a schematic diagram of the determination process of the coding unit division depth
- FIG. 3b is a schematic diagram of the determination architecture of the coding unit division depth
- step S150 includes sub-steps S1511 to S1514. Specifically:
- FIG. 4 is a schematic diagram of the depth division of the coding unit provided in the embodiment of the present application.
- The video frame image is sent to the encoder and is first divided into coding tree units (Coding Tree Unit, CTU) according to a 64×64 block size. Each CTU is then divided in depth to obtain coding units (Coding Unit, CU); that is, a coding unit is a finer-grained unit obtained by dividing a coding tree unit in depth.
- the depth division of each CTU adopts the top-down division rule, as shown in Table 1.
- The larger the coding unit division depth, the greater the computing power required to encode the image. When the computing power needs to be reduced, the coding unit division depth can be reduced.
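The relationship between division depth and coding unit size can be sketched as follows. Table 1 is not reproduced in this section, so the sketch assumes the usual quadtree rule in which each additional depth level halves the side length of a 64×64 CTU, with depths 0 to 3 as in HEVC; the names `cu_size`, `cu_count` and `lower_depth` are hypothetical.

```python
CTU_SIZE = 64   # CTU side length in pixels, as stated in the text
MAX_DEPTH = 3   # assumed maximum division depth (64 -> 32 -> 16 -> 8)


def _check_depth(depth):
    if not 0 <= depth <= MAX_DEPTH:
        raise ValueError(f"depth must be between 0 and {MAX_DEPTH}")


def cu_size(depth):
    """Side length of a coding unit at the given division depth."""
    _check_depth(depth)
    return CTU_SIZE >> depth


def cu_count(depth):
    """Number of coding units in one CTU if it is split uniformly to this depth."""
    _check_depth(depth)
    return 4 ** depth


def lower_depth(depth, step=1):
    """Reduce the division depth to save computing power, never going below 0."""
    _check_depth(depth)
    return max(0, depth - step)


sizes = [cu_size(d) for d in range(MAX_DEPTH + 1)]
counts = [cu_count(d) for d in range(MAX_DEPTH + 1)]
reduced = lower_depth(3)
```

The count grows by a factor of four per level, which gives a rough sense of why a larger maximum depth gives the encoder more candidate partitions to evaluate and therefore costs more computing power.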
- Condition 1 The first computing power is greater than the first computing power threshold.
- Condition 2 The first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is greater than the attribute level of the second image attribute.
- the first computing power threshold and the second computing power threshold refer to two different computing power thresholds provided for encoding, which can be independently defined according to the computing power consumption in actual applications.
- the computing power range provided by the server for encoding is 800 to 1000, that is, the first computing power threshold can be set to 1000, and the second computing power threshold can be set to 800.
- the above-mentioned attribute level is determined according to the corresponding image attributes.
- the attribute level is used to characterize the texture complexity level and/or scene complexity level of the video frame image; generally, the higher the attribute level, the higher the texture complexity and/or the higher the scene complexity of the corresponding video frame image.
- the attribute level can be determined based on the video frame image itself and/or the image attribute of the video frame image through a neural network model.
- the attribute level corresponding to the image attribute can be determined based on the image attribute of the video frame image according to a pre-set attribute level classification rule. The embodiment of the present application does not impose any limitation on the method for determining the attribute level.
- The attribute level of the image attribute may include three levels, namely, level one, level two, and level three.
- From level one through level two to level three, the higher the level, the higher the texture complexity and/or the scene complexity of the corresponding video frame image.
- more or fewer attribute levels may also be divided, and the embodiments of the present application do not impose any limitation on this.
- the first computing power consumed by the video encoding of the i-1th video frame sequence exceeds the first computing power threshold, and the computing power of the video encoding of the i-th video frame sequence needs to be reduced.
- the first computing power of the i-1th video frame sequence for video encoding is greater than the second computing power threshold and less than the first computing power threshold.
- the attribute level of the second video frame image in the i-1th video frame sequence is less than the attribute level of the first video frame image in the i-th video frame sequence (the texture complexity information and scene complexity information of the first video frame image are higher than the texture complexity information and scene complexity information of the second video frame image), and the computing power of video encoding of the i-th video frame sequence needs to be reduced.
- When condition 3 is met, there is no need to adjust the computing power when encoding the i-th video frame sequence.
- Condition 3: the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is equal to the attribute level of the second image attribute. Since there is no need to adjust the computing power when encoding the i-th video frame sequence, the first coding unit division depth of the i-th video frame sequence can be kept equal to the second coding unit division depth of the i-1th video frame sequence.
- the first computing power for video encoding of the i-1th video frame sequence is greater than the second computing power threshold and less than the first computing power threshold.
- the attribute level of the second video frame image in the i-1th video frame sequence is equal to the attribute level of the first video frame image in the i-th video frame sequence (the texture complexity information and scene complexity information of the first video frame image are unchanged relative to the texture complexity information and scene complexity information of the second video frame image), and there is no need to adjust the computing power of video encoding of the i-th video frame sequence.
- the computing power of the i-th video frame sequence during video encoding can be increased.
- Condition 4: the first computing power is less than the second computing power threshold.
- Condition 5: the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is less than the attribute level of the second image attribute.
- the first computing power for video encoding of the i-1th video frame sequence is less than the second computing power threshold.
- the computing power for video encoding of the i-th video frame sequence can be increased.
- the first computing power for encoding the i-1th video frame sequence is greater than the second computing power threshold and less than the first computing power threshold.
- the attribute level of the second video frame image in the i-1th video frame sequence is greater than the attribute level of the first video frame image in the i-th video frame sequence (the texture complexity information and scene complexity information of the first video frame image are lower than the texture complexity information and scene complexity information of the second video frame image).
- the computing power for video encoding of the i-th video frame sequence can be increased.
- the method provided in the embodiment of the present application when encoding the current video frame sequence, adaptively adjusts the coding unit division depth of the current video frame sequence according to the server computing power consumed by the previous video frame sequence, the first image attribute of the first video frame image in the current video frame sequence, and the second image attribute of the second video frame image in the previous video frame sequence, so that the adjusted coding unit division depth can meet the encoding requirements of the current video frame sequence, improve frame output stability, and reduce server deployment costs.
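- The five conditions described above can be summarized as a small decision function. This is a minimal sketch rather than the embodiment's implementation: the function and parameter names are hypothetical, and because the text does not say what happens when the first computing power is exactly equal to a threshold, the sketch assumes that such values fall inside the band between the two thresholds.

```python
def decide_adjustment(first_power, first_threshold, second_threshold,
                      current_level, previous_level):
    """Return "reduce", "keep" or "increase" for the i-th video frame sequence.

    first_power     computing power consumed encoding sequence i-1
    first_threshold upper threshold (for example 1000)
    second_threshold lower threshold (for example 800)
    current_level   attribute level of the first image attribute (sequence i)
    previous_level  attribute level of the second image attribute (sequence i-1)
    """
    if first_power > first_threshold:       # condition 1
        return "reduce"
    if first_power < second_threshold:      # condition 4
        return "increase"
    # Between the thresholds (boundary values included by assumption).
    if current_level > previous_level:      # condition 2
        return "reduce"
    if current_level < previous_level:      # condition 5
        return "increase"
    return "keep"                           # condition 3


print(decide_adjustment(1100, 1000, 800, current_level=2, previous_level=2))
print(decide_adjustment(900, 1000, 800, current_level=2, previous_level=2))
```

- The same decision applies to every encoding parameter discussed in this description; only the parameter that is reduced, kept, or increased differs.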
- the encoding parameters include the prediction unit division depth
- FIG. 5a is a schematic diagram of the determination process of the prediction unit division depth
- FIG. 5b is a schematic diagram of the determination architecture of the prediction unit division depth
- step S150 includes sub-steps S1521 to S1524. Specifically:
- FIG. 6 is a schematic diagram of the depth division of the prediction unit provided in an embodiment of the present application.
- the video frame image is sent to the encoder, first divided into coding tree units (CTUs) according to a 64×64 block size, and then each CTU is deeply divided to obtain coding units (CUs).
- Each coding unit CU includes a prediction unit (Prediction Unit, PU) and a transform unit (Transform Unit, TU); that is, a prediction unit is a finer-grained unit obtained by deeply dividing the coding unit.
- each CU adopts the top-down division rule.
- the size of the prediction unit PU in each CU is kept equal to the size of the CU. For example, if the CU block size is 64×64, the PU size is also 64×64; if the CU block size is 32×32, the PU size is also 32×32; if the CU block size is 16×16, the PU size is also 16×16; if the CU block size is 8×8, the PU size is also 8×8.
- the CU block is divided into 2 PU blocks.
- the division methods include 2 uniform divisions and 4 uneven divisions.
- For example, if the CU block size is 64×64 and it is divided uniformly, the sizes of the two PUs are both 64×32, or both 32×64. If it is divided unevenly, the sizes of the two PUs are 64×16 and 64×48, or 64×48 and 64×16, or 16×64 and 48×64, or 48×64 and 16×64.
- the CU block is divided into 4 PU blocks.
- For example, if the CU block size is 64×64, the size of each PU is 32×32; if the CU block size is 32×32, the size of each PU is 16×16; if the CU block size is 16×16, the size of each PU is 8×8; if the CU block size is 8×8, the size of each PU is 4×4. Since the greater the division depth of the prediction unit, the greater the computing power required to encode the image, the division depth of the prediction unit can be reduced when the computing power needs to be reduced.
- the prediction unit depth division diagram shown in FIG. 6 is only an example; that is, the specific numerical values shown in FIG. 6 are only examples, and the division method of the coding unit and the prediction unit is not specifically limited in the embodiment of the present application.
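- The PU layouts listed above (one PU, two uniform halves, two uneven parts, four quarters) can be tabulated for any square CU. The sketch below is illustrative: the mode names are hypothetical, the sizes are written in the same order as in the text, and extending the 16/48 split of the 64×64 example to a quarter/three-quarter split for other CU sizes is an assumption.

```python
def pu_sizes(cu_size, mode):
    """Return the sizes of the prediction units inside one square CU.

    Sizes are tuples in the order used in the text (for example (64, 32)
    for 64x32). Mode names are made up for this sketch.
    """
    n, half, quarter = cu_size, cu_size // 2, cu_size // 4
    layouts = {
        "single": [(n, n)],                               # PU equals CU
        "two_even_a": [(n, half), (n, half)],             # 64x32 + 64x32
        "two_even_b": [(half, n), (half, n)],             # 32x64 + 32x64
        "two_uneven_a": [(n, quarter), (n, n - quarter)], # 64x16 + 64x48
        "two_uneven_b": [(n, n - quarter), (n, quarter)], # 64x48 + 64x16
        "two_uneven_c": [(quarter, n), (n - quarter, n)], # 16x64 + 48x64
        "two_uneven_d": [(n - quarter, n), (quarter, n)], # 48x64 + 16x64
        "four": [(half, half)] * 4,                       # four 32x32
    }
    return layouts[mode]


print(pu_sizes(64, "two_uneven_a"))
print(pu_sizes(16, "four"))
```

- In every mode the PU areas add up to the CU area, which is a quick consistency check on a layout table of this kind.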
- Condition 1: the first computing power is greater than the first computing power threshold.
- Condition 2: the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is greater than the attribute level of the second image attribute.
- the first computing power for video encoding on the i-1th video frame sequence exceeds the first computing power threshold, and the computing power for video encoding on the i-th video frame sequence needs to be reduced.
- the first computing power for video encoding on the i-1th video frame sequence is greater than the second computing power threshold and less than the first computing power threshold.
- the attribute level of the second video frame image in the i-1th video frame sequence is less than the attribute level of the first video frame image in the i-th video frame sequence (the texture complexity information and scene complexity information of the first video frame image are higher than the texture complexity information and scene complexity information of the second video frame image), and the computing power of video encoding of the i-th video frame sequence needs to be reduced.
- When condition 3 is met, there is no need to adjust the computing power when encoding the i-th video frame sequence.
- Condition 3: the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is equal to the attribute level of the second image attribute. Since there is no need to adjust the computing power when encoding the i-th video frame sequence, the first prediction unit division depth of the i-th video frame sequence can be kept equal to the second prediction unit division depth of the i-1th video frame sequence.
- the first computing power for video encoding of the i-1th video frame sequence is greater than the second computing power threshold and less than the first computing power threshold.
- the attribute level of the second video frame image in the i-1th video frame sequence is equal to the attribute level of the first video frame image in the i-th video frame sequence (the texture complexity information and scene complexity information of the first video frame image are unchanged relative to the texture complexity information and scene complexity information of the second video frame image), and there is no need to adjust the computing power of video encoding of the i-th video frame sequence.
- the computing power of the i-th video frame sequence during video encoding can be increased.
- Condition 4: the first computing power is less than the second computing power threshold.
- Condition 5: the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is less than the attribute level of the second image attribute.
- the first computing power for video encoding the i-1th video frame sequence is less than the second computing power threshold.
- the computing power for video encoding the i-th video frame sequence can be increased.
- the first computing power for video encoding of the i-1th video frame sequence is greater than the second computing power threshold and less than the first computing power threshold.
- the attribute level of the second video frame image in the i-1th video frame sequence is greater than the attribute level of the first video frame image in the i-th video frame sequence (the texture complexity information and scene complexity information of the first video frame image are lower than the texture complexity information and scene complexity information of the second video frame image).
- the computing power for video encoding of the i-th video frame sequence can be improved.
- the method provided in the embodiment of the present application when encoding the current video frame sequence, adaptively adjusts the prediction unit division depth of the current video frame sequence according to the server computing power consumed by the previous video frame sequence, the first image attribute of the first video frame image in the current video frame sequence, and the second image attribute of the second video frame image in the previous video frame sequence, so that the adjusted prediction unit division depth can meet the encoding requirements of the current video frame sequence, improve frame output stability, and reduce server deployment costs.
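- Once it is known whether the computing power should be reduced, kept, or increased, the division depth of the current sequence can be derived from the depth of the previous sequence. The sketch below is a hedged illustration: the one-step change, the depth range 0 to 2, and the name `adjust_division_depth` are assumptions, since the embodiment only requires the new depth to be lower than, equal to, or higher than the previous one.

```python
def adjust_division_depth(previous_depth, adjustment,
                          min_depth=0, max_depth=2, step=1):
    """Derive the division depth for sequence i from that of sequence i-1.

    adjustment is "reduce", "keep" or "increase". The result is clamped
    to [min_depth, max_depth]; these limits are illustrative only.
    """
    direction = {"reduce": -1, "keep": 0, "increase": 1}[adjustment]
    candidate = previous_depth + direction * step
    return max(min_depth, min(max_depth, candidate))


print(adjust_division_depth(1, "reduce"))
print(adjust_division_depth(2, "increase"))  # already at the upper limit
```

- Clamping means that a sequence already at the shallowest depth cannot be reduced further; in that case another encoding parameter would have to absorb the reduction.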
- the encoding parameters include motion estimation parameters and motion compensation parameters
- FIG. 7a is a schematic diagram of the determination process of the motion estimation parameters and the motion compensation parameters
- FIG. 7b is a schematic diagram of the determination architecture of the motion estimation parameters and the motion compensation parameters
- step S150 includes sub-steps S1531 to S1532. Specifically:
- the first motion estimation parameter is determined by controlling the first maximum pixel range and the first sub-pixel estimation complexity of the motion search
- the second motion estimation parameter is determined by controlling the second maximum pixel range and the second sub-pixel estimation complexity of the motion search; the first maximum pixel range is smaller than the second maximum pixel range, and the first sub-pixel estimation complexity is smaller than the second sub-pixel estimation complexity.
- the first motion compensation parameter is determined by the first search range, and the second motion compensation parameter is determined by the second search range; the first search range is smaller than the second search range.
- the first motion estimation parameter is determined by controlling the first maximum pixel range and the first sub-pixel estimation complexity of the motion search
- the second motion estimation parameter is determined by controlling the second maximum pixel range and the second sub-pixel estimation complexity of the motion search; the first maximum pixel range is equal to the second maximum pixel range, and the first sub-pixel estimation complexity is equal to the second sub-pixel estimation complexity.
- the first motion compensation parameter is determined by the first search range, and the second motion compensation parameter is determined by the second search range; the first search range is equal to the second search range.
- the first motion estimation parameter is determined by controlling the first maximum pixel range and the first sub-pixel estimation complexity of the motion search
- the second motion estimation parameter is determined by controlling the second maximum pixel range and the second sub-pixel estimation complexity of the motion search; the first maximum pixel range is greater than the second maximum pixel range, and the first sub-pixel estimation complexity is greater than the second sub-pixel estimation complexity.
- the first motion compensation parameter is determined by the first search range, and the second motion compensation parameter is determined by the second search range; the first search range is greater than the second search range.
- FIG. 8 is a schematic diagram of motion estimation provided by an embodiment of the present application.
- Motion estimation is to find a suitable matching area (B) in the reference frame for a certain area (A) in the current frame, and the motion estimation parameters are correspondingly the search parameters based on which the matching area in the reference frame is found for a certain area in the current frame.
- the reference frame can be a frame before the current frame or a frame after the current frame.
- the motion estimation parameters include parameters for controlling the motion search, namely the maximum pixel range of the motion search and the sub-pixel estimation complexity.
- the maximum pixel range of the motion search is the maximum motion search range controlled in pixels, including: DIA (diamond), hex (hexagon), umh (uneven multi-hex), esa (exhaustive), and tesa (transformed exhaustive). The computing power required increases in sequence from DIA, hex, umh, esa to tesa; that is, DIA consumes the least computing power, and tesa consumes the most.
- Sub-pixel estimation complexity is used to characterize the complexity of motion estimation, which is divided into 11 levels from 0 to 10. The higher the complexity, the greater the computing power consumed. For example, the computing power consumed by the sub-pixel estimation complexity of 10 is greater than the computing power consumed by the sub-pixel estimation complexity of 0.
- FIG. 9 is a schematic diagram of motion compensation provided by an embodiment of the present application.
- the purpose of motion compensation is to find the difference between area A and area B.
- the motion compensation parameters include a search range. The larger the search range, the greater the computing power consumed.
- Predictive coding based on motion estimation and motion compensation generates motion vectors and residuals.
- the motion vector is the motion trajectory of certain areas with respect to the reference frame, and the residual is the difference between the predicted frame and the current frame generated after the movement of these areas.
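- The relationship between area A, area B, the motion vector, and the residual can be shown with a small block-matching sketch. It is not the embodiment's algorithm: it assumes an exhaustive full-pixel search, uses the sum of absolute differences (SAD) as the matching criterion, and all function names are hypothetical. Frames are plain lists of rows of pixel values.

```python
def extract_block(frame, top, left, size):
    """Return the size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def estimate_and_compensate(current, reference, top, left, size, search_range):
    """Find the best match for area A in the reference frame.

    Returns the motion vector (dy, dx) and the residual (A minus B).
    """
    area_a = extract_block(current, top, left, size)
    height, width = len(reference), len(reference[0])
    best_cost, best_vector = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue  # candidate block would leave the reference frame
            cost = sad(area_a, extract_block(reference, y, x, size))
            if best_cost is None or cost < best_cost:
                best_cost, best_vector = cost, (dy, dx)
    dy, dx = best_vector
    area_b = extract_block(reference, top + dy, left + dx, size)
    residual = [[a - b for a, b in zip(row_a, row_b)]
                for row_a, row_b in zip(area_a, area_b)]
    return best_vector, residual


# Reference frame with a distinct value at every position.
reference = [[10 * y + x for x in range(6)] for y in range(6)]
# Current frame: the 2x2 block at (3, 1) of the reference now sits at (2, 2).
current = [[0] * 6 for _ in range(6)]
for r in range(2):
    for c in range(2):
        current[2 + r][2 + c] = reference[3 + r][1 + c]

vector, residual = estimate_and_compensate(current, reference, 2, 2, 2, search_range=2)
print(vector, residual)
```

- The number of candidate positions grows with the square of `search_range`, which is why a larger search range consumes more computing power. With `search_range=0` the same call cannot find the displaced block and leaves a non-zero residual to be coded.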
- Condition 1: the first computing power is greater than the first computing power threshold.
- Condition 2: the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is greater than the attribute level of the second image attribute.
- the second motion estimation parameter used when encoding the i-1th video frame sequence includes the second maximum pixel range for controlling the motion search as tesa, and the second sub-pixel estimation complexity as level 10.
- the first computing power for encoding the i-1th video frame sequence exceeds the first computing power threshold, and the computing power for video encoding of the i-th video frame sequence needs to be reduced.
- the first motion estimation parameter used when encoding the i-th video frame sequence is adjusted to the first maximum pixel range for controlling the motion search as umh, and the first sub-pixel estimation complexity as level 8, which meets the requirement for reducing the computing power for video encoding of the i-th video frame sequence.
- the first maximum pixel range (umh) is smaller than the second maximum pixel range (tesa)
- the first sub-pixel estimation complexity (level 8) is smaller than the second sub-pixel estimation complexity (level 10).
- the first motion compensation parameter used when encoding the i-th video frame sequence is reduced, so that the first search range in the first motion compensation parameter is smaller than the second search range of the second motion compensation parameter used when encoding the i-1th video frame sequence.
- the second motion estimation parameters used when performing video encoding on the i-1th video frame sequence include the second maximum pixel range for controlling the motion search as tesa, and the second sub-pixel estimation complexity as level 10.
- the first computing power for performing video encoding on the i-1th video frame sequence is greater than the second computing power threshold and less than the first computing power threshold.
- the attribute level of the second video frame image in the i-1th video frame sequence is less than the attribute level of the first video frame image in the i-th video frame sequence (the texture complexity information and scene complexity information of the first video frame image are higher than the texture complexity information and scene complexity information of the second video frame image), and the computing power of video encoding on the i-th video frame sequence needs to be reduced.
- the first motion estimation parameters used when performing video encoding on the i-th video frame sequence are adjusted to the first maximum pixel range for controlling the motion search as umh, and the first sub-pixel estimation complexity as level 8, which meets the requirement of reducing the computing power of video encoding on the i-th video frame sequence.
- the first maximum pixel range (umh) is smaller than the second maximum pixel range (tesa)
- the first sub-pixel estimation complexity (level 8) is smaller than the second sub-pixel estimation complexity (level 10).
- the first motion compensation parameter used when encoding the i-th video frame sequence is reduced so that the first search range in the first motion compensation parameter is smaller than the second search range of the second motion compensation parameter used when performing video encoding on the (i-1)th video frame sequence.
- When condition 3 is met, there is no need to adjust the computing power when the i-th video frame sequence is encoded.
- Condition 3: the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is equal to the attribute level of the second image attribute. Since there is no need to adjust the computing power when the i-th video frame sequence is encoded, the first motion estimation parameter of the i-th video frame sequence can be kept equal to the second motion estimation parameter of the i-1th video frame sequence, and the first motion compensation parameter of the i-th video frame sequence can be kept equal to the second motion compensation parameter of the i-1th video frame sequence.
- the second motion estimation parameter used when performing video encoding on the i-1th video frame sequence includes the second maximum pixel range for controlling the motion search being esa, and the second sub-pixel estimation complexity being level 9.
- the first computing power for performing video encoding on the i-1th video frame sequence is greater than the second computing power threshold and less than the first computing power threshold.
- the attribute level of the second video frame image in the i-1th video frame sequence is equal to the attribute level of the first video frame image in the i-th video frame sequence (the texture complexity information and scene complexity information of the first video frame image are unchanged relative to the texture complexity information and scene complexity information of the second video frame image), and there is no need to adjust the computing power for video encoding of the i-th video frame sequence.
- the first motion estimation parameter of the i-th video frame sequence can be kept equal to the second motion estimation parameter of the i-1th video frame sequence, and the first motion compensation parameter of the i-th video frame sequence can be kept equal to the second motion compensation parameter of the i-1th video frame sequence; that is, the first motion estimation parameter used when performing video encoding on the i-th video frame sequence includes the first maximum pixel range for controlling the motion search being esa, and the first sub-pixel estimation complexity being level 9.
- the first search range in the first motion compensation parameter is kept equal to the second search range in the second motion compensation parameter used when performing video encoding on the (i-1)th video frame sequence.
- the computing power for video encoding of the i-th video frame sequence can be increased.
- Condition 4: the first computing power is less than the second computing power threshold.
- Condition 5: the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is less than the attribute level of the second image attribute.
- the second motion estimation parameter used when encoding the i-1th video frame sequence includes the second maximum pixel range for controlling the motion search as umh, and the second sub-pixel estimation complexity as level 8.
- the first computing power for video encoding the i-1th video frame sequence is less than the second computing power threshold.
- the computing power for video encoding the i-th video frame sequence can be increased.
- the first motion estimation parameter used when encoding the i-th video frame sequence is adjusted to the first maximum pixel range for controlling the motion search as esa, and the first sub-pixel estimation complexity as level 9, which meets the requirement of increasing the computing power for video encoding of the i-th video frame sequence.
- the first maximum pixel range (esa) is greater than the second maximum pixel range (umh)
- the first sub-pixel estimation complexity (level 9) is greater than the second sub-pixel estimation complexity (level 8).
- the first motion compensation parameter used when encoding the i-th video frame sequence is increased so that the first search range in the first motion compensation parameter is larger than the second search range in the second motion compensation parameter used when encoding the i-1-th video frame sequence.
- the second motion estimation parameters used when encoding the i-1th video frame sequence include a second maximum pixel range for controlling motion search as umh and a second sub-pixel estimation complexity as level 8.
- the first computing power for performing video encoding on the i-1th video frame sequence is less than the second computing power threshold. In order to improve the stability of the frame output, the computing power for video encoding of the i-th video frame sequence can be increased.
- the first motion estimation parameter used in video encoding of the i-th video frame sequence is adjusted to control the first maximum pixel range of motion search to esa, and the first sub-pixel estimation complexity to level 9, which meets the requirement of improving the computing power of video encoding of the i-th video frame sequence.
- the first maximum pixel range (esa) is greater than the second maximum pixel range (umh)
- the first sub-pixel estimation complexity (level 9) is greater than the second sub-pixel estimation complexity (level 8).
- the first motion compensation parameter used in video encoding of the i-th video frame sequence is increased, so that the first search range in the first motion compensation parameter is greater than the second search range in the second motion compensation parameter used in video encoding of the i-1-th video frame sequence.
- the method provided in the embodiment of the present application when encoding the current video frame sequence, adaptively adjusts the motion estimation parameters and motion compensation parameters of the current video frame sequence according to the server computing power consumed by the previous video frame sequence, the first image attribute of the first video frame image in the current video frame sequence, and the second image attribute of the second video frame image in the previous video frame sequence, so that the adjusted motion estimation parameters and motion compensation parameters can meet the encoding requirements of the current video frame sequence, improve frame output stability, and reduce server deployment costs.
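- The worked examples above (tesa and level 10 lowered to umh and level 8; umh and level 8 raised to esa and level 9) can be reproduced with a small helper that moves along the ordered list of search modes. This is a sketch under stated assumptions: the step sizes are parameters because the text does not fix them, the numeric search ranges are invented for illustration, and the function name is hypothetical.

```python
SEARCH_ORDER = ("dia", "hex", "umh", "esa", "tesa")  # least to most costly


def adjust_motion_params(search, subpel, search_range, adjustment,
                         search_step=1, subpel_step=1, range_step=8):
    """Derive motion parameters for sequence i from those of sequence i-1.

    search       motion search mode of sequence i-1 (one of SEARCH_ORDER)
    subpel       sub-pixel estimation complexity, 0 to 10
    search_range motion compensation search range (illustrative units)
    adjustment   "reduce", "keep" or "increase"
    """
    direction = {"reduce": -1, "keep": 0, "increase": 1}[adjustment]
    index = SEARCH_ORDER.index(search.lower()) + direction * search_step
    index = max(0, min(len(SEARCH_ORDER) - 1, index))
    new_subpel = max(0, min(10, subpel + direction * subpel_step))
    new_range = max(1, search_range + direction * range_step)
    return SEARCH_ORDER[index], new_subpel, new_range


# Reduce: tesa / level 10 becomes umh / level 8 (two steps each).
print(adjust_motion_params("tesa", 10, 32, "reduce", search_step=2, subpel_step=2))
# Increase: umh / level 8 becomes esa / level 9 (one step each).
print(adjust_motion_params("umh", 8, 16, "increase"))
```

- With `"keep"` all three values are returned unchanged, which corresponds to the case in which the first motion estimation and motion compensation parameters are kept equal to the second ones.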
- the encoding parameter includes the transform unit division depth
- FIG. 10a is a schematic diagram of the determination process of the transform unit division depth
- FIG. 10b is a schematic diagram of the determination architecture of the transform unit division depth
- step S150 includes sub-steps S1541 to S1544. Specifically:
- FIG. 11 is a schematic diagram of the depth division of the transformation unit provided by the embodiment of the present application.
- the video frame image is sent to the encoder, first divided into coding tree units (CTUs) according to the 64×64 block size, and then each CTU is deeply divided to obtain coding units (CUs).
- Each coding unit CU includes a prediction unit PU and a transformation unit TU. That is, the transformation unit is a more fine-grained unit obtained by deeply dividing the coding unit.
- each CU adopts the top-down division rule.
- the size of the transform unit TU in each CU is kept equal to the size of the CU. For example, if the CU block size is 64×64, the size of the TU is also 64×64; if the CU block size is 32×32, the size of the TU is also 32×32; if the CU block size is 16×16, the size of the TU is also 16×16; if the CU block size is 8×8, the size of the TU is also 8×8.
- For example, if the CU block size is 64×64, the size of a TU is 32×32; if the CU block size is 32×32, the size of a TU is 16×16; if the CU block size is 16×16, the size of a TU is 8×8; if the CU block size is 8×8, the size of a TU is 4×4. Since the greater the division depth of the transformation unit, the greater the computing power required to encode the image, the division depth of the transformation unit can be reduced when the computing power needs to be reduced.
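- The two TU layouts described above follow a simple halving pattern. In the sketch below, numbering them as depth 0 (TU equal to CU) and depth 1 (TU side halved, four TUs per CU) is an assumption, as are the function name and the 4×4 minimum, which is simply the smallest size mentioned in the text.

```python
def transform_units(cu_size, depth):
    """Return (tu_side, tu_count) for one square CU at a given TU depth.

    Each extra depth level halves the TU side and quadruples the count.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    tu_side = cu_size >> depth
    if tu_side < 4:
        raise ValueError("TU smaller than 4x4 is outside this sketch")
    return tu_side, 4 ** depth


print(transform_units(64, 0))  # TU equal to the CU
print(transform_units(64, 1))  # four TUs of half the side
```

- More, smaller TUs mean more transforms to evaluate per CU, which matches the statement that a greater transformation unit division depth requires more computing power.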
- Condition 1: the first computing power is greater than the first computing power threshold.
- Condition 2: the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is greater than the attribute level of the second image attribute.
- the first computing power for video encoding on the i-1th video frame sequence exceeds the first computing power threshold, and the computing power for video encoding on the i-th video frame sequence needs to be reduced.
- the first computing power for video encoding on the i-1th video frame sequence is greater than the second computing power threshold and less than the first computing power threshold.
- the attribute level of the second video frame image in the i-1th video frame sequence is less than the attribute level of the first video frame image in the i-th video frame sequence (the texture complexity information and scene complexity information of the first video frame image are higher than the texture complexity information and scene complexity information of the second video frame image), and the computing power of video encoding of the i-th video frame sequence needs to be reduced.
- When condition 3 is met, there is no need to adjust the computing power for video encoding of the i-th video frame sequence.
- Condition 3: the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is equal to the attribute level of the second image attribute. Since there is no need to adjust the computing power for video encoding of the i-th video frame sequence, the first transformation unit division depth of the i-th video frame sequence can be kept equal to the second transformation unit division depth of the i-1th video frame sequence.
- the first computing power for video encoding of the i-1th video frame sequence is greater than the second computing power threshold and less than the first computing power threshold.
- the attribute level of the second video frame image in the i-1th video frame sequence is equal to the attribute level of the first video frame image in the i-th video frame sequence (the texture complexity information and scene complexity information of the first video frame image are unchanged relative to the texture complexity information and scene complexity information of the second video frame image), and there is no need to adjust the computing power of video encoding of the i-th video frame sequence.
- the computing power for video encoding of the i-th video frame sequence can be increased.
- Condition 4: the first computing power is less than the second computing power threshold.
- Condition 5: the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is less than the attribute level of the second image attribute.
- the first computing power for video encoding on the i-1th video frame sequence is less than the second computing power threshold.
- the computing power for video encoding on the i-th video frame sequence can be increased.
- the first computing power for video encoding of the i-1th video frame sequence is greater than the second computing power threshold and less than the first computing power threshold.
- the attribute level of the second video frame image in the i-1th video frame sequence is greater than the attribute level of the first video frame image in the i-th video frame sequence (the texture complexity information and scene complexity information of the first video frame image are lower than the texture complexity information and scene complexity information of the second video frame image).
- the computing power for video encoding of the i-th video frame sequence can be improved.
- the method provided in the embodiment of the present application when encoding the current video frame sequence, adaptively adjusts the transformation unit division depth of the current video frame sequence according to the server computing power consumed by the previous video frame sequence, the first image attribute of the first video frame image in the current video frame sequence, and the second image attribute of the second video frame image in the previous video frame sequence, so that the adjusted transformation unit division depth can meet the encoding requirements of the current video frame sequence, improve frame output stability, and reduce server deployment costs.
- the video encoding parameters of the i-th video frame sequence include the first coding unit division depth, the first prediction unit division depth, the first transformation unit division depth, the first maximum pixel range, the first sub-pixel estimation complexity and the first search range.
- Step S160 includes sub-steps S1611 to S1619. Specifically:
- the target reference image is obtained by encoding the previous video frame image of the target video frame image.
- K and L are both integers greater than or equal to 1.
- the target inter-frame prediction image includes K × L target inter-frame prediction units.
- S1616 Generate a residual image according to the target video frame image and the target inter-frame prediction image.
- video encoding of the i-th video frame sequence requires encoding all video frame images in the i-th video frame sequence.
- the embodiment of the present application is illustrated by taking encoding of any video frame image in the i-th video frame sequence as an example.
- Figure 13 is a schematic diagram of encoding a target video frame image provided by an embodiment of the present application.
- Any video frame image is obtained from the i-th video frame sequence as the target video frame image, and the image obtained after encoding the previous video frame image of the target video frame image is used as the target reference image of the target video frame image.
- the target video frame image is sent to the encoder and divided according to the first coding unit division depth to obtain K first coding units; the K first coding units are each divided according to the first prediction unit division depth to obtain K × L first prediction units.
- the target reference image is sent to the encoder and divided according to the first coding unit division depth to obtain K reference coding units; the K reference coding units are each divided according to the first prediction unit division depth to obtain K × L reference prediction units; each reference coding unit corresponds to a first coding unit, and each reference prediction unit corresponds to a first prediction unit.
- motion estimation processing is performed on the K × L first prediction units and the K × L reference prediction units to generate a first motion estimation image composed of K × L first motion estimation units.
- Subtract the first motion estimation image (including K × L first motion estimation units) from the target video frame image (including K × L first prediction units) to obtain a residual image. Divide the residual image into transform units according to the first transform unit division depth to generate a transform image. Quantize the transform image to generate residual coefficients. Input the residual coefficients into an entropy coding module for entropy coding to generate a coding value of the target video frame image. The coding value is used to represent the encoded target video frame image.
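- The residual and quantization steps described above can be illustrated with a minimal sketch. It is not the encoder of this application: the transform step and entropy coding are omitted, images are plain nested lists, and the function names `residual_image` and `quantize` as well as the quantization step size are assumptions made only for illustration.

```python
def residual_image(target, prediction):
    """Subtract the motion estimation (prediction) image from the target image."""
    same_shape = len(target) == len(prediction) and all(
        len(t_row) == len(p_row) for t_row, p_row in zip(target, prediction)
    )
    if not same_shape:
        raise ValueError("target and prediction must have the same shape")
    return [[t - p for t, p in zip(t_row, p_row)]
            for t_row, p_row in zip(target, prediction)]


def quantize(block, step):
    """Uniform scalar quantization (hypothetical step size); int() truncates toward zero."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [[int(value / step) for value in row] for row in block]


# Toy 2 x 2 "images"; the values are illustrative only.
target = [[10, 12], [14, 16]]
prediction = [[8, 12], [15, 10]]
residual = residual_image(target, prediction)
coefficients = quantize(residual, step=2)
```

- In a real encoder the residual would first be transformed per transform unit before quantization; the sketch only shows how the subtraction and the lossy quantization relate to each other.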
- the method provided in the embodiment of the present application encodes each video frame image in the i-th video frame sequence according to the adjusted video encoding parameters (including the first coding unit division depth, the first prediction unit division depth, the first transform unit division depth, the first maximum pixel range, the first sub-pixel estimation complexity and the first search range) to obtain the coding value corresponding to each video frame image, thereby realizing the video encoding process of the i-th video frame sequence.
- video encoding is performed on each video frame sequence in the target video to obtain the encoded video corresponding to the target video.
- sub-step S1618 further includes sub-steps S1621 to S1624. Specifically:
- the deblocking filter is used to perform horizontal filtering on the vertical edges in the reconstructed image, and to perform vertical filtering on the horizontal edges in the reconstructed image.
- S1624 Process the first filtered image through a sampling adaptive offset filter to generate a reference image corresponding to the target video frame image.
- the reference image is used to encode the next frame image of the target video frame image, and the sampling adaptive offset filter is used to perform band offset and edge offset on the first filtered image.
- the reference image when encoding the target video frame image is the image obtained after encoding the previous video frame image of the target video frame image
- the reference image when encoding the next video frame image of the target video frame image is the image obtained after encoding the target video frame image.
- Figure 15 is a schematic diagram of the coding framework provided by an embodiment of the present application. After the residual coefficients are inversely quantized and inversely transformed, the reconstructed image residual coefficients are generated. The reconstructed image residual coefficients are added to the target inter-frame prediction image to obtain a reconstructed image.
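- The reconstruction path described above (inverse quantization, then adding the result to the prediction image) can be sketched as follows. This is an assumption-laden illustration, not the coding framework of Figure 15: the inverse transform is omitted, the names `dequantize` and `reconstruct` are hypothetical, and clipping to 0..255 assumes 8-bit samples.

```python
def dequantize(coefficients, step):
    """Inverse of uniform scalar quantization: scale each coefficient back by the step."""
    return [[c * step for c in row] for row in coefficients]


def reconstruct(prediction, residual, max_value=255):
    """Add the reconstructed residual to the prediction and clip to the sample range."""
    return [[min(max_value, max(0, p + r)) for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(prediction, residual)]


# Toy 2 x 2 data; the values and the step size are illustrative only.
prediction = [[8, 12], [15, 250]]
coefficients = [[1, 0], [-1, 3]]
reconstructed_residual = dequantize(coefficients, step=2)
reconstructed = reconstruct(prediction, reconstructed_residual)
```

- The reconstructed image, after the filtering stages described in this section, is what would serve as the reference image for the next frame.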
- a reference image corresponding to the target video frame image is generated.
- the reference image corresponding to the target video frame image enters the reference frame queue as the reference image of the next frame, and the subsequent frames are then encoded in sequence.
- Intra-frame prediction selection is performed based on the target video frame image and the reconstructed image to generate an intra-frame prediction selection image; intra-frame prediction is performed based on the intra-frame prediction selection image and the reconstructed image to obtain an intra-frame prediction image.
- the method provided in the embodiment of the present application uses the reference image generated according to the target video frame image as the reference image in the encoding process of the next frame, thereby improving the video encoding process of the i-th video frame sequence and improving the frame output stability.
- the video encoding parameter includes a processing cancellation message; step S150 further includes sub-step S1551. Specifically:
- video encoding also includes other processing processes, including pre-processing and post-processing.
- Pre-processing includes denoising, sharpening, and time domain filtering
- post-processing includes loop filtering and film grain (AV1 Film Grain), etc.
- Loop filtering includes deblocking filtering (Deblocking, DB), adaptive loop filtering (Adaptive Loop Filter, ALF), sample adaptive offset (Sample Adaptive Offset, SAO), etc. These processes consume a certain amount of server computing power.
- Condition one: the first computing power is greater than the first computing power threshold.
- Condition two: the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is greater than the attribute level of the second image attribute.
- the computing power when performing video encoding on the i-th video frame sequence is reduced by reducing the pre-processing process and/or the post-processing process. Specifically, this can be achieved by canceling one or more of the denoising process, the sharpening process, and the time domain filtering process.
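- A minimal sketch of this cancellation rule, under stated assumptions: the step names, the choice of which step to cancel, and the function name `select_pre_processing` are hypothetical; attribute levels are modeled as comparable integers.

```python
PRE_PROCESSING = ("denoise", "sharpen", "temporal_filter")


def select_pre_processing(first_power, first_level, second_level,
                          first_threshold, second_threshold,
                          cancel=("denoise",)):
    """Return the pre-processing steps to keep for the i-th video frame sequence.

    Steps listed in `cancel` are dropped when condition one or condition two holds.
    """
    condition_one = first_power > first_threshold
    condition_two = (second_threshold < first_power < first_threshold
                     and first_level > second_level)
    if condition_one or condition_two:
        return tuple(step for step in PRE_PROCESSING if step not in cancel)
    return PRE_PROCESSING


# Example with the thresholds 1000 and 800 used elsewhere in this description.
kept = select_pre_processing(1100, 1, 1, first_threshold=1000, second_threshold=800)
```

- Which of the three steps to cancel first is a policy choice that the text leaves open; the default above is only an example.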
- when encoding the current video frame sequence, the method provided in the embodiment of the present application adaptively adjusts the processing process of the current video frame sequence according to the server computing power consumed by the previous video frame sequence, the first image attribute of the first video frame image in the current video frame sequence, and the second image attribute of the second video frame image in the previous video frame sequence, so that the adjusted processing process can meet the encoding requirements of the current video frame sequence, improve the frame output stability, and reduce the server deployment cost.
- step S120 further includes steps S121 to S123. Specifically:
- the first video frame image and the second video frame image can be respectively input into the picture scene classification model.
- the picture scene classification model analyzes and processes the input first video frame image and the second video frame image, and outputs the first scene complexity information of the first video frame image and the second scene complexity information of the second video frame image.
- the picture scene classification model is a feature extraction classification model based on deep learning and neural networks, which recognizes and classifies picture scenes and divides them into simple scenes, general scenes, medium scenes and complex scenes.
- the algorithms in the model include CNN, decision tree, random forest, etc.
- deep learning CNN, CNN+RNN and other algorithm frameworks are used to recognize video frame images, and video frame images are used as training samples for deep learning related models.
- for scenes with obvious picture features, the accuracy of scene recognition through a pure CNN model (convolutional layers, filtering, pooling layers, etc.) can reach more than 99%; for scenes with relatively scattered picture features (such as TV drama scenes, outdoor sports scenes, food scenes, travel scenes, etc.), the accuracy of scene recognition through CNN combined with RNN+LSTM for time domain + frequency domain analysis can reach about 90%.
- S122 Determine first texture complexity information of the first video frame image and second texture complexity information of the second video frame image according to the first video frame image and the second video frame image through a picture texture classification model.
- the first video frame image and the second video frame image can be input into the picture texture classification model respectively.
- the picture texture classification model analyzes and processes the input first video frame image and the second video frame image, and outputs the first texture complexity information of the first video frame image and the second texture complexity information of the second video frame image.
- Texture complexity information includes simple texture, general texture, medium texture and complex texture.
- Texture complexity information can be analyzed by methods such as Euclidean distance, statistical histograms, the LBP detection algorithm, CNN feature extraction and classification algorithms, and image complexity estimation based on edge features (commonly used edge detection operators include Canny, Sobel and Roberts); this application does not limit the method.
- LBP refers to the local binary pattern, which is an operator used to describe the local features of an image.
- the LBP feature has significant advantages such as grayscale invariance and rotation invariance.
- the original LBP operator is defined on a 3 × 3 window, with the center pixel of the window as the threshold, against which the grayscale values of the 8 adjacent pixels are compared. If a surrounding pixel value is greater than the center pixel value, the position of that pixel is marked as 1, otherwise as 0.
- after comparison, the 8 pixels in the 3 × 3 neighborhood generate an 8-bit binary number (usually converted to a decimal number, namely the LBP code, with 256 possible values), which is the LBP value of the center pixel of the window; this value is used to reflect the texture information of the area.
- the detection area can be automatically adjusted according to the video screen resolution and computing power, or the screen resolution can be adjusted by downsampling.
- the LBP values of the detected areas of the picture are aggregated. For example, depending on where 80% of the LBP values are concentrated, values below 50 indicate simple texture, values between 50 and 100 general texture, values between 100 and 150 medium texture, and values greater than 150 complex texture.
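- The LBP operator and the example thresholds above can be sketched in plain Python. Several details are assumptions not fixed by the text: the clockwise bit order starting at the top-left neighbor, the handling of the boundary values 50, 100 and 150, and the reading of "80% concentrated" as "at least 80% of the values fall into one band". The function names are hypothetical.

```python
def lbp_code(window):
    """LBP code (0..255) of the center pixel of a 3 x 3 grayscale window."""
    center = window[1][1]
    # Neighbors clockwise from the top-left corner (bit order is an assumption).
    neighbours = [window[0][0], window[0][1], window[0][2], window[1][2],
                  window[2][2], window[2][1], window[2][0], window[1][0]]
    code = 0
    for value in neighbours:
        # A neighbor strictly greater than the center contributes a 1 bit.
        code = (code << 1) | (1 if value > center else 0)
    return code


def lbp_values(image):
    """LBP codes of all interior pixels of a grayscale image (list of rows)."""
    height, width = len(image), len(image[0])
    return [lbp_code([row[x - 1:x + 2] for row in image[y - 1:y + 2]])
            for y in range(1, height - 1) for x in range(1, width - 1)]


def texture_band(value):
    """Map one LBP value to a texture class using the example thresholds."""
    if value < 50:
        return "simple"
    if value < 100:
        return "general"
    if value <= 150:
        return "medium"
    return "complex"


def dominant_texture(values, share=0.8):
    """Return the band holding at least `share` of the values, or None."""
    counts = {}
    for value in values:
        band = texture_band(value)
        counts[band] = counts.get(band, 0) + 1
    for band, count in counts.items():
        if count >= share * len(values):
            return band
    return None
```

- For example, `dominant_texture(lbp_values(image))` would give a texture class for a downsampled detection area, or `None` when no single band dominates; how to handle that case is left open here.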
- scene complexity information and the texture complexity information are used as image attributes, and the complexity of the video frame image is represented by the image attributes to match the encoding parameters.
- the method provided in the embodiment of the present application determines the scene complexity information of the video frame image by using a picture scene classification model, and determines the texture complexity information of the video frame image by using a picture texture classification model, and then determines the image attributes based on the determined scene complexity information and texture complexity information, thereby ensuring the accuracy and reliability of the determined image attributes, which is conducive to accurately adjusting the video encoding parameters according to the image attributes in the subsequent process.
- step S160 further includes steps S161 to S165. Specifically:
- the i-th video frame sequence and the i+1-th video frame sequence are adjacent to each other in the target video.
- the third video frame image corresponds to a third image attribute.
- steps S161 to S165 are the process of encoding the i+1th video frame sequence.
- the computing power consumed by encoding the i-th video frame sequence is calculated.
- for the third video frame image in the i+1th video frame sequence: the third video frame image may be an IDR frame in the i+1th video frame sequence; it corresponds to a third image attribute, and the third image attribute is used to characterize the texture complexity information and scene complexity information of the third video frame image.
- Encoding the i+1th video frame sequence according to the video encoding parameters of the i+1th video frame sequence specifically means that all video frame images in the i+1th video frame sequence are encoded according to the video encoding parameters of the i+1th video frame sequence to obtain an encoded image of each video frame image, and all the encoded images constitute the i+1th encoded video segment.
- the video processing method provided by the embodiment of the present application performs video encoding on the input video. If the scene in the input video is unique and fixed, the video encoding parameters are matched according to the scene, and the input video is encoded by the matched video encoding parameters, so that the encoding parameters can meet the encoding requirements of the current video frame sequence, improve the frame output stability, and reduce the server deployment cost. If the input video includes multiple scenes, the input video is segmented according to the scene changes to obtain N video frame sequences, and the encoding task of the input video is decomposed into encoding the N video frame sequences contained in the input video respectively.
- the video encoding parameters of the current video frame sequence are adaptively adjusted, so that the adjusted encoding parameters can meet the encoding requirements of the current video frame sequence, improve the frame output stability, and reduce the server deployment cost.
- step S110 includes sub-steps S1101 to S1103. Specifically:
- S1102 Perform scene recognition on the input video using a scene recognition model to obtain N scenes.
- the scene recognition model is used to identify the scenes appearing in the input video.
- the scene recognition model can be a feature extraction and classification model based on deep learning and neural networks. Commonly used algorithms include CNN, decision tree, random forest, etc. Taking deep learning as an example, deep learning CNN, CNN+RNN and other algorithm frameworks are used to generate images for video frames, and the images generated by video frames are used as training samples for deep learning related models.
- the input video contains three scenes: a meeting scene, a theater scene, and a swimming scene.
- the scene complexity of the meeting scene is simple
- the scene complexity of the theater scene is medium
- the scene complexity of the swimming scene is complex.
- the input video is segmented by the scene recognition model as follows: the input video is used as the input of the trained scene recognition model, and the scene recognition model recognizes that the input video contains three scenes, outputs three sub-video segments corresponding to the three scenes, and represents each sub-video segment as a video frame sequence.
- the video processing method provided in the embodiment of the present application uses a scene recognition model to identify the scenes included in the input video, and divides the input video according to those scenes to obtain N video frame sequences, thereby ensuring that the video frame sequence segmentation is reasonable: the video frame images included in each video frame sequence correspond to the same scene, so the computing power consumed when encoding each video frame sequence will not fluctuate significantly, that is, the video encoding parameters corresponding to each video frame sequence can be better applied to encoding each video frame image in that video frame sequence.
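- The segmentation step can be sketched as grouping consecutive frames that share a scene label. The sketch assumes that a scene recognition model has already produced one label per frame; the function name `split_by_scene` and the label strings are hypothetical.

```python
from itertools import groupby


def split_by_scene(frame_scenes):
    """Group consecutive frame indices that share the same scene label.

    frame_scenes: list of scene labels, one per frame, in display order.
    Returns a list of (scene, [frame indices]) pairs, one per video frame sequence.
    """
    sequences = []
    for scene, group in groupby(enumerate(frame_scenes), key=lambda item: item[1]):
        sequences.append((scene, [index for index, _ in group]))
    return sequences


# Labels mirroring the meeting / theater / swimming example in the text.
labels = ["meeting", "meeting", "theater", "swimming", "swimming"]
sequences = split_by_scene(labels)
```

- Because `groupby` only merges adjacent equal labels, a scene that reappears later in the video yields a separate video frame sequence, which matches segmentation by scene change.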
- Figure 19 is a schematic flow chart of a video processing method provided in an embodiment of the present application.
- the input video is segmented according to scene changes in the input video to obtain N video frame sequences (GOPs), and the encoding parameters of each GOP are adjusted with GOP as the minimum granularity.
- the i-th video frame sequence and the i-1-th video frame sequence are obtained from the N video frame sequences, wherein the i-th video frame sequence and the i-1-th video frame sequence are adjacent in the input video.
- the first video frame image corresponds to the first image attribute
- the second video frame image corresponds to the second image attribute
- the image attribute is used to characterize the texture complexity information and scene complexity information of the video frame image.
- the first video frame image can be an IDR frame in the i-th video frame sequence
- the second video frame image can be an IDR frame in the i-1-th video frame sequence.
- Each video frame image corresponds to an image attribute, and the image attribute is used to characterize the texture complexity information and scene complexity information of the video frame image.
- a first computing power corresponding to the i-1th video frame sequence is obtained, wherein the first computing power is used to represent the server computing power consumed when performing video encoding on the i-1th video frame sequence.
- the video encoding parameters of the i-th video frame sequence are adjusted on the basis of the video encoding parameters of the i-1th video frame sequence:
- the first computing power threshold refers to a computing power threshold provided by the server for video encoding. For example, if the first computing power threshold provided by the server for video encoding is 1000, and the first computing power corresponding to the i-1-th video frame sequence exceeds 1000, the computing power needs to be reduced when encoding the i-th video frame sequence. In order to reduce the computing power when encoding the i-th video frame sequence, the video encoding parameters of the i-th video frame sequence need to be lowered.
- the video encoding parameters of the i-th video frame sequence are lowered, wherein the first computing power threshold is greater than the second computing power threshold.
- the first computing power threshold provided by the server for video encoding is 1000
- the second computing power threshold is 800
- the first computing power for video encoding of the i-1-th video frame sequence is 900
- the first computing power is greater than the second computing power threshold and less than the first computing power threshold.
- the larger the attribute level (the higher the texture complexity and/or the higher the scene complexity), the higher the required computing power.
- the attribute level of the first image attribute is greater than the attribute level of the second image attribute, indicating that the computing power required for video encoding of the i-th video frame sequence will be greater than the computing power for video encoding of the i-1-th video frame sequence, and the computing power for video encoding of the i-th video frame sequence needs to be reduced.
- the video encoding parameters of the i-th video frame sequence need to be lowered.
- the video encoding parameters of the i-th video frame sequence are kept the same as the video encoding parameters of the i-1-th video frame sequence.
- the first computing power threshold provided by the server for video encoding is 1000
- the second computing power threshold is 800
- the first computing power of the video encoding of the i-1-th video frame sequence is 900.
- the first computing power is greater than the second computing power threshold and less than the first computing power threshold.
- the attribute level of the first image attribute is equal to the attribute level of the second image attribute, which means that the computing power required for video encoding of the i-th video frame sequence will be equal to or close to the computing power for video encoding of the i-1-th video frame sequence, without adjusting the video encoding parameters of the i-th video frame sequence.
- the video encoding parameters of the i-th video frame sequence are adjusted upward.
- the first computing power threshold provided by the server for video encoding is 1000
- the second computing power threshold is 800
- the first computing power for video encoding of the i-1th video frame sequence is 900
- the first computing power is greater than the second computing power threshold and less than the first computing power threshold.
- the larger the attribute level (the higher the texture complexity and/or the higher the scene complexity), the higher the required computing power.
- the attribute level of the first image attribute is less than the attribute level of the second image attribute, which means that the computing power required for video encoding of the i-th video frame sequence will be less than the computing power for video encoding of the i-1th video frame sequence.
- the video encoding parameters of the i-th video frame sequence can be adjusted upward.
- the video encoding parameters of the i-th video frame sequence are adjusted upward.
- the second computing power threshold provided by the server for video encoding is 800
- the first computing power for video encoding of the i-1-th video frame sequence is 700
- the first computing power is less than the second computing power threshold.
- the video encoding parameters of the i-th video frame sequence can be adjusted upward.
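- The threshold cases above can be summarized in a small decision function. This is a sketch under assumptions: attribute levels are modeled as comparable integers, a first computing power exactly equal to a threshold is treated as lying between the thresholds (the text does not specify this), and the names `Adjustment` and `decide_adjustment` are hypothetical. The defaults 1000 and 800 are the example thresholds from the text.

```python
from enum import Enum


class Adjustment(Enum):
    LOWER = "lower"
    KEEP = "keep"
    RAISE = "raise"


def decide_adjustment(first_power, first_level, second_level,
                      first_threshold=1000, second_threshold=800):
    """Decide how to adjust the parameters of the i-th video frame sequence.

    first_power: computing power consumed by the (i-1)-th sequence.
    first_level / second_level: attribute levels of the i-th / (i-1)-th sequence.
    """
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must exceed second threshold")
    if first_power > first_threshold:
        return Adjustment.LOWER
    if first_power < second_threshold:
        return Adjustment.RAISE
    # Between the thresholds: compare the image attribute levels.
    if first_level > second_level:
        return Adjustment.LOWER
    if first_level < second_level:
        return Adjustment.RAISE
    return Adjustment.KEEP
```

- With the example figures, `decide_adjustment(900, 2, 2)` keeps the parameters unchanged, while `decide_adjustment(700, 3, 1)` raises them regardless of the attribute levels.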
- all video frame images in the ith video frame sequence are encoded to obtain an encoded image of each video frame image, and all the encoded images constitute the ith encoded video segment.
- the order of adjusting the video encoding parameters of the i-th video frame sequence is:
- the first coding unit division depth of the i-th video frame sequence is adjusted according to the second coding unit division depth, wherein the first coding unit division depth is lower than the second coding unit division depth.
- the first coding unit division depth of the i-th video frame sequence is kept equal to the second coding unit division depth of the i-1-th video frame sequence.
- the first coding unit division depth of the i-th video frame sequence is adjusted according to the second coding unit division depth, wherein the first coding unit division depth is higher than the second coding unit division depth.
- a second motion estimation parameter and a second motion compensation parameter of the (i-1)th video frame sequence are obtained.
- the first motion estimation parameter of the i-th video frame sequence is adjusted according to the second motion estimation parameter
- the first motion compensation parameter of the i-th video frame sequence is adjusted according to the second motion compensation parameter.
- the first motion estimation parameter is determined by controlling the first maximum pixel range and the first sub-pixel estimation complexity of the motion search
- the second motion estimation parameter is determined by controlling the second maximum pixel range and the second sub-pixel estimation complexity of the motion search; the first maximum pixel range is smaller than the second maximum pixel range, and the first sub-pixel estimation complexity is smaller than the second sub-pixel estimation complexity.
- the first motion compensation parameter is determined by the first search range, and the second motion compensation parameter is determined by the second search range; the first search range is smaller than the second search range.
- the first motion estimation parameter of the i-th video frame sequence is kept equal to the second motion estimation parameter of the i-1-th video frame sequence
- the first motion compensation parameter of the i-th video frame sequence is kept equal to the second motion compensation parameter of the i-1-th video frame sequence.
- the first maximum pixel range is equal to the second maximum pixel range
- the first sub-pixel estimation complexity is equal to the second sub-pixel estimation complexity.
- the first search range is equal to the second search range.
- the first motion estimation parameter of the i-th video frame sequence is adjusted according to the second motion estimation parameter
- the first motion compensation parameter of the i-th video frame sequence is adjusted according to the second motion compensation parameter.
- the first maximum pixel range is greater than the second maximum pixel range
- the first sub-pixel estimation complexity is greater than the second sub-pixel estimation complexity.
- the first search range is greater than the second search range.
- the first transformation unit division depth of the i-th video frame sequence is adjusted according to the second transformation unit division depth, wherein the first transformation unit division depth is lower than the second transformation unit division depth.
- the first transformation unit division depth of the i-th video frame sequence is kept equal to the second transformation unit division depth of the i-1-th video frame sequence.
- the first transformation unit division depth of the i-th video frame sequence is adjusted according to the second transformation unit division depth, wherein the first transformation unit division depth is higher than the second transformation unit division depth.
- the first prediction unit division depth of the i-th video frame sequence is adjusted according to the second prediction unit division depth, wherein the first prediction unit division depth is lower than the second prediction unit division depth.
- the first prediction unit division depth of the i-th video frame sequence is kept equal to the second prediction unit division depth of the i-1-th video frame sequence.
- the first prediction unit division depth of the i-th video frame sequence is adjusted according to the second prediction unit division depth, wherein the first prediction unit division depth is higher than the second prediction unit division depth.
- one or more of the denoising processing, the sharpening processing and the time domain filtering processing are canceled.
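- The lower / keep / raise adjustments listed above can be sketched with one parameter record. The text only requires each first parameter to be lower than, equal to, or higher than the corresponding second parameter; the unit step of 1, the clamp at 0, and the names `EncodingParams` and `adjust` are assumptions for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EncodingParams:
    cu_depth: int            # coding unit division depth
    pu_depth: int            # prediction unit division depth
    tu_depth: int            # transformation unit division depth
    max_pixel_range: int     # maximum pixel range of the motion search
    subpel_complexity: int   # sub-pixel estimation complexity
    search_range: int        # search range for motion compensation


def adjust(previous, direction):
    """Derive the parameters of the i-th sequence from those of the (i-1)-th.

    direction: -1 to lower, 0 to keep, +1 to raise every parameter by one step.
    Values are clamped at 0, so a parameter already at 0 cannot be lowered further.
    """
    if direction not in (-1, 0, 1):
        raise ValueError("direction must be -1, 0 or 1")
    values = {f.name: max(0, getattr(previous, f.name) + direction)
              for f in fields(previous)}
    return EncodingParams(**values)


# Hypothetical parameters of the (i-1)-th video frame sequence.
second = EncodingParams(cu_depth=3, pu_depth=2, tu_depth=2,
                        max_pixel_range=16, subpel_complexity=2, search_range=32)
first_lowered = adjust(second, -1)
```

- A practical encoder would likely adjust parameters one at a time, in the order given in this section, and stop once the expected computing power is back within the thresholds; the uniform step here is only the simplest possible instance.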
- the solution provided by the embodiment of the present application automatically analyzes the complexity of video scene texture and scene classification during the video encoding pre-analysis process, and adaptively cuts the processes with relatively large computing power consumption such as coding unit division, MC, ME, transformation, preprocessing, and lookahead according to the video encoding scene switching detection, as well as the video picture texture and scene classification evaluation.
- This is a coding kernel computing power smoothing solution that keeps video encoding computing power balanced at the cost of a certain loss in video picture BD-rate.
- the server computing power can be controlled smoothly while the BD-rate loss is kept within a certain limit, which improves the server computing power load by 5-10 points, greatly reduces the transcoding cost of video cloud media processing, and helps video cloud users reduce costs and increase efficiency in media processing transcoding.
- Figure 20 is a schematic diagram of an embodiment of a video processing device 10 in an embodiment of the present application, and the video processing device 10 includes:
- the video frame sequence acquisition module 110 is used to acquire N video frame sequences of the input video, wherein each video frame sequence includes at least one video frame image, and N is an integer greater than 1.
- the video frame sequence extraction module 120 is used to obtain the ith video frame sequence and the i-1th video frame sequence from the N video frame sequences, wherein the ith video frame sequence and the i-1th video frame sequence are adjacent in the target video, and i is an integer greater than 1.
- the video frame image acquisition module 130 is used to acquire a first video frame image from the i-th video frame sequence and acquire a second video frame image from the i-1-th video frame sequence, wherein the first video frame image corresponds to a first image attribute and the second video frame image corresponds to a second image attribute.
- the computing power acquisition module 140 is used to acquire a first computing power corresponding to the i-1th video frame sequence, wherein the first computing power is used to represent the computing power consumed when encoding and/or decoding the i-1th video frame sequence.
- the encoding parameter determination module 150 is used to determine the encoding parameter of the i-th video frame sequence according to at least one of the first computing power, the first image attribute and the second image attribute.
- the video processing device 10 further includes:
- the video frame sequence encoding module 160 is used to encode the ith video frame sequence according to the encoding parameters of the ith video frame sequence to obtain the ith encoded video segment.
- the video processing device performs video encoding on an input video. If the scene in the input video is unique and fixed, the video encoding parameters are matched according to the scene, and the input video is encoded with the matched video encoding parameters, so that the encoding parameters can meet the encoding requirements of the current video frame sequence, thereby improving the frame output stability and reducing the server deployment cost. If the input video includes multiple scenes, the input video is segmented according to the scene changes to obtain N video frame sequences, and the encoding task of the input video is decomposed into encoding the N video frame sequences contained in the input video respectively.
- the encoding parameters of the current video frame sequence are adaptively adjusted, so that the adjusted encoding parameters can meet the encoding requirements of the current video frame sequence, improve the frame output stability, and reduce the server deployment cost.
- the encoding parameter includes the coding unit division depth.
- the encoding parameter determination module 150 includes a coding unit division depth adjustment submodule 151, and the coding unit division depth adjustment submodule 151 is used to:
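- when the first computing power is greater than the first computing power threshold, or when the first computing power is greater than the second computing power threshold and less than the first computing power threshold and the attribute level of the first image attribute is greater than the attribute level of the second image attribute, the first coding unit division depth of the i-th video frame sequence is adjusted, according to the second coding unit division depth, to be lower than the second coding unit division depth.
- when the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is equal to the attribute level of the second image attribute, the first coding unit division depth of the i-th video frame sequence is kept equal to the second coding unit division depth of the (i-1)-th video frame sequence.
- when the first computing power is less than the second computing power threshold, or when the first computing power is greater than the second computing power threshold and less than the first computing power threshold and the attribute level of the first image attribute is less than the attribute level of the second image attribute, the first coding unit division depth of the i-th video frame sequence is adjusted, according to the second coding unit division depth, to be higher than the second coding unit division depth.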
- the encoding parameter includes the prediction unit division depth.
- the encoding parameter determination module 150 includes a prediction unit division depth adjustment submodule 152, and the prediction unit division depth adjustment submodule 152 is used to:
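- when the first computing power is greater than the first computing power threshold, or when the first computing power is greater than the second computing power threshold and less than the first computing power threshold and the attribute level of the first image attribute is greater than the attribute level of the second image attribute, the first prediction unit division depth of the i-th video frame sequence is adjusted, according to the second prediction unit division depth, to be lower than the second prediction unit division depth.
- when the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is equal to the attribute level of the second image attribute, the first prediction unit division depth of the i-th video frame sequence is kept equal to the second prediction unit division depth of the (i-1)-th video frame sequence.
- when the first computing power is less than the second computing power threshold, or when the first computing power is greater than the second computing power threshold and less than the first computing power threshold and the attribute level of the first image attribute is less than the attribute level of the second image attribute, the first prediction unit division depth of the i-th video frame sequence is adjusted, according to the second prediction unit division depth, to be higher than the second prediction unit division depth.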
- the encoding parameters include motion estimation parameters and motion compensation parameters.
- the encoding parameter determination module 150 includes a motion estimation parameter and motion compensation parameter adjustment submodule 153, and the motion estimation parameter and motion compensation parameter adjustment submodule 153 is used to:
- a second motion estimation parameter and a second motion compensation parameter of the (i-1)th video frame sequence are obtained.
- the first motion estimation parameter of the i-th video frame sequence is adjusted according to the second motion estimation parameter
- the first motion compensation parameter of the i-th video frame sequence is adjusted according to the second motion compensation parameter
- the first motion estimation parameter is determined by controlling the first maximum pixel range and the first sub-pixel estimation complexity of the motion search
- the second motion estimation parameter is determined by controlling the second maximum pixel range and the second sub-pixel estimation complexity of the motion search; the first maximum pixel range is smaller than the second maximum pixel range, and the first sub-pixel estimation complexity is smaller than the second sub-pixel estimation complexity.
- the first motion compensation parameter is determined by the first search range, and the second motion compensation parameter is determined by the second search range; the first search range is smaller than the second search range.
- the first motion estimation parameter of the i-th video frame sequence is kept equal to the second motion estimation parameter of the i-1-th video frame sequence
- the first motion compensation parameter of the i-th video frame sequence is kept equal to the second motion compensation parameter of the i-1-th video frame sequence.
- the first maximum pixel range is equal to the second maximum pixel range
- the first sub-pixel estimation complexity is equal to the second sub-pixel estimation complexity
- the first search range is equal to the second search range.
- the first motion estimation parameter of the i-th video frame sequence is adjusted according to the second motion estimation parameter
- the first motion compensation parameter of the i-th video frame sequence is adjusted according to the second motion compensation parameter
- the first maximum pixel range is larger than the second maximum pixel range
- the first sub-pixel estimation complexity is larger than the second sub-pixel estimation complexity
- the first search range is larger than the second search range.
- the encoding parameter includes a transform unit division depth;
- the encoding parameter determination module 150 includes a transform unit division depth adjustment submodule 154, and the transform unit division depth adjustment submodule 154 is used to:
- when the first computing power is greater than the first computing power threshold, or when the first computing power is greater than the second computing power threshold and less than the first computing power threshold and the attribute level of the first image attribute is greater than the attribute level of the second image attribute, the first transform unit division depth of the i-th video frame sequence is adjusted, according to the second transform unit division depth, to be lower than the second transform unit division depth.
- the first transform unit division depth of the i-th video frame sequence is kept equal to the second transform unit division depth of the (i-1)-th video frame sequence.
- when the first computing power is less than the second computing power threshold, or when the first computing power is greater than the second computing power threshold and less than the first computing power threshold and the attribute level of the first image attribute is less than the attribute level of the second image attribute, the first transform unit division depth of the i-th video frame sequence is adjusted, according to the second transform unit division depth, to be higher than the second transform unit division depth.
- the video encoding parameters of the i-th video frame sequence include a first coding unit division depth, a first prediction unit division depth, a first transform unit division depth, a first maximum pixel range, a first sub-pixel estimation complexity and a first search range; the video frame sequence encoding module 160 is specifically used to:
- a target video frame image and a target reference image of the target video frame image are obtained, wherein the target reference image is obtained by encoding a previous video frame image of the target video frame image;
- K is an integer greater than or equal to 1;
- the residual coefficients are entropy encoded to generate the encoding value of the target video frame image.
- the video frame sequence encoding module 160 is further configured to:
- the first filtered image is processed by a sampling adaptive offset filter to generate a reference image corresponding to the target video frame image, wherein the reference image is used to encode the next frame image of the target video frame image, and the sampling adaptive offset filter is used to perform band offset and edge offset on the first filtered image.
- the device provided in the embodiment of the present application uses the reference image generated according to the target video frame image as the reference image in the encoding process of the next frame, thereby improving the video encoding process of the i-th video frame sequence and improving the frame output stability.
- the coding parameter determination module 150 includes a cancellation message processing submodule 155, which is used to:
- the denoising, sharpening and temporal filtering of the i-th video frame sequence are canceled according to the processing cancellation message.
- the video frame image acquisition module 120 is further used for:
- a first image attribute is generated according to the first scene complexity information and the first texture complexity information
- a second image attribute is generated according to the second scene complexity information and the second texture complexity information.
- the video frame sequence encoding module 160 is further configured to:
- the video frame sequence acquisition module 110 is specifically used for:
- the input video is subjected to scene recognition to obtain N scenes, wherein the scene recognition model is used to recognize scenes appearing in the input video;
- the input video is segmented according to N scenes to obtain N video clips.
- the server 300 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPU) 322 (for example, one or more processors), memory 332, and one or more storage media 330 (for example, one or more mass storage devices) storing application programs 342 or data 344.
- the memory 332 and the storage medium 330 may provide transient or persistent storage.
- the program stored in the storage medium 330 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the server.
- the central processing unit 322 may be configured to communicate with the storage medium 330 and to execute, on the server 300, the series of instruction operations in the storage medium 330.
- the server 300 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input and output interfaces 358, and/or one or more operating systems 341, such as Windows Server™, Mac OS X™, Unix™, Linux™ or FreeBSD™.
- the disclosed systems, devices and methods can be implemented in other ways.
- the device embodiments described above are only schematic.
- the division of units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
- Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be an indirect coupling or communication connection through some interfaces, devices or units, which can be electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
- the above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.
- if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
- the technical solution of the present application, or the part that contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for a computer device (which can be a personal computer, server, or network device, etc.) to perform all or part of the steps of the various embodiments of the present application.
- the aforementioned storage medium includes a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.
Abstract
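This application provides a video processing method and a related device. Embodiments of this application can be applied to the field of cloud computing. The method includes: decomposing the task of encoding an input video into separately encoding the N video frame sequences that make up the input video; and, when encoding the current video frame sequence, adaptively adjusting the encoding parameters of the current video frame sequence according to at least one of the computing power consumed in encoding the previous video frame sequence, a first image attribute of a first video frame image in the current video frame sequence, and a second image attribute of a second video frame image in the previous video frame sequence, so that the adjusted encoding parameters meet the encoding requirements of the current video frame sequence, improving frame output stability and reducing server deployment cost.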
Description
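This application claims priority to Chinese patent application No. 2022112824172, entitled "Video processing method and related device", filed with the China Patent Office on October 19, 2022, the entire contents of which are incorporated herein by reference.
This application relates to the field of data processing technology, and in particular to video processing technology.
With the rapid development of the video industry, video applications are quickly moving toward higher definition and higher frame rates, and demand for video processing keeps growing. Video encoding is the foundation of video processing, and strong encoding capability gives products a high-definition, smooth playback experience.
At present, a video encoding kernel is configured with fixed encoding parameters when it is deployed and encodes every input video source with the same parameters. Encoding parameters affect the frame output stability of the encoding kernel: the more numerous and complex the parameters, the higher the frame output stability, but the more computing power the kernel needs. Scenarios such as live streaming, real-time video communication, cloud rendering and cloud desktop place high demands on frame output stability. Pictures that are static or have little change in motion texture consume relatively little encoding computing power, whereas pictures with complex motion texture and frequent scene switching consume relatively more. When one video contains both kinds of picture and is encoded with the same parameters throughout, a large and complex parameter set leads to high server deployment cost for the static, low-motion pictures, while a small and simple parameter set leaves too little computing power for compressing the complex, frequently switching pictures, so that the frame output stability of the encoding kernel is poor.
How to balance server deployment cost against frame output stability is therefore a problem that urgently needs to be solved.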
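Summary of the Invention
Embodiments of this application provide a video processing method and a related device that can adaptively adjust the encoding parameters of a video so that the adjusted parameters meet the corresponding encoding requirements, improving frame output stability and reducing server deployment cost.
One aspect of this application provides a video processing method, performed by a computer device, including:
obtaining N video frame sequences of an input video, where each video frame sequence includes at least one video frame image and N is an integer greater than 1;
obtaining, from the N video frame sequences, an i-th video frame sequence and the adjacent (i-1)-th video frame sequence, where i is an integer greater than 1;
obtaining a first video frame image from the i-th video frame sequence and a second video frame image from the (i-1)-th video frame sequence, where the first video frame image corresponds to a first image attribute and the second video frame image corresponds to a second image attribute;
obtaining a first computing power corresponding to the (i-1)-th video frame sequence, where the first computing power represents the computing power consumed when encoding and/or decoding the (i-1)-th video frame sequence;
determining the encoding parameters of the i-th video frame sequence according to the first computing power, the first image attribute and the second image attribute.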
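Another aspect of this application provides a video processing device, including:
a video frame sequence acquisition module, configured to obtain N video frame sequences of an input video, where each video frame sequence includes at least one video frame image and N is an integer greater than 1;
a video frame sequence extraction module, configured to obtain, from the N video frame sequences, an i-th video frame sequence and the adjacent (i-1)-th video frame sequence, where i is an integer greater than 1;
a video frame image acquisition module, configured to obtain a first video frame image from the i-th video frame sequence and a second video frame image from the (i-1)-th video frame sequence, where the first video frame image corresponds to a first image attribute and the second video frame image corresponds to a second image attribute;
a computing power acquisition module, configured to obtain a first computing power corresponding to the (i-1)-th video frame sequence, where the first computing power represents the computing power consumed when encoding and/or decoding the (i-1)-th video frame sequence;
a video encoding parameter determination module, configured to determine the encoding parameters of the i-th video frame sequence according to the first computing power, the first image attribute and the second image attribute.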
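Another aspect of this application provides a computer device, including:
a memory, a transceiver, a processor and a bus system;
where the memory is configured to store a program;
the processor is configured to execute the program in the memory, including performing the methods of the above aspects;
the bus system is configured to connect the memory and the processor so that the memory and the processor can communicate.
Another aspect of this application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the methods of the above aspects.
Another aspect of this application provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the above aspects.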
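As can be seen from the above technical solutions, the embodiments of this application have the following advantages:
This application provides a video processing method and a related device. In the method, first, N video frame sequences of an input video are obtained, where each video frame sequence includes at least one video frame image; second, the i-th video frame sequence and the adjacent (i-1)-th video frame sequence are obtained from the N video frame sequences; third, a first video frame image is obtained from the i-th video frame sequence and a second video frame image from the (i-1)-th video frame sequence, where the first video frame image corresponds to a first image attribute and the second video frame image corresponds to a second image attribute; next, a first computing power corresponding to the (i-1)-th video frame sequence is obtained, where the first computing power represents the computing power consumed when encoding and/or decoding the (i-1)-th video frame sequence; then, the encoding parameters of the i-th video frame sequence are determined according to at least one of the first computing power, the first image attribute and the second image attribute, and these encoding parameters are used to encode the i-th video frame sequence. In the video processing method provided by the embodiments of this application, the task of encoding the input video is decomposed into separately encoding the N video frame sequences contained in the input video. When the current video frame sequence (the i-th video frame sequence) is encoded, its encoding parameters are determined adaptively according to at least one of the computing power consumed when encoding and/or decoding the previous video frame sequence (the (i-1)-th video frame sequence), the first image attribute of the first video frame image in the current video frame sequence, and the second image attribute of the second video frame image in the previous video frame sequence. Specifically, with frame output stability in mind, the method can decide, from the amount of computing power consumed when encoding and/or decoding the previous video frame sequence and from the relationship between the first and second image attributes, whether the encoding parameters of the current video frame sequence should stay the same as, be raised above, or be lowered below those of the previous video frame sequence. In this way, the encoding parameters of the current video frame sequence are set on the basis of the computing power consumed by the previous video frame sequence, taking into account the relationship between the image attributes of the video frame images in the two adjacent sequences, so that the determined parameters meet the encoding requirements of the current video frame sequence, improving frame output stability and reducing server deployment cost.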
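FIG. 1 is a schematic architecture diagram of a video processing system provided by an embodiment of this application;
FIG. 2a is a flowchart of a video processing method provided by an embodiment of this application;
FIG. 2b is an architecture diagram of a video processing method provided by an embodiment of this application;
FIG. 3a is a flowchart of a video processing method provided by another embodiment of this application;
FIG. 3b is an architecture diagram of a video processing method provided by another embodiment of this application;
FIG. 4 is a schematic diagram of coding unit depth division provided by an embodiment of this application;
FIG. 5a is a flowchart of a video processing method provided by another embodiment of this application;
FIG. 5b is an architecture diagram of a video processing method provided by another embodiment of this application;
FIG. 6 is a schematic diagram of prediction unit depth division provided by an embodiment of this application;
FIG. 7a is a flowchart of a video processing method provided by another embodiment of this application;
FIG. 7b is an architecture diagram of a video processing method provided by another embodiment of this application;
FIG. 8 is a schematic diagram of motion estimation provided by an embodiment of this application;
FIG. 9 is a schematic diagram of motion compensation provided by an embodiment of this application;
FIG. 10a is a flowchart of a video processing method provided by another embodiment of this application;
FIG. 10b is an architecture diagram of a video processing method provided by another embodiment of this application;
FIG. 11 is a schematic diagram of transform unit depth division provided by an embodiment of this application;
FIG. 12 is a flowchart of a video processing method provided by another embodiment of this application;
FIG. 13 is a schematic diagram of encoding a target video frame image provided by an embodiment of this application;
FIG. 14 is a flowchart of a video processing method provided by another embodiment of this application;
FIG. 15 is a schematic diagram of an encoding framework provided by an embodiment of this application;
FIG. 16 is a flowchart of a video processing method provided by another embodiment of this application;
FIG. 17 is a flowchart of a video processing method provided by another embodiment of this application;
FIG. 18 is a flowchart of a video processing method provided by another embodiment of this application;
FIG. 19 is a flowchart of a video processing method provided by yet another embodiment of this application;
FIG. 20 is a schematic structural diagram of a video processing device provided by an embodiment of this application;
FIG. 21 is a schematic structural diagram of a video processing device provided by another embodiment of this application;
FIG. 22 is a schematic structural diagram of a video processing device provided by another embodiment of this application;
FIG. 23 is a schematic structural diagram of a video processing device provided by another embodiment of this application;
FIG. 24 is a schematic structural diagram of a video processing device provided by another embodiment of this application;
FIG. 25 is a schematic structural diagram of a video processing device provided by yet another embodiment of this application;
FIG. 26 is a schematic structural diagram of a server provided by an embodiment of this application.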
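The terms "first", "second", "third", "fourth" and so on (if present) in the description, the claims and the above drawings of this application are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described here can, for example, be implemented in an order other than the one illustrated or described. In addition, the terms "include" and "correspond to" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, and may include other steps or units that are not explicitly listed or that are inherent to the process, method, product or device.
To make the technical solutions provided by the embodiments of this application easier to understand, some key terms used in the embodiments are explained first: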
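Video encoding: uses compression techniques to convert a file in an original video format into a file in another video format. The most important codec standards for video stream transmission include the ITU's H.261, H.263 and H.264.
H.264 is a new-generation encoding standard known for high compression, high quality and support for streaming over many kinds of network. The H.264 protocol defines three types of frame: a fully encoded frame is an I-frame; a frame that references a preceding I-frame and encodes only the differences is a P-frame; and a frame encoded with reference to both preceding and following frames is a B-frame. The core algorithms of H.264 are intra-frame compression and inter-frame compression: intra-frame compression generates I-frames, and inter-frame compression generates B-frames and P-frames.
In H.264, pictures are organized in sequences. A sequence is the data stream produced by encoding a run of pictures; it starts at one I-frame and ends at the next I-frame. The first picture of a sequence is called an IDR picture (instantaneous decoding refresh picture), and every IDR picture is an I-frame picture. H.264 introduces IDR pictures so that the decoder can resynchronize: when the decoder reaches an IDR picture, it immediately empties the reference frame queue, outputs or discards all decoded data, looks up the parameter sets again and starts a new sequence. If a serious error occurred in the previous sequence, the IDR picture thus provides an opportunity to resynchronize. Pictures after an IDR picture are never decoded using data from pictures before it. A sequence is therefore a data stream generated by encoding a run of pictures whose content does not differ much. When there is little motion, a sequence can be long, because little motion means the picture content changes very little, so one I-frame can be encoded and followed by P-frames and B-frames throughout. When there is a lot of motion, a sequence may be short, for example one I-frame and three or four P-frames.
IDR frame: in video encoding (H.264, H.265, H.266, AV1 and so on), pictures are organized in sequences. The first picture of a sequence is the IDR picture (instantaneous decoding refresh picture), and every IDR picture is an I-frame picture.
I-frame: an intra-coded frame. An I-frame is a key frame and can be understood as a complete copy of that frame's picture; it can be decoded from its own data alone, because it contains the complete picture.
An IDR frame causes the decoded picture buffer (DPB), which holds the reference frames, to be emptied; an ordinary I-frame does not. An IDR frame picture is always an I-frame picture, but an I-frame picture is not necessarily an IDR frame picture. A sequence can contain many I-frame pictures, and pictures after an I-frame picture may use the pictures between themselves and that I-frame picture as motion references.
P-frame: a forward-predicted frame. A P-frame represents the difference between this frame and a preceding key frame (or P-frame); to decode it, the previously buffered picture is combined with the difference defined by this frame to produce the final picture.
Prediction and reconstruction of a P-frame: the P-frame takes an I-frame as its reference frame. The predicted value and motion vector of a given point of the P-frame are found in the I-frame, and the prediction difference is transmitted together with the motion vector. At the receiving end, the predicted value of that point is found in the I-frame according to the motion vector and added to the prediction difference to obtain the sample value of that point, from which the complete P-frame is obtained.
B-frame: a bidirectionally predicted, interpolated frame. A B-frame is a bidirectional difference frame: it records the differences between this frame and the frames before and after it. A B-frame may or may not serve as a reference frame for other B-frames. To decode a B-frame, the decoder needs not only the previously buffered picture but also the decoded following picture, and obtains the final picture by combining both with this frame's data. B-frames achieve a high compression ratio, but decoding them consumes more CPU (central processing unit) resources.
Prediction and reconstruction of a B-frame: the B-frame takes the preceding I-frame or P-frame and the following P-frame as reference frames. The predicted value and two motion vectors of a given point of the B-frame are found, and the prediction difference and the motion vectors are transmitted. The receiving end finds (computes) the predicted value in the two reference frames according to the motion vectors and adds it to the difference to obtain the sample value of that point, from which the complete B-frame is obtained.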
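Macroblock: the basic unit of encoding. A picture to be encoded must be divided into multiple blocks before it can be processed.
Intra prediction: the prediction block is a block formed from previously encoded and reconstructed blocks and the current block.
Intra-frame compression is also called spatial compression. When a frame is compressed, only the data of that frame is considered, not the redundancy between adjacent frames, much as in still-image compression. Intra-frame compression generally uses a lossy algorithm. Because it encodes a complete picture, the result can be decoded and displayed independently. Intra-frame compression generally does not achieve a very high compression ratio.
Inter prediction: mainly comprises motion estimation (motion search method, motion estimation criterion, sub-pixel interpolation and motion vector estimation) and motion compensation; it is temporal reference and predictive interpolation compensation at GOP (group of pictures) granularity.
Inter-frame compression relies on the strong correlation between the data of adjacent frames, or in other words on the fact that the information in two consecutive frames changes very little. Consecutive or adjacent video frames therefore contain redundant information, and compressing this inter-frame redundancy further raises the compression ratio and shrinks the compressed data. Inter-frame compression is also called temporal compression; it compresses by comparing the data of different frames along the time axis.
Inter-frame compression is generally lossless. Frame differencing is a typical temporal compression method: it compares the current frame with an adjacent frame and records only the difference between them, which greatly reduces the amount of data.
SAD: Sum of Absolute Differences, equivalent to SAE (Sum of Absolute Errors).
SATD: Sum of Absolute Transformed Differences, that is, the sum of the absolute values of the residual after it has been transformed with a Hadamard transform (a Hadamard matrix operation).
MC: motion compensation.
ME: motion estimation.
Lookahead: estimates the encoding cost of frames that the main encoder module has not yet analyzed. A configured number of already encoded, reconstructed frames is buffered ahead of the frame currently being evaluated, to serve as inter-prediction references for the frame currently being encoded.
bd-rate: one of the main measures of the performance of a video encoding algorithm. It expresses how the bit rate and the peak signal-to-noise ratio (PSNR) of video encoded with a new algorithm change relative to the original algorithm.
GOP: group of pictures; the interval between two I-frames.
minigop: within a GOP, a certain number of B-frames lie between two P-frames; the interval between two P-frames is one minigop.
Rate-distortion optimization (RDO): many modes are available during encoding. Some give low picture distortion but a high bit rate; others give higher distortion but a low bit rate. The related art seeks the mode that minimizes distortion without exceeding a given maximum bit rate (a constrained extremum, handled with the Lagrange multiplier method).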
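The SAD, SATD and RDO entries above can be illustrated with a minimal sketch. It assumes 4×4 blocks given as nested lists, an unnormalized 4×4 Hadamard transform (real encoders often scale the result), and the common Lagrangian cost J = D + λ·R as one way to handle the constrained minimum; the function names and the `(name, distortion, rate)` tuple layout are hypothetical and do not come from the application.

```python
# 4x4 Hadamard matrix (symmetric, entries +1/-1).
H4 = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]


def _diff(a, b):
    """Element-wise residual between two equally sized blocks."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def _matmul(p, q):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def sad(a, b):
    """Sum of absolute differences."""
    return sum(abs(v) for row in _diff(a, b) for v in row)


def satd4x4(a, b):
    """Sum of absolute values of the Hadamard-transformed residual (no scaling)."""
    transformed = _matmul(_matmul(H4, _diff(a, b)), H4)  # H4 is symmetric
    return sum(abs(v) for row in transformed for v in row)


def choose_mode(candidates, lam):
    """Pick the mode with the smallest Lagrangian cost J = D + lam * R.

    candidates: iterable of (name, distortion, rate_bits) tuples (assumed layout).
    """
    return min(candidates, key=lambda c: c[1] + lam * c[2])[0]
```

A larger λ weights the bit rate more heavily, so the same candidate list can yield a different mode as λ changes.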
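In recent years the video industry has developed rapidly, and video applications are quickly moving toward higher definition and higher frame rates. With the fast growth of video services such as short video, e-commerce live streaming and real-time cloud rendering, demand for video processing keeps increasing. Video encoding is the foundation of video processing; strong encoding capability gives products a high-definition, smooth playback experience and plays an important part in improving quality of experience (QoE) and quality of service (QoS).
Scenarios such as live streaming, real-time communication (RTC), cloud rendering and cloud desktop all place high demands on the frame output stability of the video encoding kernel. The computing power used by the kernel depends on the complexity of the video picture. Pictures that are static or have little change in motion texture are easy to compress and consume relatively little encoding computing power, whereas pictures with complex motion texture consume much more. If the video being compressed has complex texture and frequent scene switching, the computing power consumed by encoding is uneven, and large fluctuations in computing power cause large fluctuations in the CPU consumption of the encoding server. On the one hand, this strongly affects frame output stability in scenarios such as live streaming, real-time communication, cloud rendering and cloud desktop. On the other hand, it raises server deployment cost: when encoding computing power fluctuates widely, orchestration and scheduling must reserve a large computing power buffer to absorb the fluctuation at scene switches. For example, if one server encodes 10 live video streams at the same time, scheduling must try to keep CPU utilization within 50%, so that the server is not overloaded, and frame output does not become unstable, when all 10 streams switch to texture-heavy scenes at once and their encoding computing power rises together.
An existing video encoding kernel has its encoding parameters set when it is deployed (for example encoding complexity, bit rate, number of lookahead reference frames, KEY GOP size, whether B-frames are enabled, rate control mode, ME- and MC-related algorithms, and whether particular preprocessing algorithms are enabled). Once these parameters are set, the processing algorithms and configuration used when video sources are later input for encoding are fixed, for example compute-intensive coding unit division, MC, ME, transform, preprocessing and lookahead. When one video contains both static, low-motion pictures and pictures with complex motion texture and frequent scene switching, and every frame is encoded with the same parameters, a large and complex parameter set leads to high server deployment cost for the static, low-motion pictures, while a small and simple parameter set leaves too little computing power for compressing the complex, frequently switching pictures, so that the frame output stability of the encoding kernel is poor.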
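In the video processing method provided by the embodiments of this application, the task of encoding the input video is decomposed into separately encoding the N video frame sequences contained in the input video. When the current video frame sequence (the i-th video frame sequence) is encoded, its encoding parameters are determined adaptively according to at least one of the computing power consumed when encoding and/or decoding the previous video frame sequence (the (i-1)-th video frame sequence), the first image attribute of the first video frame image in the current video frame sequence, and the second image attribute of the second video frame image in the previous video frame sequence. Specifically, with frame output stability in mind, the method can decide, from the amount of computing power consumed when encoding and/or decoding the previous video frame sequence and from the relationship between the first and second image attributes, whether the encoding parameters of the current video frame sequence should stay the same as, be raised above, or be lowered below those of the previous video frame sequence. In this way, the encoding parameters of the current video frame sequence are set on the basis of the computing power consumed by the previous video frame sequence, taking into account the relationship between the image attributes of the video frame images in the two adjacent sequences, so that the determined parameters meet the encoding requirements of the current video frame sequence, improving frame output stability and reducing server deployment cost.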
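For ease of understanding, refer to FIG. 1, which shows the application environment of the video processing method provided by the embodiments of this application. As shown in FIG. 1, the method is applied to a video processing system that includes a server and a terminal device. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN), and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smartphone, tablet computer, laptop computer, desktop computer, smart speaker or smart watch. The terminal and the server may be connected directly or indirectly by wired or wireless communication, which is not limited in the embodiments of this application.
The server first obtains N video frame sequences of the input video, where each video frame sequence includes at least one video frame image; second, the server obtains the i-th video frame sequence and the adjacent (i-1)-th video frame sequence from the N video frame sequences; third, the server obtains a first video frame image from the i-th video frame sequence and a second video frame image from the (i-1)-th video frame sequence, where the first video frame image corresponds to a first image attribute and the second video frame image corresponds to a second image attribute; next, the server obtains the first computing power corresponding to the (i-1)-th video frame sequence, where the first computing power represents the computing power consumed when encoding the (i-1)-th video frame sequence; then, the server determines the encoding parameters of the i-th video frame sequence according to at least one of the first computing power, the first image attribute and the second image attribute, so that the i-th video frame sequence can be encoded with those parameters.
The video processing method of this application is described below. The method is performed by a computer device, which may be, for example, a server. Refer to FIG. 2a and FIG. 2b: FIG. 2a is a schematic flowchart of the video processing method provided by the embodiments of this application, and FIG. 2b is a schematic diagram of the architecture that implements it. The method includes steps S110 to S160. Specifically: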
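S110. Obtain N video frame sequences of the input video.
Each video frame sequence includes at least one video frame image, and N is an integer greater than 1.
It can be understood that the input video is the video to be processed, that is, the video to be encoded.
It should be noted that, in the embodiments of this application, the computer device that performs the method may first obtain the input video and then segment it to obtain the N video frame sequences. Alternatively, another device may segment the input video into N video frame sequences and transmit them to the computer device that performs the method, so that the computer device obtains the N video frame sequences of the input video. The embodiments of this application place no limitation on how the N video frame sequences are obtained.
The input video may be segmented by a scene recognition model, by a fixed time or number of frames, or manually; the embodiments of this application impose no limitation here.
Segmenting the input video with a scene recognition model means that a trained scene recognition model identifies the N scenes in the input video and the input video is cut at the scene switches to obtain N sub-video segments. Each sub-video segment is represented as a video frame sequence that includes all frame images of the segment. The scene recognition model may be a feature extraction and classification model based on deep learning and neural networks; common algorithms include CNN (convolutional neural network), decision trees and random forests. Taking deep learning as an example, pictures are generated from video frames and used as training samples for deep learning models built on algorithm frameworks such as CNN or CNN+RNN (recurrent neural network).
Segmenting the input video by a fixed time or number of frames means that a segmentation time or segmentation frame count is set in advance and the input video is cut at that interval to obtain several sub-video segments. Each sub-video segment is represented as a video frame sequence that includes all frame images of the segment.
Segmenting the input video manually means that a person takes the scene switches in the video as segmentation points and cuts the input video into N sub-video segments. Each sub-video segment is represented as a video frame sequence that includes all frame images of the segment.
For example, suppose the input video contains 3 scenes: a meeting scene, a theater scene and a swimming scene, where the scene complexity of the meeting scene is simple, that of the theater scene is medium, and that of the swimming scene is complex. With a scene recognition model, the input video is fed to the trained model, which recognizes that the video contains 3 scenes, outputs the 3 corresponding sub-video segments, and represents each as a video frame sequence. With a fixed time or number of frames, the input video is cut, for instance, every 15 seconds into several sub-video segments, so that each segment lasts 15 seconds or less (the last segment of the input video may be shorter than 15 seconds). With manual segmentation, a person cuts the input video at each scene switch, dividing it into 3 sub-video segments, and each segment is represented as a video frame sequence.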
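The fixed-interval and scene-switch segmentation described above can be sketched as index arithmetic over frame numbers. This is only an illustration under assumptions: frames are addressed by index, ranges are half-open `(start, end)`, and the scene-switch positions are assumed to be supplied by some detector or by a person; the function names are hypothetical.

```python
def split_fixed(num_frames, fps, seconds=15):
    """Cut a video of num_frames frames into ranges of at most `seconds` seconds.

    Returns half-open (start, end) frame ranges; the last range may be shorter.
    """
    step = int(fps * seconds)
    if step <= 0:
        raise ValueError("fps * seconds must be at least one frame")
    return [(s, min(s + step, num_frames)) for s in range(0, num_frames, step)]


def split_at_cuts(num_frames, cut_frames):
    """Cut a video at the given scene-switch frame indices.

    cut_frames holds the index of the first frame of each new scene;
    indices outside (0, num_frames) are ignored.
    """
    inner = sorted(c for c in set(cut_frames) if 0 < c < num_frames)
    bounds = [0] + inner + [num_frames]
    return list(zip(bounds[:-1], bounds[1:]))
```

For a 40-second clip at 25 frames per second, `split_fixed(1000, 25)` yields two 15-second ranges and one 10-second remainder, matching the note that the last segment may be shorter.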
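S120. Obtain, from the N video frame sequences, the i-th video frame sequence and the adjacent (i-1)-th video frame sequence.
The i-th video frame sequence and the (i-1)-th video frame sequence are adjacent in the input video, and i is an integer greater than 1.
It can be understood that two video frame sequences that are consecutive in time are obtained from the N video frame sequences; the earlier one is taken as the (i-1)-th video frame sequence and the later one as the i-th video frame sequence. For example, if the input video is divided into 5 video frame sequences, the first and second sequences are obtained, or the second and third, or the third and fourth, or the fourth and fifth.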
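S130. Obtain a first video frame image from the i-th video frame sequence and a second video frame image from the (i-1)-th video frame sequence.
The first video frame image corresponds to a first image attribute and the second video frame image corresponds to a second image attribute. The first and second image attributes each represent the texture complexity information and/or scene complexity information of the corresponding video frame image.
It can be understood that the first video frame image may be the IDR frame of the i-th video frame sequence, and the second video frame image may be the IDR frame of the (i-1)-th video frame sequence. Every video frame image has a corresponding image attribute, which represents its texture complexity information and scene complexity information. It should be understood that the basic image attributes of a video frame image may include pixels, resolution, size, color, bit depth, hue, saturation, brightness, color channels, image layers and so on. In the embodiments of this application, the first and second image attributes may be at least one of these basic attributes, a combination of several of them, or another form of attribute representation derived from one or more of them, provided it can represent the texture complexity information and/or scene complexity information of the corresponding video frame image. In addition, the image attribute of a video frame image may be expressed by the computing resources consumed in decoding that image. In some cases the input video to be encoded is itself obtained by decoding; during that decoding, the computing resources consumed for each video frame image can be recorded and the image attribute determined from them. It should be understood that the more computing resources are consumed in decoding a video frame image, the higher its texture complexity and/or scene complexity. In other words, the first and second image attributes only need to represent the texture complexity information and/or scene complexity information of the corresponding video frame image; the embodiments of this application do not limit their form.
It should be noted that the first image attribute may be obtained by recognition or extraction from the first video frame image after that image has been obtained; alternatively, it may be information determined and stored in advance, in which case the first image attribute is retrieved from the stored information once the first video frame image has been obtained. Similarly, the second image attribute may be obtained by recognition or extraction from the second video frame image after that image has been obtained, or retrieved from information determined and stored in advance. The embodiments of this application place no limitation on how the first and second image attributes are obtained.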
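Texture complexity information includes simple texture, general texture, medium texture and complex texture. A video frame image with simple texture can be encoded with fewer and simpler encoding parameters and consumes less computing power; a video frame image with complex texture must be encoded with more numerous and more complex encoding parameters to guarantee encoding quality (high frame output stability) and consumes more computing power.
Methods for analyzing texture complexity information include Euclidean distance, statistical histograms, the LBP (local binary pattern) detection algorithm and CNN feature extraction and classification, as well as picture complexity estimation based on edge features, for which common algorithms include edge detection operators such as Canny, Sobel and Roberts. This application imposes no limitation here.
Take the LBP detection algorithm as an example. LBP, the local binary pattern, is an operator that describes the local features of an image; LBP features have notable advantages such as grayscale invariance and rotation invariance. The original LBP operator is defined on a 3×3 window: the center pixel of the window is used as the threshold and the gray values of the 8 neighboring pixels are compared with it. If a neighboring pixel value is greater than the center pixel value, that position is marked 1; otherwise it is marked 0. Comparing the 8 pixels of the 3×3 neighborhood thus produces an 8-bit binary number (usually converted to a decimal number, the LBP code, with 256 possible values), which is the LBP value of the window's center pixel and reflects the texture information of that region. During detection, the detection region can be adjusted automatically according to the video resolution and the available computing power, or the picture resolution can be adjusted by downsampling. The LBP values of the detected region are then aggregated: for example, if 80% of the LBP values fall below 50 the texture is simple, if they fall between 50 and 100 it is general, if they fall between 100 and 150 it is medium, and if they are above 150 it is complex.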
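The 3×3 LBP operator and the 80% rule above can be sketched as follows. The text does not fix the bit order of the 8 neighbors or the exact aggregation rule, so two assumptions are made and marked in the comments: neighbors are read clockwise from the top-left with the first neighbor as the most significant bit, and the class is the first band (below 50, below 100, below 150) that contains at least 80% of the LBP values cumulatively. The function names are hypothetical.

```python
# Neighbor offsets, clockwise from top-left (assumed order, not fixed by the text).
_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, r, c):
    """8-bit LBP code of the pixel at (r, c) in a grayscale nested-list image."""
    center = img[r][c]
    code = 0
    for dr, dc in _OFFSETS:
        bit = 1 if img[r + dr][c + dc] > center else 0  # strictly greater -> 1
        code = (code << 1) | bit
    return code


def lbp_values(img):
    """LBP codes of all interior pixels (border pixels lack a full 3x3 window)."""
    return [lbp_code(img, r, c)
            for r in range(1, len(img) - 1)
            for c in range(1, len(img[0]) - 1)]


def texture_level(values, share=0.8):
    """Classify texture from LBP codes using a cumulative 80% rule (assumption)."""
    if not values:
        raise ValueError("no LBP values to classify")
    for bound, label in ((50, "simple"), (100, "general"), (150, "medium")):
        if sum(v < bound for v in values) >= share * len(values):
            return label
    return "complex"
```

In practice the input would be a downsampled luma plane or a selected detection region, as the paragraph notes, to keep the cost of this analysis small relative to encoding.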
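Scene complexity information includes simple scenes, general scenes, medium scenes and complex scenes. A video frame image of a simple scene can be encoded with fewer and simpler encoding parameters and consumes less computing power; a video frame image of a complex scene must be encoded with more numerous and more complex encoding parameters to guarantee encoding quality (high frame output stability) and consumes more computing power. For example, simple scenes include desktop and meeting scenes, general scenes include show and TV drama scenes, medium scenes include animation and outdoor scenes, and complex scenes include game and swimming scenes.
Scene complexity can be classified with a feature extraction and classification network based on deep learning and neural networks into four classes: simple, general, medium and complex scenes. Common algorithms include CNN, decision trees and random forests.
Taking deep learning as an example, video frame images are recognized with algorithm frameworks such as CNN or CNN+RNN, and video frame images are used as the training samples of the deep learning models. With enough training samples, scenes with distinctive picture features (for example game, football, basketball and animation scenes) can be recognized by a pure CNN model (convolution layers, filtering, pooling layers and so on) with an accuracy above 99%; scenes with more scattered picture features (for example TV drama, outdoor sports, food and travel scenes) can be recognized by combining CNN with RNN+LSTM for time-domain plus frequency-domain analysis with an accuracy of about 90%.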
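S140. Obtain the first computing power corresponding to the (i-1)-th video frame sequence.
The first computing power represents the computing power consumed when encoding and/or decoding the (i-1)-th video frame sequence, for example the server computing power consumed by video encoding, audio encoding, video decoding or audio decoding of the (i-1)-th video frame sequence. If this processing is performed by another device, the first computing power is correspondingly the computing power consumed by that device; the embodiments of this application impose no limitation on this.
It can be understood that the embodiments of this application aim to video-encode the i-th video frame sequence. Before that, the encoding of the (i-1)-th video frame sequence has already been completed; once it is completed, the server consumption of encoding the (i-1)-th video frame sequence (the amount of server computation occupied during video encoding) is calculated and taken as the first computing power corresponding to the (i-1)-th video frame sequence. Likewise, after the i-th video frame sequence has been encoded, the second computing power consumed in encoding it is calculated, and this second computing power is a factor in adjusting the video encoding parameters of the (i+1)-th video frame sequence. Similarly, the server consumption of video-decoding the (i-1)-th video frame sequence (the amount of server computation occupied during video decoding) may be calculated and taken as the first computing power corresponding to the (i-1)-th video frame sequence.
It should be noted that when the first computing power is the computing power consumed in decoding the (i-1)-th video frame sequence, it can represent the texture complexity and/or scene complexity of the video frame images in that sequence: the larger the first computing power, the higher the texture complexity and/or scene complexity of the video frame images in the (i-1)-th video frame sequence.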
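S150. Determine the encoding parameters of the i-th video frame sequence according to at least one of the first computing power, the first image attribute and the second image attribute.
It can be understood that the encoding parameters of the i-th video frame sequence are determined by adjusting the video encoding parameters of the (i-1)-th video frame sequence, based on at least one of the first computing power corresponding to the (i-1)-th video frame sequence, the first image attribute and the second image attribute.
If the first computing power is greater than the first computing power threshold, the encoding parameters of the i-th video frame sequence are lowered. The first computing power threshold is a computing power threshold that the server provides for encoding. For example, if the first computing power threshold is 1000 and the first computing power corresponding to the (i-1)-th video frame sequence exceeds 1000, the computing power used to encode the i-th video frame sequence must be reduced, and to reduce it the encoding parameters of the i-th video frame sequence must be lowered.
If the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is greater than the attribute level of the second image attribute, the encoding parameters of the i-th video frame sequence are lowered, where the first computing power threshold is greater than the second computing power threshold. The first and second computing power thresholds are two different computing power thresholds that the server provides for encoding. For example, if the computing power range that the server provides for encoding is 800 to 1000, the first computing power threshold may be 1000 and the second 800; a first computing power of 900 for the (i-1)-th video frame sequence is then greater than the second threshold and less than the first. The higher the attribute level (the higher the texture complexity and/or scene complexity), the more computing power is required. An attribute level of the first image attribute that is greater than that of the second image attribute indicates that encoding the i-th video frame sequence would need more computing power than encoding the (i-1)-th video frame sequence did, so the computing power for encoding the i-th video frame sequence must be reduced, and to reduce it the encoding parameters of the i-th video frame sequence must be lowered.
If the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is equal to the attribute level of the second image attribute, the encoding parameters of the i-th video frame sequence are kept the same as those of the (i-1)-th video frame sequence. With the same thresholds as above (first threshold 1000, second threshold 800) and a first computing power of 900, equal attribute levels indicate that encoding the i-th video frame sequence will need the same or nearly the same computing power as encoding the (i-1)-th video frame sequence, so the encoding parameters of the i-th video frame sequence need not be adjusted.
If the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is less than the attribute level of the second image attribute, the encoding parameters of the i-th video frame sequence are raised. With the same thresholds as above and a first computing power of 900, an attribute level of the first image attribute that is less than that of the second image attribute indicates that encoding the i-th video frame sequence will need less computing power than encoding the (i-1)-th video frame sequence did, so the encoding parameters of the i-th video frame sequence can be raised to improve encoding quality.
If the first computing power is less than the second computing power threshold, the encoding parameters of the i-th video frame sequence are raised. The second computing power threshold is a computing power threshold that the server provides for encoding. For example, if the second computing power threshold is 800 and the first computing power of encoding the (i-1)-th video frame sequence is 700, the first computing power is less than the second threshold, and the encoding parameters of the i-th video frame sequence can be raised to improve encoding quality.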
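The five cases of step S150 reduce to one small decision function. The sketch below uses the example thresholds 800 and 1000 from the text as defaults; the names `Adjust` and `decide_adjustment` are hypothetical, attribute levels are assumed to be comparable integers, and a first computing power exactly equal to a threshold, which the text does not cover, is treated here as lying inside the band.

```python
from enum import Enum


class Adjust(Enum):
    LOWER = -1   # lower the encoding parameters of sequence i
    KEEP = 0     # keep them equal to those of sequence i-1
    RAISE = 1    # raise them


def decide_adjustment(first_power, level_cur, level_prev, low=800, high=1000):
    """Decide how sequence i's parameters move relative to sequence i-1.

    first_power: computing power consumed by sequence i-1.
    level_cur / level_prev: attribute levels of the first / second image attribute.
    low / high: second / first computing power threshold (high > low).
    """
    if first_power > high:
        return Adjust.LOWER
    if first_power < low:
        return Adjust.RAISE
    # Inside the band: the attribute levels decide (boundary values land here
    # by assumption, since the text leaves equality unspecified).
    if level_cur > level_prev:
        return Adjust.LOWER
    if level_cur < level_prev:
        return Adjust.RAISE
    return Adjust.KEEP
```

Keeping the decision separate from the parameter being adjusted lets the same rule drive each of the individual encoding parameters.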
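It should be noted that the encoding parameters may be video encoding parameters, audio encoding parameters and so on. When they are video encoding parameters, they may include, but are not limited to, the coding unit division depth, the prediction unit division depth, the motion estimation parameter, the motion compensation parameter and the transform unit division depth. The embodiments of this application place no limitation on what the encoding parameters include.
S160. Encode the i-th video frame sequence according to its encoding parameters to obtain the i-th encoded video segment.
It can be understood that encoding the i-th video frame sequence according to its encoding parameters means encoding all video frame images in the i-th video frame sequence according to those parameters to obtain an encoded image for each video frame image; all the encoded images together form the i-th encoded video segment.
It should be noted that step S160 is optional in the embodiments of this application. In one possible case, a computer device performs steps S110 to S150 to determine the encoding parameters of the i-th video frame sequence and then transmits them to a separate video encoding device, which encodes the i-th video frame sequence of the input video with those parameters to obtain the i-th encoded video segment. In another possible case, the same computer device performs steps S110 to S160, that is, it both determines the encoding parameters of the i-th video frame sequence and encodes the i-th video frame sequence of the input video accordingly to obtain the i-th encoded video segment. The embodiments of this application place no limitation on which entity encodes the i-th video frame sequence.
It should be noted that, in the embodiments of this application, video data before encoding is called a video frame sequence and video data after encoding is called an encoded video segment. A video frame sequence consists of video frame images that are independent of one another; an encoded video segment consists of the encoded data corresponding to each video frame image, and this encoded data may represent the content of the video frame image itself or the difference between that video frame image and adjacent video frame images. A video frame sequence and an encoded video segment usually have different formats; an encoded video segment can be understood as a compressed video frame sequence, and its file size is smaller than that of the video frame sequence.
In the video processing method provided by the embodiments of this application, when an input video is encoded and the input video contains a single, fixed scene, encoding parameters are matched to that scene and the input video is encoded with the matched parameters, so that the encoding parameters meet the encoding requirements of the current video frame sequence, improving frame output stability and reducing server deployment cost. If the input video includes multiple scenes, it is segmented at the scene changes to obtain N video frame sequences, and the task of encoding the input video is decomposed into separately encoding the N video frame sequences it contains. When the current video frame sequence (the i-th video frame sequence) is encoded, its encoding parameters are determined adaptively according to at least one of the computing power consumed when encoding and/or decoding the previous video frame sequence (the (i-1)-th video frame sequence), the first image attribute of the first video frame image in the current video frame sequence, and the second image attribute of the second video frame image in the previous video frame sequence. Specifically, with frame output stability in mind, the method can decide, from the amount of computing power consumed when encoding and/or decoding the previous video frame sequence and from the relationship between the first and second image attributes, whether the encoding parameters of the current video frame sequence should stay the same as, be raised above, or be lowered below those of the previous video frame sequence. In this way, the encoding parameters of the current video frame sequence are set on the basis of the computing power consumed by the previous video frame sequence, taking into account the relationship between the image attributes of the video frame images in the two adjacent sequences, so that the determined parameters meet the encoding requirements of the current video frame sequence, improving frame output stability and reducing server deployment cost.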
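The feedback loop summarized above, in which the cost of sequence i-1 steers the parameters of sequence i, can be simulated in a few lines. Everything concrete here is an assumption made for illustration: a single integer `depth` stands in for the encoding parameters, it moves by one step at a time and is clamped to 0..3, attribute levels are integers, and `cost_fn` is a hypothetical stand-in for measuring the computing power actually consumed by an encoder.

```python
def plan_depths(levels, cost_fn, start_depth=2, low=800, high=1000, max_depth=3):
    """Simulate per-sequence parameter selection driven by the previous cost.

    levels:  attribute level of the representative frame of each sequence.
    cost_fn: cost_fn(level, depth) -> computing power consumed (toy model).
    Returns the depth chosen for every sequence.
    """
    depths = []
    prev_cost = None
    for i, level in enumerate(levels):
        if i == 0:
            depth = start_depth          # no history yet for the first sequence
        else:
            prev_level = levels[i - 1]
            in_band = low < prev_cost < high
            if prev_cost > high or (in_band and level > prev_level):
                step = -1                # too expensive, or content got harder
            elif prev_cost < low or (in_band and level < prev_level):
                step = 1                 # headroom, or content got easier
            else:
                step = 0                 # keep the previous parameters
            depth = min(max(depths[-1] + step, 0), max_depth)
        depths.append(depth)
        prev_cost = cost_fn(level, depth)  # becomes the "first computing power"
    return depths
```

With the toy cost model `lambda level, depth: 150 * level * (depth + 1)`, a jump from easy to hard content first overshoots the upper threshold and is then pulled back on the following sequence, which is the one-sequence lag inherent in using the previous sequence's cost.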
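In an optional embodiment of the video processing method provided by the embodiment corresponding to FIG. 2 of this application, refer to FIG. 3a and FIG. 3b. The encoding parameters include the coding unit division depth; FIG. 3a is a schematic flowchart of determining the coding unit division depth, and FIG. 3b is a schematic diagram of the architecture for determining it. Step S150 includes sub-steps S1511 to S1514. Specifically:
S1511. Obtain the second coding unit division depth of the (i-1)-th video frame sequence.
S1512. When the first computing power is greater than the first computing power threshold, or when the first computing power is greater than the second computing power threshold and less than the first computing power threshold and the attribute level of the first image attribute is greater than the attribute level of the second image attribute, adjust the first coding unit division depth of the i-th video frame sequence, according to the second coding unit division depth, to be lower than the second coding unit division depth.
S1513. When the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is equal to the attribute level of the second image attribute, keep the first coding unit division depth of the i-th video frame sequence equal to the second coding unit division depth of the (i-1)-th video frame sequence.
S1514. When the first computing power is less than the second computing power threshold, or when the first computing power is greater than the second computing power threshold and less than the first computing power threshold and the attribute level of the first image attribute is less than the attribute level of the second image attribute, adjust the first coding unit division depth of the i-th video frame sequence, according to the second coding unit division depth, to be higher than the second coding unit division depth.
It can be understood that, referring to FIG. 4, which is a schematic diagram of coding unit depth division provided by the embodiments of this application, when a video frame image of a video frame sequence is encoded, the image is fed to the encoder and first divided into coding tree units (CTU) of 64×64 blocks; each CTU is then divided by depth to obtain coding units (CU), that is, a coding unit is a finer-grained unit obtained by depth division of a coding tree unit. Each CTU is divided from the top down, as shown in Table 1. At depth 0 (depth=0), the CU block size stays 64×64, so one CTU contains one 64×64 CU block. At depth 1 (depth=1), the 64×64 CU block is divided into four 32×32 CU blocks, so one CTU contains four 32×32 CU blocks. At depth 2 (depth=2), each 32×32 CU block is divided into four 16×16 CU blocks, so one CTU contains sixteen 16×16 CU blocks. At depth 3 (depth=3), each 16×16 CU block is divided into four 8×8 CU blocks, so one CTU contains sixty-four 8×8 CU blocks. The greater the coding unit division depth, the more computing power is needed to encode the image; when computing power must be reduced, the coding unit division depth can be lowered.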
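Table 1
Depth | CU block size | CU blocks per CTU
0 | 64×64 | 1
1 | 32×32 | 4
2 | 16×16 | 16
3 | 8×8 | 64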
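It should be understood that the coding unit depth division shown in FIG. 4 and Table 1 is only an example, that is, the specific values in FIG. 4 and Table 1 are only examples; the embodiments of this application do not specifically limit how coding tree units and coding units are divided.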
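The growth in the number of coding units with depth can be made concrete with a short calculation. The sketch assumes the example 64×64 CTU size and depths 0 to 3, and additionally assumes that every CTU is split uniformly to the same depth, which a real encoder need not do; the function names are hypothetical.

```python
import math


def cu_layout(depth, ctu_size=64):
    """Return (CU edge length, CU blocks per CTU) for a uniform split depth."""
    if not 0 <= depth <= 3:
        raise ValueError("this sketch only covers the example depths 0..3")
    return ctu_size >> depth, 4 ** depth


def cus_per_frame(width, height, depth, ctu_size=64):
    """Number of CU blocks in one frame, counting partial CTUs at the borders."""
    ctus = math.ceil(width / ctu_size) * math.ceil(height / ctu_size)
    return ctus * cu_layout(depth, ctu_size)[1]
```

Each extra depth level quadruples the number of blocks to evaluate, which is why lowering the depth by one step is an effective way to shed computing power.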
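When either condition 1 or condition 2 is met, the computing power used to video-encode the i-th video frame sequence must be reduced. Condition 1: the first computing power is greater than the first computing power threshold. Condition 2: the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is greater than the attribute level of the second image attribute. The computing power used to video-encode the i-th video frame sequence is reduced by lowering the coding unit division depth.
It should be noted that the first and second computing power thresholds are two different computing power thresholds provided for encoding and can be defined freely according to the computing power consumption observed in practice. For example, if the computing power range that the server provides for encoding is 800 to 1000, the first computing power threshold may be set to 1000 and the second to 800.
It should be noted that the attribute level is determined from the corresponding image attribute. When the image attribute represents the texture complexity information and/or scene complexity information of its video frame image, the attribute level represents the texture complexity level and/or scene complexity level of that image; in general, the higher the attribute level, the higher the texture complexity and/or scene complexity of the corresponding video frame image.
In one possible implementation, the attribute level is determined by a neural network model from the video frame image itself and/or its image attribute. In another possible implementation, the attribute level corresponding to an image attribute is determined from the image attribute of the video frame image according to preset attribute level division rules. The embodiments of this application place no limitation on how the attribute level is determined.
For example, the attribute level of an image attribute may have three levels: level one, level two and level three; the higher the attribute level, the higher the texture complexity and/or scene complexity of the corresponding video frame image. In practice more or fewer attribute levels may of course be defined; the embodiments of this application impose no limitation on this.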
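For example, suppose the (i-1)-th video frame sequence was video-encoded with a coding unit division depth of depth=3, that is, each CTU was divided into 64 CU blocks of 8×8, and the first computing power consumed in encoding it exceeded the first computing power threshold. The computing power for video-encoding the i-th video frame sequence must then be reduced. To reduce it, the coding unit division depth for the i-th video frame sequence is adjusted to depth=2, that is, each CTU is divided into 16 CU blocks of 16×16, which meets the requirement. The coding unit division depth used for the i-th video frame sequence (first coding unit division depth, depth=2) is lower than that used for the (i-1)-th video frame sequence (second coding unit division depth, depth=3).
Suppose instead that the (i-1)-th video frame sequence was video-encoded with a coding unit division depth of depth=3, that is, each CTU was divided into 64 CU blocks of 8×8, and the first computing power of encoding it was greater than the second computing power threshold and less than the first computing power threshold, while the attribute level of the second video frame image in the (i-1)-th video frame sequence is less than that of the first video frame image in the i-th video frame sequence (the texture complexity information and scene complexity information of the first video frame image are higher than those of the second video frame image). The computing power for video-encoding the i-th video frame sequence must then be reduced. To reduce it, the coding unit division depth for the i-th video frame sequence is adjusted to depth=2, that is, each CTU is divided into 16 CU blocks of 16×16, which meets the requirement. The coding unit division depth used for the i-th video frame sequence (first coding unit division depth, depth=2) is lower than that used for the (i-1)-th video frame sequence (second coding unit division depth, depth=3).
When condition 3 is met, the computing power used to video-encode the i-th video frame sequence need not be adjusted. Condition 3: the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is equal to the attribute level of the second image attribute. Because no adjustment is needed, the first coding unit division depth of the i-th video frame sequence can be kept equal to the second coding unit division depth of the (i-1)-th video frame sequence.
For example, suppose the (i-1)-th video frame sequence was video-encoded with a coding unit division depth of depth=2, that is, each CTU was divided into 16 CU blocks of 16×16, and the first computing power of encoding it was greater than the second computing power threshold and less than the first computing power threshold, while the attribute level of the second video frame image in the (i-1)-th video frame sequence is equal to that of the first video frame image in the i-th video frame sequence (the texture complexity information and scene complexity information of the first video frame image are unchanged relative to the second video frame image). The computing power for video-encoding the i-th video frame sequence need not be adjusted, and the first coding unit division depth of the i-th video frame sequence can be kept equal to the second coding unit division depth of the (i-1)-th video frame sequence: the i-th video frame sequence is encoded with depth=2, that is, each CTU is divided into 16 CU blocks of 16×16.
When either condition 4 or condition 5 is met, the computing power used to video-encode the i-th video frame sequence can be raised to improve frame output stability. Condition 4: the first computing power is less than the second computing power threshold. Condition 5: the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute is less than the attribute level of the second image attribute. The computing power used to video-encode the i-th video frame sequence is raised by increasing the coding unit division depth.
For example, suppose the (i-1)-th video frame sequence was video-encoded with a coding unit division depth of depth=2, that is, each CTU was divided into 16 CU blocks of 16×16, and the first computing power of encoding it was less than the second computing power threshold. To improve frame output stability, the computing power for video-encoding the i-th video frame sequence can be raised. To raise it, the coding unit division depth for the i-th video frame sequence is adjusted to depth=3, that is, each CTU is divided into 64 CU blocks of 8×8, which meets the requirement. The coding unit division depth used for the i-th video frame sequence (first coding unit division depth, depth=3) is higher than that used for the (i-1)-th video frame sequence (second coding unit division depth, depth=2).
Suppose instead that the (i-1)-th video frame sequence was video-encoded with a coding unit division depth of depth=2, that is, each CTU was divided into 16 CU blocks of 16×16, and the first computing power of encoding it was greater than the second computing power threshold and less than the first computing power threshold, while the attribute level of the second video frame image in the (i-1)-th video frame sequence is greater than that of the first video frame image in the i-th video frame sequence (the texture complexity information and scene complexity information of the first video frame image are lower than those of the second video frame image). To improve frame output stability, the computing power for video-encoding the i-th video frame sequence can be raised, so the coding unit division depth for the i-th video frame sequence is adjusted to depth=3, that is, each CTU is divided into 64 CU blocks of 8×8, which meets the requirement. The coding unit division depth used for the i-th video frame sequence (first coding unit division depth, depth=3) is higher than that used for the (i-1)-th video frame sequence (second coding unit division depth, depth=2).
With the method provided by the embodiments of this application, when the current video frame sequence is encoded, its coding unit division depth is adjusted adaptively according to the server computing power consumed by the previous video frame sequence, the first image attribute of the first video frame image in the current video frame sequence, and the second image attribute of the second video frame image in the previous video frame sequence, so that the adjusted coding unit division depth meets the encoding requirements of the current video frame sequence, improving frame output stability and reducing server deployment cost.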
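The depth changes in the examples above (3 to 2 when lowering, 2 to 2 when keeping, 2 to 3 when raising) can be sketched as a clamped one-step update. The step size of one level and the clamping to the example range 0..3 are assumptions of this sketch, since the text only requires the new depth to be lower than, equal to, or higher than the previous one; `next_depth` is a hypothetical name.

```python
def next_depth(prev_depth, direction, min_depth=0, max_depth=3):
    """Derive the first division depth (sequence i) from the second (sequence i-1).

    direction: -1 to lower, 0 to keep, +1 to raise, as decided from the first
    computing power and the two attribute levels.
    The result is clamped so that it stays within [min_depth, max_depth].
    """
    if direction not in (-1, 0, 1):
        raise ValueError("direction must be -1, 0 or +1")
    return min(max(prev_depth + direction, min_depth), max_depth)
```

At the limits of the range the clamp means a requested change has no effect, so an implementation that must always shed computing power at the minimum depth would need to fall back on another encoding parameter.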
在本申请的图2对应的实施例提供的视频处理方法的一个可选实施例中,请参阅图5a和图5b,编码参数包括预测单元划分深度,图5a为预测单元划分深度的确定流程示意图,图5b为预测单元划分深度的确定架构示意图;步骤S150包括子步骤S1521至子步骤S1524。具体的:
S1521、获取第i-1个视频帧序列的第二预测单元划分深度。
S1522、在第一算力大于第一算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级的情况下,根据第二预测单元划分深度,调整第i个视频帧序列的第一预测单元划分深度,至低于第二预测单元划分深度。
S1523、在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级等于第二图像属性的属性等级的情况下,保持第i个视频帧序列的第一预测单元划分深度与第i-1个视频帧序列的第二预测单元划分深度相等。
S1524、在第一算力小于第二算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级小于第二图像属性的属性等级的情况下,根据第二预测单元划分深度,调整第i个视频帧序列的第一预测单元划分深度,至高于第二预测单元划分深度。
可以理解的是,请参阅图6,图6是本申请实施例提供的预测单元深度划分的示意图。对视频帧序列中的视频帧图像进行编码时,将视频帧图像送入编码器,先按照64×64块大小分割成一个个编码树单元CTU,然后对每个CTU进行深度划分,得到编码单元CU。每个编码单元CU包括预测单元(Predict Unit,PU)和变换单元(TransformUnit,TU),即预测单元为对编码单元进行深度划分得到的更细粒度的单元。
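Each CU is partitioned top-down. At depth 0 (depth=0), the prediction unit PU in each CU has the same size as the coding unit CU: a 64×64 CU gives a 64×64 PU, a 32×32 CU gives a 32×32 PU, a 16×16 CU gives a 16×16 PU, and an 8×8 CU gives an 8×8 PU. At depth 1 (depth=1), the CU is split into 2 PUs, using one of 2 symmetric or 4 asymmetric partition modes. For a 64×64 CU, a symmetric split gives two 64×32 PUs or two 32×64 PUs; an asymmetric split gives PUs of 64×16 and 64×48, of 64×48 and 64×16, of 16×64 and 48×64, or of 48×64 and 16×64. At depth 2 (depth=2), the CU is split into 4 PUs: a 64×64 CU gives 32×32 PUs, a 32×32 CU gives 16×16 PUs, a 16×16 CU gives 8×8 PUs, and an 8×8 CU gives 4×4 PUs. The greater the prediction unit partition depth, the more computing power is needed to encode an image, so the prediction unit partition depth can be reduced when computing power needs to be reduced.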
应理解,上述图6所示的预测单元深度划分示意图仅为示例,即图6中所示的具体数值仅为示例,在本申请实施例中并不具体限定编码单元和预测单元的划分方式。
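The prediction unit layouts described for Fig. 6 can be enumerated with a small helper. This is only a sketch of the sizes listed in the text: `pu_partitions` is a hypothetical name, sizes are written as (width, height), and the one-quarter / three-quarter ratio of the asymmetric modes is inferred from the 64×16 and 64×48 example.

```python
def pu_partitions(cu_size, depth):
    """List candidate PU layouts for a square CU.

    Each layout is a list of (width, height) tuples that tile the CU.
    """
    s = cu_size
    if depth == 0:
        return [[(s, s)]]                      # one PU, same size as the CU
    if depth == 1:
        q = s // 4                             # short side of an asymmetric PU
        return [
            [(s, s // 2)] * 2,                 # symmetric: two s x s/2 PUs
            [(s // 2, s)] * 2,                 # symmetric: two s/2 x s PUs
            [(s, q), (s, s - q)],              # asymmetric layouts
            [(s, s - q), (s, q)],
            [(q, s), (s - q, s)],
            [(s - q, s), (q, s)],
        ]
    if depth == 2:
        return [[(s // 2, s // 2)] * 4]        # four square PUs
    raise ValueError("this sketch only covers depth 0, 1 and 2")


for layout in pu_partitions(64, 1):
    print(layout)
```

Every layout covers exactly the area of the CU, so a deeper split adds candidate shapes for the encoder to test rather than changing the amount of pixel data.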
当满足条件一或条件二之一时,需要降低对第i个视频帧序列进行视频编码时的算力。条件一:第一算力大于第一算力阈值。条件二:第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级。通过减少预测单元的划分深度,实现降低对第i个视频帧序列进行视频编码时的算力。
关于第一算力阈值、第二算力阈值的解释内容已在上文中介绍,详细可参见上文的相关介绍内容。同样的,关于属性等级的解释内容已在上文中介绍,详细可参见上文的相关介绍内容。
举例说明,对第i-1个视频帧序列进行视频编码时的预测单元划分深度为depth=2,即将每个CU划分为4个PU块,此时对第i-1个视频帧序列进行视频编码的第一算力超过第一算力阈值,需降低第i个视频帧序列的视频编码的算力。为降低第i个视频帧序列的视频编码的算力,则将对第i个视频帧序列进行视频编码时的预测单元划分深度调整为depth=1,即将每个CU划分为2个PU块,满足降低第i个视频帧序列的视频编码的算力的要求。对第i个视频帧序列进行视频编码时的预测单元划分深度(第一预测单元划分深度depth=1)低于对第i-1个视频帧序列进行视频编码时的预测单元划分深度(第二预测单元划分深度depth=2)。
对第i-1个视频帧序列进行视频编码时的预测单元划分深度为depth=2,即将每个CU划分为4个PU块,此时对第i-1个视频帧序列进行视频编码的第一算力大于第二算力阈值且小于第一算力阈值。而第i-1个视频帧序列中第二视频帧图像的属性等级小于第i个视频帧序列中第一视频帧图像的属性等级(第一视频帧图像的纹理复杂度信息及场景复杂度信息相对于第二视频帧图像的纹理复杂度信息及场景复杂度信息升高),需降低第i个视频帧序列的视频编码的算力。为降低第i个视频帧序列的视频编码的算力,则将对第i个视频帧序列进行视频编码时的预测单元划分深度调整为depth=1,即将每个CU划分为2个PU块,满足降低第i个视频帧序列的视频编码的算力的要求。对第i个视频帧序列进行视频编码时的预测单元划分深度(第一预测单元划分深度depth=1)低于对第i-1个视频帧序列进行视频编码时的预测单元划分深度(第二预测单元划分深度depth=2)。
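When condition three is met, the computing power for video-encoding the i-th video frame sequence does not need to be adjusted. Condition three: the first computing power is greater than the second computing power threshold and less than the first computing power threshold, and the attribute level of the first image attribute equals the attribute level of the second image attribute. Because no adjustment is needed, the first prediction unit partition depth of the i-th video frame sequence can be kept equal to the second prediction unit partition depth of the (i-1)-th video frame sequence.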
举例说明,对第i-1个视频帧序列进行视频编码时的预测单元划分深度为depth=1,即将每个CU划分为2个PU块,此时对第i-1个视频帧序列进行视频编码的第一算力大于第二算力阈值且小于第一算力阈值。而第i-1个视频帧序列中第二视频帧图像的属性等级等于第i个视频帧序列中第一视频帧图像的属性等级(第一视频帧图像的纹理复杂度信息及场景复杂度信息相对于第二视频帧图像的纹理复杂度信息及场景复杂度信息无变化),无需调整第i个视频帧序列的视频编码的算力。可以保持第i个视频帧序列的第一预测单元划分深度与第i-1个视频帧序列的第二预测单元划分深度相等。即对第i个视频帧序列进行视频编码时的预测单元划分深度为depth=1,即将每个CU划分为2个PU块。
当满足条件四或条件五之一时,为提高出帧稳定性,可提高第i个视频帧序列进行视频编码时的算力。条件四:第一算力小于第二算力阈值。条件五:第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级小于第二图像属性的属性等级。通过增加预测单元划分深度,实现增加对第i个视频帧序列进行视频编码时的算力。
举例说明,对第i-1个视频帧序列进行视频编码时的预测单元划分深度为depth=0,即将每个CU划分为1个PU块,此时对第i-1个视频帧序列进行视频编码的第一算力小于第二算力阈值,为提高出帧稳定性,可提高对第i个视频帧序列进行视频编码时的算力。为提高对第i个视频帧序列进行视频编码时的算力,则将对第i个视频帧序列进行视频编码时的预测单元划分深度调整为depth=1,即将每个CU划分为2个PU块,满足提高第i个视频帧序列的视频编码的算力的要求。对第i个视频帧序列进行视频编码时的预测单元划分深度(第一预测单元划分深度depth=1)高于对第i-1个视频帧序列进行视频编码时的预测单元划分深度(第二预测单元划分深度depth=0)。
对第i-1个视频帧序列进行视频编码时的预测单元划分深度为depth=0,即将每个CU划分为1个PU块,此时对第i-1个视频帧序列进行视频编码的第一算力大于第二算力阈值且小于第一算力阈值。而第i-1个视频帧序列中第二视频帧图像的属性等级大于第i个视频帧序列中第一视频帧图像的属性等级(第一视频帧图像的纹理复杂度信息及场景复杂度信息相对于第二视频帧图像的纹理复杂度信息及场景复杂度信息降低),为提高出帧稳定性,可提高对第i个视频帧序列进行视频编码时的算力,则将第i个视频帧序列进行视频编码时的预测单元划分深度调整为depth=1,即将每个CU划分为2个PU块,满足提高第i个视频帧序列的视频编码的算力的要求。对第i个视频帧序列进行视频编码时的预测单元划分深度(第一预测单元划分深度depth=1)高于对第i-1个视频帧序列进行视频编码时的预测单元划分深度(第二预测单元划分深度depth=0)。
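With the method provided by the embodiments of this application, when the current video frame sequence is encoded, its prediction unit partition depth is adaptively adjusted according to the server computing power consumed by the previous video frame sequence, the first image attribute of the first video frame image in the current video frame sequence, and the second image attribute of the second video frame image in the previous video frame sequence. The adjusted prediction unit partition depth then meets the encoding requirements of the current video frame sequence, which improves frame output stability and reduces server deployment cost.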
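In an optional embodiment of the video processing method provided by the embodiment corresponding to Fig. 2 of this application, referring to Fig. 7a and Fig. 7b, the coding parameters include a motion estimation parameter and a motion compensation parameter. Fig. 7a is a schematic flowchart of determining the motion estimation parameter and the motion compensation parameter, and Fig. 7b is a schematic architecture diagram of determining them. Step S150 includes sub-steps S1531 to S1534. Specifically: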
S1531、获取第i-1个视频帧序列的第二运动估计参数及第二运动补偿参数。
S1532、在第一算力大于第一算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级的情况下,根据第二运动估计参数调整第i个视频帧序列的第一运动估计参数,根据第二运动补偿参数调整第i个视频帧序列的第一运动补偿参数。
其中,第一运动估计参数通过控制运动搜索的第一最大像素范围及第一亚像素估计复杂度确定,第二运动估计参数通过控制运动搜索的第二最大像素范围及第二亚像素估计复杂度确定;第一最大像素范围小于第二最大像素范围,第一亚像素估计复杂度小于第二亚像素估计复杂度。第一运动补偿参数通过第一搜索范围确定,第二运动补偿参数通过第二搜索范围确定;第一搜索范围小于第二搜索范围。
S1533、在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级等于第二图像属性的属性等级的情况下,保持第i个视频帧序列的第一运动估计参数与第i-1个视频帧序列的第二运动估计参数相等,以及保持第i个视频帧序列的第一运动补偿参数与第i-1个视频帧序列的第二运动补偿参数相等。
其中,第一运动估计参数通过控制运动搜索的第一最大像素范围及第一亚像素估计复杂度确定,第二运动估计参数通过控制运动搜索的第二最大像素范围及第二亚像素估计复杂度确定;第一最大像素范围等于第二最大像素范围,第一亚像素估计复杂度等于第二亚像素估计复杂度。第一运动补偿参数通过第一搜索范围确定,第二运动补偿参数通过第二搜索范围确定;第一搜索范围等于第二搜索范围。
S1534、在第一算力小于第二算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级小于第二图像属性的属性等级的情况下,根据第二运动估计参数调整第i个视频帧序列的第一运动估计参数,以及根据第二运动补偿参数调整第i个视频帧序列的第一运动补偿参数。
其中,第一运动估计参数通过控制运动搜索的第一最大像素范围及第一亚像素估计复杂度确定,第二运动估计参数通过控制运动搜索的第二最大像素范围及第二亚像素估计复杂度确定;第一最大像素范围大于第二最大像素范围,第一亚像素估计复杂度大于第二亚像素估计复杂度。第一运动补偿参数通过第一搜索范围确定,第二运动补偿参数通过第二搜索范围确定;第一搜索范围大于第二搜索范围。
可以理解的是,请参阅图8,图8是本申请实施例提供的运动估计示意图。运动估计为针对当前帧的某个区域(A)在参考帧中寻找一个合适的匹配区域(B),运动估计参数相应地即是针对当前帧中的某个区域在参考帧中寻找与其匹配的区域时依据的寻找参数。参考帧可以是当前帧之前的帧,也可以是当前帧后面的帧。运动估计参数包括用于控制运动搜
索的最大像素范围和亚像素估计复杂度。控制运动搜索的最大像素范围是以像素为单位的控制最大运动搜索范围,包括:DIA(diamond菱形)、hex(hexagon六角形)、umh(uneven multi-hex非偶多六角形)、esa(exhaustive穷举)、tesa(transformed exhaustive改进的穷举);其中,从DIA、hex、umh、esa到tesa所需要的算力依次增加,例如DIA消耗的算力最小,tesa消耗的算力最大。亚像素估计复杂度用于表征运动估计的复杂度,其分为0-10共11个等级,复杂度越高消耗的算力越大,例如亚像素估计复杂度10消耗的算力大于亚像素估计复杂度0消耗的算力。
请参阅图9,图9是本申请实施例提供的运动补偿的示意图。运动补偿的目的在于找到区域A和区域B的不同。运动补偿参数包括搜索范围,搜索范围越大消耗的算力越大。通过运动补偿和运动估计预测性编码会产生一些运动矢量和残差。运动矢量就是某些区域针对参考帧的运动轨迹,而残差就是这些区域运动后产生的预测帧和当前帧之间的不同。
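Because the search modes and the sub-pixel estimation complexity levels are both ordered by computing cost, lowering or raising the motion estimation parameter can be viewed as moving along two ladders. The sketch below assumes exactly that. `shift_motion_params` and its step arguments are hypothetical, and the text does not fix how many steps to move (its examples go from tesa to umh and from umh to esa).

```python
# Search modes from cheapest to most expensive, as ordered in the text.
SEARCH_METHODS = ["dia", "hex", "umh", "esa", "tesa"]
SUBPEL_MIN, SUBPEL_MAX = 0, 10   # the 11 complexity levels named in the text


def shift_motion_params(method, subpel, direction,
                        method_step=1, subpel_step=1):
    """Move the motion estimation parameter down (-1) or up (+1).

    Values are clamped at the ends of both ladders, so a parameter that is
    already at the cheapest setting cannot be lowered any further.
    """
    if direction not in (-1, 0, 1):
        raise ValueError("direction must be -1, 0 or +1")
    idx = SEARCH_METHODS.index(method.lower()) + direction * method_step
    idx = max(0, min(len(SEARCH_METHODS) - 1, idx))
    new_subpel = subpel + direction * subpel_step
    new_subpel = max(SUBPEL_MIN, min(SUBPEL_MAX, new_subpel))
    return SEARCH_METHODS[idx], new_subpel


# Lowering from (tesa, 10) by two steps on each ladder.
print(shift_motion_params("tesa", 10, -1, method_step=2, subpel_step=2))
```

The search range of the motion compensation parameter could be scaled in the same direction. It is left out here because the text gives no concrete values for it.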
当满足条件一或条件二之一时,需要降低对第i个视频帧序列进行视频编码时的算力。条件一:第一算力大于第一算力阈值。条件二:第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级。通过降低第i个视频帧序列的第一运动估计参数及第一运动补偿参数,实现降低对第i个视频帧序列进行视频编码时的算力。
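For example, the second motion estimation parameter used when video-encoding the (i-1)-th video frame sequence has a second maximum pixel range for controlling motion search of tesa and a second sub-pixel estimation complexity of level 10, and the first computing power consumed in video-encoding the (i-1)-th video frame sequence exceeds the first computing power threshold, so the computing power for video-encoding the i-th video frame sequence needs to be reduced. To reduce it, the first motion estimation parameter used when video-encoding the i-th video frame sequence is adjusted to a first maximum pixel range for controlling motion search of umh and a first sub-pixel estimation complexity of level 8, which meets the requirement of reducing the computing power for video-encoding the i-th video frame sequence. The first maximum pixel range (umh) is smaller than the second maximum pixel range (tesa), and the first sub-pixel estimation complexity (level 8) is lower than the second sub-pixel estimation complexity (level 10). At the same time, the first motion compensation parameter used when video-encoding the i-th video frame sequence is reduced, so that the first search range in the first motion compensation parameter is smaller than the second search range in the second motion compensation parameter used when video-encoding the (i-1)-th video frame sequence.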
对第i-1个视频帧序列进行视频编码时采用的第二运动估计参数包括控制运动搜索的第二最大像素范围为tesa,第二亚像素估计复杂度为10级,此时对第i-1个视频帧序列进行视频编码的第一算力大于第二算力阈值且小于第一算力阈值。而第i-1个视频帧序列中第二视频帧图像的属性等级小于第i个视频帧序列中第一视频帧图像的属性等级(第一视频帧图像的纹理复杂度信息及场景复杂度信息相对于第二视频帧图像的纹理复杂度信息及场景复杂度信息升高),需降低第i个视频帧序列的视频编码的算力。为降低第i个视频帧序列的视频编码的算力,则将对第i个视频帧序列进行视频编码时采用的第一运动估计参数调整为控制运动搜索的第一最大像素范围为umh,第一亚像素估计复杂度为8级,满足降低第i个视频帧序列的视频编码的算力的要求。第一最大像素范围(umh)小于第二最大像素范围(tesa),第一亚像素估计复杂度(8级)小于第二亚像素估计复杂度(10级)。同时,降低对第i个视频帧序列进行视频编码时采用的第一运动补偿参数,使得第一运动
补偿参数中的第一搜索范围小于对第i-1个视频帧序列进行视频编码时采用的第二运动补偿参数的第二搜索范围。
当满足条件三时,无需调整第i个视频帧序列进行视频编码时的算力。条件三:第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级等于第二图像属性的属性等级。由于无需调整对第i个视频帧序列进行视频编码时的算力,所以可以保持第i个视频帧序列的第一运动估计参数与第i-1个视频帧序列的第二运动估计参数相等,以及保持第i个视频帧序列的第一运动补偿参数与第i-1个视频帧序列的第二运动补偿参数相等。
举例说明,对第i-1个视频帧序列进行视频编码时采用的第二运动估计参数包括控制运动搜索的第二最大像素范围为esa,第二亚像素估计复杂度为9级,此时对第i-1个视频帧序列进行视频编码的第一算力大于第二算力阈值且小于第一算力阈值。而第i-1个视频帧序列中第二视频帧图像的属性等级等于第i个视频帧序列中第一视频帧图像的属性等级(第一视频帧图像的纹理复杂度信息及场景复杂度信息相对于第二视频帧图像的纹理复杂度信息及场景复杂度信息无变化),无需调整第i个视频帧序列的视频编码的算力。可以保持第i个视频帧序列的第一运动估计参数与第i-1个视频帧序列的第二运动估计参数相等,以及保持第i个视频帧序列的第一运动补偿参数与第i-1个视频帧序列的第二运动补偿参数相等;即对第i个视频帧序列进行视频编码时采用的第一运动估计参数包括控制运动搜索的第一最大像素范围为esa,第一亚像素估计复杂度为9级。同时,保持第一运动补偿参数中的第一搜索范围等于对第i-1个视频帧序列进行视频编码时采用的第二运动补偿参数中的第二搜索范围。
当满足条件四或条件五之一时,为提高出帧稳定性,可提高对第i个视频帧序列进行视频编码时的算力。条件四:第一算力小于第二算力阈值。条件五:第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级小于第二图像属性的属性等级。通过提高第i个视频帧序列的第一运动估计参数及第一运动补偿参数,实现提高对第i个视频帧序列进行视频编码时的算力。
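For example, the second motion estimation parameter used when video-encoding the (i-1)-th video frame sequence has a second maximum pixel range for controlling motion search of umh and a second sub-pixel estimation complexity of level 8, and the first computing power consumed in video-encoding the (i-1)-th video frame sequence is less than the second computing power threshold, so the computing power for video-encoding the i-th video frame sequence can be increased to improve frame output stability. To increase it, the first motion estimation parameter used when video-encoding the i-th video frame sequence is adjusted to a first maximum pixel range for controlling motion search of esa and a first sub-pixel estimation complexity of level 9, which meets the requirement of increasing the computing power for video-encoding the i-th video frame sequence. The first maximum pixel range (esa) is larger than the second maximum pixel range (umh), and the first sub-pixel estimation complexity (level 9) is higher than the second sub-pixel estimation complexity (level 8). At the same time, the first motion compensation parameter used when video-encoding the i-th video frame sequence is increased, so that the first search range in the first motion compensation parameter is larger than the second search range in the second motion compensation parameter used when video-encoding the (i-1)-th video frame sequence.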
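In another example, the second motion estimation parameter used when video-encoding the (i-1)-th video frame sequence has a second maximum pixel range for controlling motion search of umh and a second sub-pixel estimation complexity of level 8, and the first computing power consumed in video-encoding the (i-1)-th video frame sequence is greater than the second computing power threshold and less than the first computing power threshold. The attribute level of the second video frame image in the (i-1)-th video frame sequence is higher than that of the first video frame image in the i-th video frame sequence (the texture complexity information and scene complexity information of the first video frame image are lower than those of the second video frame image), so the computing power for video-encoding the i-th video frame sequence can be increased to improve frame output stability. To increase it, the first motion estimation parameter used when video-encoding the i-th video frame sequence is adjusted to a first maximum pixel range for controlling motion search of esa and a first sub-pixel estimation complexity of level 9, which meets the requirement of increasing the computing power for video-encoding the i-th video frame sequence. The first maximum pixel range (esa) is larger than the second maximum pixel range (umh), and the first sub-pixel estimation complexity (level 9) is higher than the second sub-pixel estimation complexity (level 8). At the same time, the first motion compensation parameter used when video-encoding the i-th video frame sequence is increased, so that the first search range in the first motion compensation parameter is larger than the second search range in the second motion compensation parameter used when video-encoding the (i-1)-th video frame sequence.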
本申请实施例提供的方法,对当前的视频帧序列进行编码时,根据前一个视频帧序列消耗的服务器算力、当前的视频帧序列中第一视频帧图像的第一图像属性、及前一个视频帧序列中第二视频帧图像的第二图像属性,适应性地调整当前的视频帧序列的运动估计参数及运动补偿参数,从而使得调整后的运动估计参数及运动补偿参数可以满足对当前的视频帧序列的编码需求,提高出帧稳定性,且降低服务器部署成本。
在本申请的图2对应的实施例提供的视频处理方法的一个可选实施例中,请参阅图10a和图10b,编码参数包括变换单元划分深度,图10a为变换单元划分深度的确定流程示意图,图10b为变换单元划分深度的确定架构示意图;步骤S150包括子步骤S1541至子步骤S1544。具体的:
S1541、获取第i-1个视频帧序列的第二变换单元划分深度。
S1542、在第一算力大于第一算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级的情况下,根据第二变换单元划分深度,调整第i个视频帧序列的第一变换单元划分深度,至低于第二变换单元划分深度。
S1543、在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级等于第二图像属性的属性等级的情况下,保持第i个视频帧序列的第一变换单元划分深度与第i-1个视频帧序列的第二变换单元划分深度相等。
S1544、在第一算力小于第二算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级小于第二图像属性的属性等级的情况下,则根据第二变换单元划分深度,调整第i个视频帧序列的第一变换单元划分深度,至高于第二变换单元划分深度。
可以理解的是,请参阅图11,图11是本申请实施例提供的变换单元深度划分的示意图。对视频帧序列中的视频帧图像进行编码时,将视频帧图像送入编码器,先按照64×64块大小分割成一个个编码树单元CTU,然后对每个CTU进行深度划分,得到编码单元CU。每个编码单元CU包括预测单元PU和变换单元TU。即变换单元为对编码单元进行深度划分得到的更细粒度的单元
对每个CU进行深度划分采用由上向下的划分规则。深度为0时,即depth=0,保持每个CU中变换单元TU的大小与CU的大小相等,例如CU块大小为64×64,则TU的大小也是64×64;CU块大小为32×32,则TU的大小也是32×32;CU块大小为16×16,则
TU的大小也是16×16;CU块大小为8×8,则TU的大小也是8×8。深度为1时,即depth=1,将CU块划分为4个TU块,例如CU块大小为64×64,则一个TU的大小是32×32;CU块大小为32×32,则TU的大小是16×16;CU块大小为16×16,则TU的大小是8×8;CU块大小为8×8,则TU的大小是4×4。由于变换单元的划分深度越大,则对图像进行编码时需要的算力就越大,当需要降低算力时,可以降低变换单元的划分深度。
应理解,上述图11所示的变换单元深度划分示意图仅为示例,即图11中所示的具体数值仅为示例,在本申请实施例中并不具体限定编码单元和变换单元的划分方式。
当满足条件一或条件二之一时,需要降低对第i个视频帧序列进行视频编码时的算力。条件一:第一算力大于第一算力阈值。条件二:第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级。通过减少变换单元划分深度,实现降低对第i个视频帧序列进行视频编码时的算力。
关于第一算力阈值、第二算力阈值的解释内容已在上文中介绍,详细可参见上文的相关介绍内容。同样的,关于属性等级的解释内容已在上文中介绍,详细可参见上文的相关介绍内容。
举例说明,对第i-1个视频帧序列进行视频编码时的变换单元划分深度为depth=1,即将每个CU划分为4个TU块,此时对第i-1个视频帧序列进行视频编码的第一算力超过第一算力阈值,需降低第i个视频帧序列的视频编码的算力。为降低第i个视频帧序列的视频编码的算力,则将对第i个视频帧序列进行视频编码时的变换单元划分深度调整为depth=0,即将每个CU划分为1个TU块,满足降低第i个视频帧序列的视频编码的算力的要求。对第i个视频帧序列进行视频编码时的变换单元划分深度(第一变换单元划分深度depth=0)低于对第i-1个视频帧序列进行视频编码时的变换单元划分深度(第二变换单元划分深度depth=1)。
对第i-1个视频帧序列进行视频编码时的变换单元划分深度为depth=1,即将每个CU划分为4个TU块,此时对第i-1个视频帧序列进行视频编码的第一算力大于第二算力阈值且小于第一算力阈值。而第i-1个视频帧序列中的第二视频帧图像的属性等级小于第i个视频帧序列中的第一视频帧图像的属性等级(第一视频帧图像的纹理复杂度信息及场景复杂度信息相对于第二视频帧图像的纹理复杂度信息及场景复杂度信息升高),需降低第i个视频帧序列的视频编码的算力。为降低第i个视频帧序列的视频编码的算力,则将对第i个视频帧序列进行视频编码时的变换单元划分深度调整为depth=0,即将每个CU划分为1个TU块,满足降低第i个视频帧序列的视频编码的算力的要求。对第i个视频帧序列进行视频编码时的变换单元划分深度(第一变换单元划分深度depth=0)低于对第i-1个视频帧序列进行视频编码时的变换单元划分深度(第二变换单元划分深度depth=1)。
当满足条件三时,无需调整第i个视频帧序列进行视频编码时的算力。条件三:第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级等于第二图像属性的属性等级。由于无需调整第i个视频帧序列进行视频编码时的算力,所以可以保持第i个视频帧序列的第一变换单元划分深度与第i-1个视频帧序列的第二变换单元划分深度相等。
举例说明,对第i-1个视频帧序列进行视频编码时的变换单元划分深度为depth=1,即将每个CU划分为4个TU块,此时对第i-1个视频帧序列进行视频编码的第一算力大于第二算力阈值且小于第一算力阈值。而第i-1个视频帧序列中的第二视频帧图像的属性等级等于第i个视频帧序列中的第一视频帧图像的属性等级(第一视频帧图像的纹理复杂度信息及场景复杂度信息相对于第二视频帧图像的纹理复杂度信息及场景复杂度信息无变化),无需调整第i个视频帧序列的视频编码的算力。可以保持第i个视频帧序列的第一变换单元划分深度与第i-1个视频帧序列的第二变换单元划分深度相等。即对第i个视频帧序列进行视频编码时的变换单元划分深度为depth=1,即将每个CU划分为4个TU块。
当满足条件四或条件五之一时,为提高出帧稳定性,可提高对第i个视频帧序列进行视频编码时的算力。条件四:第一算力小于第二算力阈值。条件五:第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级小于第二图像属性的属性等级。通过增加变换单元划分深度,实现增加对第i个视频帧序列进行视频编码时的算力。
举例说明,对第i-1个视频帧序列进行视频编码时的变换单元划分深度为depth=0,即将每个CU划分为1个TU块,此时对第i-1个视频帧序列进行视频编码的第一算力小于第二算力阈值,为提高出帧稳定性,可提高对第i个视频帧序列进行视频编码时的算力。为提高对第i个视频帧序列进行视频编码时的算力,则将对第i个视频帧序列进行视频编码时的变换单元划分深度调整为depth=1,即将每个CU划分为4个TU块,满足提高第i个视频帧序列的视频编码的算力的要求。对第i个视频帧序列进行视频编码时的变换单元划分深度(第一变换单元划分深度depth=1)高于对第i-1个视频帧序列进行视频编码时的变换单元划分深度(第二变换单元划分深度depth=0)。
对第i-1个视频帧序列进行视频编码时的变换单元划分深度为depth=0,即将每个CU划分为1个TU块,此时对第i-1个视频帧序列进行视频编码的第一算力大于第二算力阈值且小于第一算力阈值。而第i-1个视频帧序列中的第二视频帧图像的属性等级大于第i个视频帧序列中的第一视频帧图像的属性等级(第一视频帧图像的纹理复杂度信息及场景复杂度信息相对于第二视频帧图像的纹理复杂度信息及场景复杂度信息降低),为提高出帧稳定性,可提高对第i个视频帧序列进行视频编码时的算力,则将对第i个视频帧序列进行视频编码时的变换单元划分深度调整为depth=1,即将每个CU划分为4个TU块,满足提高第i个视频帧序列的视频编码的算力的要求。对第i个视频帧序列进行视频编码时的变换单元划分深度(第一变换单元划分深度depth=1)高于对第i-1个视频帧序列进行视频编码时的变换单元划分深度(第二变换单元划分深度depth=0)。
本申请实施例提供的方法,对当前的视频帧序列进行编码时,根据前一个视频帧序列消耗的服务器算力、当前的视频帧序列中第一视频帧图像的第一图像属性、以及前一个视频帧序列中第二视频帧图像的第二图像属性,适应性地调整当前的视频帧序列的变换单元划分深度,从而使得调整后的变换单元划分深度可以满足对当前的视频帧序列的编码需求,提高出帧稳定性,且降低服务器部署成本。
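In an optional embodiment of the video processing method provided by the embodiment corresponding to Fig. 2 of this application, referring to Fig. 12, the video coding parameters of the i-th video frame sequence include the first coding unit partition depth, the first prediction unit partition depth, the first transform unit partition depth, the first maximum pixel range, the first sub-pixel estimation complexity, and the first search range. Step S160 includes sub-steps S1611 to S1619. Specifically: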
S1611、从第i个视频帧序列获取目标视频帧图像及目标视频帧图像的目标参考图像。
其中,目标参考图像为目标视频帧图像的前一视频帧图像经过编码后得到的。
S1612、根据第一编码单元划分深度对目标视频帧图像进行编码单元深度划分,得到K个第一编码单元;根据第一预测单元划分深度对K个第一编码单元进行预测单元深度划分,得到K×L个第一预测单元。
其中,K及L均为大于等于1的整数。
S1613、根据第一编码单元划分深度对目标参考图像进行编码单元深度划分,得到K个参考编码单元;根据第一预测单元划分深度对K个参考编码单元进行预测单元深度划分,得到K×L个参考预测单元。
其中,K个第一编码单元与K个参考编码单元具有对应关系,K×L个第一预测单元与K×L个参考预测单元具有对应关系。
S1614、根据第一最大像素范围及第一亚像素估计复杂度,对K×L个第一预测单元及K×L个参考预测单元进行运动估计处理,生成K×L个第一运动估计单元。
S1615、根据第一搜索范围对K×L个第一运动估计单元及K×L个参考预测单元进行运动补偿处理,生成目标帧间预测图像。
其中,目标帧间预测图像包括K×L个目标帧间预测单元。
S1616、根据目标视频帧图像及目标帧间预测图像,生成残差图像。
S1617、根据第一变换单元划分深度对残差图像进行变换单元划分,生成变换图像。
S1618、对变换图像进行量化,生成残差系数。
S1619、将残差系数进行熵编码,生成目标视频帧图像的编码值。
可以理解的是,对第i个视频帧序列进行视频编码需要对第i个视频帧序列中的全部视频帧图像进行编码,本申请实施例以对第i个视频帧序列中的任意一张视频帧图像进行编码为例进行说明。
请参阅图13,图13是本申请实施例提供的对目标视频帧图像进行编码的示意图。从第i个视频帧序列中获取任意一张视频帧图像作为目标视频帧图像,并将目标视频帧图像的前一视频帧图像经过编码后得到的图像作为目标视频帧图像的目标参考图像。将目标视频帧图像送入到编码器,根据第一编码单元划分深度对目标视频帧图像进行编码单元深度划分,得到K个第一编码单元;根据第一预测单元划分深度对K个第一编码单元分别进行预测单元深度划分,得到K×L个第一预测单元。同样的,将目标参考图像送入到编码器,根据第一编码单元划分深度对目标参考图像进行编码单元深度划分,得到K个参考编码单元;根据第一预测单元划分深度对K个参考编码单元分别进行预测单元深度划分,得到K×L个参考预测单元;参考编码单元与第一编码单元具有对应关系,参考预测单元与第一预测单元具有对应关系。
根据第一最大像素范围及第一亚像素估计复杂度,对K×L个第一预测单元及K×L个参考预测单元进行运动估计处理(帧间预测),生成由K×L个第一运动估计单元组成的第
一运动估计图像。将第一运动估计图像(包括K×L个第一运动估计单元)与目标视频帧图像(包括K×L个第一预测单元)进行相减,得到残差图像。根据第一变换单元划分深度对残差图像进行变换单元划分,生成变换图像。对变换图像进行量化,生成残差系数。将残差系数输入熵编码模块进行熵编码,生成目标视频帧图像的编码值。编码值用于表示经过编码后的目标视频帧图像。
本申请实施例提供的方法,根据调整后得到的视频编码参数(包括第一编码单元划分深度、第一预测单元划分深度、第一变换单元划分深度、第一最大像素范围、第一亚像素估计复杂度及第一搜索范围)对第i个视频帧序列中的每个视频帧图像进行编码,得到每个视频帧图像对应的编码值,实现了对第i个视频帧序列的视频编码过程,基于此对目标视频中各个视频帧序列分别进行视频编码,得到目标视频对应的编码视频,在对编码视频进行传输时,可以降低存储空间,提高传输效率,并且在对编码视频进行解码时,提高出帧稳定性。
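In an optional embodiment of the video processing method provided by the embodiment corresponding to Fig. 12 of this application, referring to Fig. 14, sub-steps S1621 to S1624 follow sub-step S1618. Specifically: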
S1621、对残差系数进行反量化及反变换,生成重构图像残差系数。
S1622、通过重构图像残差系数及目标帧间预测图像,生成重构图像。
S1623、通过去块滤波器对重构图像进行处理,生成第一滤波图像。
其中,去块滤波器用于对重构图像中的垂直边缘进行水平滤波,以及对重构图像中的水平边缘进行垂直滤波。
S1624、通过采样自适应偏移滤波器对第一滤波图像进行处理,生成目标视频帧图像对应的参考图像。
其中,参考图像用于对目标视频帧图像的下一帧图像进行编码,采样自适应偏移滤波器用于对第一滤波图像进行带偏移和边缘偏移。
可以理解的是,对目标视频帧图像进行编码时的参考图像为目标视频帧图像的前一视频帧图像经过编码后得到的图像,对目标视频帧图像的后一帧视频帧图像进行编码时的参考图像为目标视频帧图像经过编码后得到的图像。请参阅图15,图15为本申请实施例提供的编码框架的示意图。残差系数经过反量化及反变换后,生成重构图像残差系数。将重构图像残差系数与目标帧间预测图像相加,得到重构图像。重构图像依次经过去块滤波器及采样自适应偏移滤波器(环内滤波)后,生成目标视频帧图像对应的参考图像。目标视频帧图像对应的参考图像进入参考帧队列,作为下一帧的参考图像,从而依次向后编码。
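The residual, quantization, and reconstruction steps of this loop can be illustrated numerically. The sketch below is deliberately reduced: it omits the transform, entropy coding, and both in-loop filters, and it uses a single uniform quantization step. The names `quantize_residual` and `reconstruct` and the step size are assumptions for illustration only.

```python
def quantize_residual(target, prediction, qstep):
    """Residual (target minus prediction), quantized with a uniform step."""
    return [[round((t - p) / qstep) for t, p in zip(t_row, p_row)]
            for t_row, p_row in zip(target, prediction)]


def reconstruct(levels, prediction, qstep):
    """Dequantize the residual and add it back to the prediction."""
    return [[p + lv * qstep for lv, p in zip(l_row, p_row)]
            for l_row, p_row in zip(levels, prediction)]


target = [[100, 104], [99, 97]]        # block of the target frame
prediction = [[96, 96], [96, 96]]      # inter-predicted block
levels = quantize_residual(target, prediction, qstep=4)
recon = reconstruct(levels, prediction, qstep=4)
print(levels)
print(recon)
```

The reconstructed block differs slightly from the target because quantization is lossy. This is why the encoder builds its reference image from the reconstruction rather than from the original frame: the decoder only ever sees the reconstruction.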
对目标视频帧图像进行编码时,还包括对目标视频帧图像进行帧内预测的过程。根据目标视频帧图像及重构图像进行帧内预测选择,生成帧内预测选择图像;根据帧内预测选择图像及重构图像进行帧内预测,得到帧内预测图像。
本申请实施例提供的方法,将根据目标视频帧图像生成的参考图像作为下一帧的编码过程中的参考图像,完善了对第i个视频帧序列的视频编码过程,提高出帧稳定性。
在本申请的图2对应的实施例提供的视频处理方法的一个可选实施例中,视频编码参数包括处理取消消息;步骤S150进一步包括子步骤S1551。具体的:
S1551、在第一算力大于第一算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级的情况下,根据处理取消消息,取消对第i个视频帧序列的去噪处理、锐化处理及时域滤波处理中的一种或多种。
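It can be understood that video encoding also involves other processing, including pre-processing and post-processing. Pre-processing includes denoising, sharpening, temporal filtering, and the like. Post-processing includes in-loop filtering, film grain (AV1 Film Grain), and the like; in-loop filtering includes deblocking filtering (Deblocking, DB), adaptive loop filtering (Adaptive Loop Filter, ALF), sample adaptive offset (Sample Adaptive Offset, SAO), and the like. All of this processing consumes server computing power.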
当满足条件一或条件二之一时,需要降低对第i个视频帧序列进行视频编码时的算力。条件一:第一算力大于第一算力阈值。条件二:第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级。通过减少前处理过程和/或后处理过程,实现降低对第i个视频帧序列进行视频编码时的算力。具体可以通过取消去噪处理、锐化处理及时域滤波处理中的一种或多种实现。
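With the method provided by the embodiments of this application, when the current video frame sequence is encoded, the processing applied to it is adaptively adjusted according to the server computing power consumed by the previous video frame sequence, the first image attribute of the first video frame image in the current video frame sequence, and the second image attribute of the second video frame image in the previous video frame sequence. The adjusted processing then meets the encoding requirements of the current video frame sequence, which improves frame output stability and reduces server deployment cost.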
在本申请的图2对应的实施例提供的视频处理方法的一个可选实施例中,请参阅图16,步骤S120之后还包括步骤S121至步骤S123。具体的:
S121、通过画面场景分类模型,根据第一视频帧图像和第二视频帧图像,确定第一视频帧图像的第一场景复杂度信息及第二视频帧图像的第二场景复杂度信息。
具体的,可以分别将第一视频帧图像和第二视频帧图像输入画面场景分类模型,该画面场景分类模型通过对输入的第一视频帧图像和第二视频帧图像进行分析处理,将输出该第一视频帧图像的第一场景复杂度信息及该第二视频帧图像的第二场景复杂度信息。
可以理解的是,画面场景分类模型为基于深度学习和神经网络的特征提取分类模型,对画面场景进行识别及分类,将画面场景分为简单场景、一般场景、中等场景及复杂场景。模型中的算法包括CNN、决策树,随机森林等。
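Taking deep learning as an example, video frame images are recognized with deep learning frameworks such as CNN or CNN+RNN, and video frame images serve as the training samples for the deep learning model. With enough training samples, for scenes whose picture features are distinctive (for example game, football, basketball, and animation scenes), a pure CNN model (convolutional layers, filtering, pooling layers, and so on) can reach a scene recognition accuracy above 99%. For scenes whose picture features are more scattered (for example TV drama, outdoor sports, food, and travel scenes), CNN combined with RNN+LSTM for time-domain plus frequency-domain analysis can reach a scene recognition accuracy of about 90%.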
S122、通过画面纹理分类模型,根据第一视频帧图像和第二视频帧图像,确定第一视频帧图像的第一纹理复杂度信息及第二视频帧图像的第二纹理复杂度信息。
具体的,可以分别将第一视频帧图像和第二视频帧图像输入画面纹理分类模型,该画面纹理分类模型通过对输入的第一视频帧图像和第二视频帧图像进行分析处理,将输出该第一视频帧图像的第一纹理复杂度信息及该第二视频帧图像的第二纹理复杂度信息。
可以理解的是,纹理复杂度信息包括简单纹理、一般纹理、中等纹理及复杂纹理。纹理复杂度信息分析方法有:欧几里得距离、统计直方图、LBP检测算法以及CNN特征提取分类等算法,基于边缘特征的画面复杂度估计方法,常用的算法有canny、sobel、robert等边缘检测算子算法等,本申请在此不做限制。
以LBP检测算法为例:LBP指局部二值模式,是一种用来描述图像局部特征的算子,LBP特征具有灰度不变性和旋转不变性等显著优点。原始的LBP算子定义为在3×3的窗口内,以窗口中心像素为阈值,将相邻的8个像素的灰度值与其进行比较,若周围像素值大于中心像素值,则该像素点的位置被标记为1,否则为0。这样,3×3邻域内的8个像素经比较可产生8位二进制数(通常转换为十进制数即LBP码,共256种),即得到该窗口中心像素点的LBP值,并用这个值来反映该区域的纹理信息,检测时可以根据视频画面分辨率和算力大小自动调整检测区域,或通过下采样调整画面分辨率大小。综合画面检测的区域LBP值汇总计算,如LBP值80%都集中50以下的为简单纹理,50-100的为一般纹理,100-150的为中等纹理,大于150的为复杂纹理。
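The LBP computation and the threshold-based texture grading described above can be sketched in a few lines. Two points are assumptions of this sketch rather than statements of the text: the clockwise bit order starting at the top-left neighbour, and the reading of the 80% rule as "at least 80% of the LBP values fall below the band's upper bound". The names `lbp_code` and `texture_class` are hypothetical.

```python
def lbp_code(window):
    """LBP code (0..255) of the centre pixel of a 3x3 window."""
    center = window[1][1]
    # Neighbours visited clockwise from the top-left corner (assumed order).
    coords = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for r, c in coords:
        # A neighbour strictly brighter than the centre contributes a 1 bit.
        code = (code << 1) | (1 if window[r][c] > center else 0)
    return code


def texture_class(lbp_values, share=0.8):
    """Grade texture from the LBP values of the inspected region."""
    if not lbp_values:
        raise ValueError("no LBP values supplied")
    n = len(lbp_values)
    for label, upper in (("simple", 50), ("general", 100), ("medium", 150)):
        if sum(v < upper for v in lbp_values) / n >= share:
            return label
    return "complex"


print(lbp_code([[9, 1, 9], [1, 5, 1], [9, 1, 9]]))
print(texture_class([10] * 9 + [200]))
```

In practice the values passed to `texture_class` would come from running `lbp_code` over every interior pixel of the inspected region, optionally after downsampling the frame as the text suggests.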
S123、根据第一场景复杂度信息及第一纹理复杂度信息生成第一图像属性,根据第二场景复杂度信息及第二纹理复杂度信息生成第二图像属性。
可以理解的是,将场景复杂度信息及纹理复杂度信息作为图像属性,通过图像属性表征视频帧图像的复杂情况,以匹配编码参数。
本申请实施例提供的方法,通过使用画面场景分类模型确定视频帧图像的场景复杂度信息,使用画面纹理分类模型确定视频帧图像的纹理复杂度信息,进而,根据所确定的场景复杂度信息和纹理复杂度信息,确定图像属性,由此保证所确定的图像属性的准确性和可靠性,进而有利于后续根据该图像属性来准确地调整视频编码参数。
在本申请的图2对应的实施例提供的视频处理方法的一个可选实施例中,请参阅图17,步骤S160之后还包括步骤S161至步骤S165。具体的:
S161、计算对第i个视频帧序列进行编码时消耗的算力,得到第二算力。
S162、从N个视频帧序列中,获取第i+1个视频帧序列。
其中,第i个视频帧序列与第i+1个视频帧序列在目标视频中相邻。
S163、从第i+1个视频帧序列中获取第三视频帧图像。
其中,第三视频帧图像对应第三图像属性。
S164、根据第二算力、第一图像属性及第三图像属性中的至少一项,确定第i+1个视频帧序列的编码参数。
S165、根据第i+1个视频帧序列的视频编码参数对第i+1个视频帧序列进行编码,得到第i+1个编码视频段。
可以理解的是,上述步骤S161至步骤S165为对第i+1个视频帧序列进行编码的过程。在完成第i个视频帧序列的编码后,计算出对第i个视频帧序列进行编码消耗的算力。获
取第i+1个视频帧序列中的第三视频帧图像,第三视频帧图像可以为第i+1个视频帧序列中的IDR帧,第三视频帧图像对应第三图像属性,第三图像属性用于表征该第三视频帧图像的纹理复杂度信息及场景复杂度信息。基于对第i个视频帧序列进行视频编码时消耗的第二算力、第一图像属性及第三图像属性,在第i个视频帧序列的视频编码参数基础上调整确定第i+1个视频帧序列的视频编码参数。根据第i+1个视频帧序列的视频编码参数对第i+1个视频帧序列进行编码具体是指,根据第i+1个视频帧序列的视频编码参数对第i+1个视频帧序列中的全部视频帧图像进行编码,得到每个视频帧图像的编码图像,由全部的编码图像构成第i+1个编码视频段。
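With the video processing method provided by the embodiments of this application, when an input video is video-encoded: if the input video contains a single, fixed scene, video coding parameters are matched to that scene and the input video is encoded with the matched parameters, so that the coding parameters meet the encoding requirements of the current video frame sequence, which improves frame output stability and reduces server deployment cost. If the input video contains multiple scenes, the input video is split according to scene changes into N video frame sequences, and the task of encoding the input video is decomposed into encoding each of the N video frame sequences. When the current video frame sequence is encoded, its video coding parameters are adaptively adjusted according to at least one of the server computing power consumed in encoding the previous video frame sequence, the first image attribute of the first video frame image in the current video frame sequence, and the second image attribute of the second video frame image in the previous video frame sequence, so that the adjusted coding parameters meet the encoding requirements of the current video frame sequence, which improves frame output stability and reduces server deployment cost.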
在本申请的图2对应的实施例提供的视频处理方法的一个可选实施例中,请参阅图18,步骤S110包括子步骤S1101至子步骤S1103。具体的:
S1101、获取输入视频。
S1102、通过场景识别模型,对输入视频进行场景识别,得到N个场景。
其中,场景识别模型用于对输入视频中出现的场景进行识别。
S1103、根据N个场景对输入视频进行分割,得到N个视频片段。
可以理解的是,通过训练好的场景识别模型,识别出输入视频中的N个场景,以场景切换作为视频分割点对输入视频进行切分,得到N个子视频段,将子视频段以视频帧序列的方式进行表示,每个视频帧序列包括该子视频段中的全部帧图像。场景识别模型可以是基于深度学习和神经网络的特征提取分类模型,常用的算法有CNN、决策树,随机森林等。以深度学习为例,利用深度学习CNN、CNN+RNN等算法框架对视频帧生成图片,用视频帧生成的图片做深度学习相关模型时的训练样本。
举例说明,输入视频中包含3个场景:会议场景、剧场场景及游泳场景,其中,会议场景的场景复杂度为简单,剧场场景的场景复杂度为中等,游泳场景的场景复杂度为复杂。通过场景识别模型对输入视频进行分割的方式为:将输入视频作为训练好的场景识别模型的输入,场景识别模型识别出该输入视频包含3个场景,输出3个场景对应的3个子视频段,并将每个子视频段以视频帧序列的方式进行表示。
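Once a scene label is available for each frame, splitting at scene switches is a matter of grouping consecutive frames with the same label. The sketch below assumes the scene recognition model has already produced one label per frame; that per-frame output format and the name `split_by_scene` are hypothetical.

```python
from itertools import groupby


def split_by_scene(frame_labels):
    """Group consecutive frames that share a scene label.

    Returns a list of (label, start_index, end_index_exclusive) tuples,
    one per video frame sequence.
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        segments.append((label, start, start + length))
        start += length
    return segments


labels = ["meeting"] * 3 + ["theater"] * 2 + ["swimming"] * 4
print(split_by_scene(labels))
```

A scene that reappears later in the video yields a separate segment, which matches the idea that each video frame sequence is a contiguous run of frames.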
本申请实施例提供的视频处理方法,通过场景识别模型识别输入视频中包括的场景,并根据输入视频中包括的场景对该输入视频进行分割,得到N个视频帧序列,如此保证视
频帧序列分割的合理性,每个视频帧序列中包括的视频帧图像均对应相同的场景,相应地对每个视频帧序列进行视频编码时消耗的算力不会出现大幅波动,即保证每个视频帧序列对应的视频编码参数可以较好地适用于编码该视频帧序列中各视频帧图像。
为了便于理解,下面将结合图19介绍一种视频处理方法。图19为本申请实施例提供的视频处理方法流程示意图。
首先,对输入视频进行编码时,根据输入视频中的场景变化,将输入视频进行切分,得到N个视频帧序列(GOP),以GOP为最小粒度,调整每个GOP的编码参数。
其次,从N个视频帧序列中,获取第i个视频帧序列及第i-1个视频帧序列。其中,第i个视频帧序列与第i-1个视频帧序列在输入视频中相邻。
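Third, a first video frame image is obtained from the i-th video frame sequence and a second video frame image is obtained from the (i-1)-th video frame sequence. The first video frame image corresponds to a first image attribute and the second video frame image corresponds to a second image attribute; an image attribute characterizes the texture complexity information and scene complexity information of a video frame image. The first video frame image may be the IDR frame of the i-th video frame sequence, and the second video frame image may be the IDR frame of the (i-1)-th video frame sequence.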
接着,获取第i-1个视频帧序列对应的第一算力。其中,第一算力用于表征对第i-1个视频帧序列进行视频编码时消耗的服务器算力。
然后,基于第i-1个视频帧序列对应的第一算力、第一图像属性及第二图像属性,在第i-1个视频帧序列的视频编码参数基础上,调整第i个视频帧序列的视频编码参数:
若第一算力大于第一算力阈值,则下调第i个视频帧序列的视频编码参数。第一算力阈值是指服务器为视频编码提供的一种算力阈值。例如,服务器为视频编码提供的第一算力阈值为1000,而第i-1个视频帧序列对应的第一算力超过1000,则在对第i个视频帧序列进行视频编码时需要降低算力,为降低第i个视频帧序列进行视频编码时的算力,则需要下调第i个视频帧序列的视频编码参数。
若第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级,则下调第i个视频帧序列的视频编码参数,其中,第一算力阈值大于第二算力阈值。例如,服务器为视频编码提供的第一算力阈值为1000,第二算力阈值为800,第i-1个视频帧序列的视频编码的第一算力为900,第一算力大于第二算力阈值且小于第一算力阈值。由于属性等级越大(纹理复杂度越高和/或场景复杂度越高),需要的算力越高。第一图像属性的属性等级大于第二图像属性的属性等级表示,对第i个视频帧序列进行视频编码时需要的算力会大于对第i-1个视频帧序列进行视频编码时的算力,而需要降低对第i个视频帧序列进行视频编码时的算力,为降低第i个视频帧序列进行视频编码时的算力,则需要下调第i个视频帧序列的视频编码参数。
若第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级等于第二图像属性的属性等级,则保持第i个视频帧序列的视频编码参数与第i-1个视频帧序列的视频编码参数相同。例如,服务器为视频编码提供的第一算力阈值为1000,第二算力阈值为800,第i-1个视频帧序列的视频编码的第一算力为900,第一算力大于第二算力
阈值且小于第一算力阈值。第一图像属性的属性等级等于第二图像属性的属性等级表示对第i个视频帧序列进行视频编码时需要的算力会等于或接近对第i-1个视频帧序列进行视频编码时的算力,而无需调整第i个视频帧序列的视频编码参数。
若第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级小于第二图像属性的属性等级,则上调第i个视频帧序列的视频编码参数。例如,服务器为视频编码提供的第一算力阈值为1000,第二算力阈值为800,第i-1个视频帧序列的视频编码的第一算力为900,第一算力大于第二算力阈值且小于第一算力阈值。由于属性等级越大(纹理复杂度越高和/或场景复杂度越高),需要的算力越高。第一图像属性的属性等级小于第二图像属性的属性等级,表示对第i个视频帧序列进行视频编码时需要的算力会小于对第i-1个视频帧序列进行视频编码时的算力,为提高编码质量,可以上调第i个视频帧序列的视频编码参数。
若第一算力小于第二算力阈值,则上调第i个视频帧序列的视频编码参数。例如,服务器为视频编码提供的第二算力阈值为800,第i-1个视频帧序列的视频编码的第一算力为700,第一算力小于第二算力阈值。为提高编码质量,可以上调第i个视频帧序列的视频编码参数。
最后,根据第i个视频帧序列的视频编码参数,对第i个视频帧序列中的全部视频帧图像进行编码,得到每个视频帧图像的编码图像,由全部的编码图像构成第i个编码视频段。
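Taken together, the flow of Fig. 19 is a feedback loop: the computing power measured for one video frame sequence steers the parameters of the next. The sketch below shows only that loop. It collapses all coding parameters into a single hypothetical `preset` index (higher means more computing power), takes the encoder as a caller-supplied `encode(sequence, preset)` function that returns the computing power consumed, and reuses the example thresholds 1000 and 800 from the text. None of these names come from this application.

```python
def encode_video(sequences, levels, encode,
                 upper=1000, lower=800, preset=2, max_preset=4):
    """Encode sequences in order, adapting the preset from measured power.

    sequences  list of video frame sequences (GOPs)
    levels     levels[i] is the attribute level of sequence i
    encode     hypothetical callable: encode(sequence, preset) -> power used
    Returns the preset chosen for each sequence.
    """
    chosen = []
    power = None
    for i, seq in enumerate(sequences):
        if i > 0:
            between = lower < power < upper
            if power > upper or (between and levels[i] > levels[i - 1]):
                preset = max(0, preset - 1)            # turn parameters down
            elif power < lower or (between and levels[i] < levels[i - 1]):
                preset = min(max_preset, preset + 1)   # turn parameters up
            # otherwise keep the previous preset
        power = encode(seq, preset)   # measured power feeds the next round
        chosen.append(preset)
    return chosen


# Toy run: each "sequence" is just the power its encoding would consume.
print(encode_video([1100, 900, 900, 700], [1, 1, 2, 1], lambda seq, p: seq))
```

The first sequence has no predecessor, so it is encoded with the initial preset. How that initial value is chosen is outside the scope of this sketch.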
调整第i个视频帧序列的视频编码参数的顺序为:
1)调整编码单元划分深度。
获取第i-1个视频帧序列的第二编码单元划分深度。
在第一算力大于第一算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级的情况下,根据第二编码单元划分深度,调整第i个视频帧序列的第一编码单元划分深度。其中,第一编码单元划分深度低于第二编码单元划分深度。
在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级等于第二图像属性的属性等级的情况下,保持第i个视频帧序列的第一编码单元划分深度与第i-1个视频帧序列的第二编码单元划分深度相等。
在第一算力小于第二算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级小于第二图像属性的属性等级的情况下,根据第二编码单元划分深度,调整第i个视频帧序列的第一编码单元划分深度。其中,第一编码单元划分深度高于第二编码单元划分深度。
2)调整运动估计参数及运动补偿参数。
获取第i-1个视频帧序列的第二运动估计参数及第二运动补偿参数。
在第一算力大于第一算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级的情况下,根据第二运动估计参数调整第i个视频帧序列的第一运动估计参数,根据第二运动补偿参数调整
第i个视频帧序列的第一运动补偿参数。其中,第一运动估计参数通过控制运动搜索的第一最大像素范围及第一亚像素估计复杂度确定,第二运动估计参数通过控制运动搜索的第二最大像素范围及第二亚像素估计复杂度确定;第一最大像素范围小于第二最大像素范围,第一亚像素估计复杂度小于第二亚像素估计复杂度。第一运动补偿参数通过第一搜索范围确定,第二运动补偿参数通过第二搜索范围确定;第一搜索范围小于第二搜索范围。
在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级等于第二图像属性的属性等级的情况下,保持第i个视频帧序列的第一运动估计参数与第i-1个视频帧序列的第二运动估计参数相等,以及保持第i个视频帧序列的第一运动补偿参数与第i-1个视频帧序列的第二运动补偿参数相等。其中,第一最大像素范围等于第二最大像素范围,第一亚像素估计复杂度等于第二亚像素估计复杂度。第一搜索范围等于第二搜索范围。
在第一算力小于第二算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级小于第二图像属性的属性等级的情况下,根据第二运动估计参数调整第i个视频帧序列的第一运动估计参数,以及根据第二运动补偿参数调整第i个视频帧序列的第一运动补偿参数。其中,第一最大像素范围大于第二最大像素范围,第一亚像素估计复杂度大于第二亚像素估计复杂度。第一搜索范围大于第二搜索范围。
3)调整变换单元划分深度。
获取第i-1个视频帧序列的第二变换单元划分深度。
在第一算力大于第一算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级的情况下,根据第二变换单元划分深度,调整第i个视频帧序列的第一变换单元划分深度,其中,第一变换单元划分深度低于第二变换单元划分深度。
在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级等于第二图像属性的属性等级的情况下,保持第i个视频帧序列的第一变换单元划分深度与第i-1个视频帧序列的第二变换单元划分深度相等。
在第一算力小于第二算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级小于第二图像属性的属性等级的情况下,根据第二变换单元划分深度,调整第i个视频帧序列的第一变换单元划分深度,其中,第一变换单元划分深度高于第二变换单元划分深度。
4)调整预测单元划分深度。
获取第i-1个视频帧序列的第二预测单元划分深度。
在第一算力大于第一算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级的情况下,根据第二预测单元划分深度,调整第i个视频帧序列的第一预测单元划分深度,其中,第一预测单元划分深度低于第二预测单元划分深度。
在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级等于
第二图像属性的属性等级的情况下,保持第i个视频帧序列的第一预测单元划分深度与第i-1个视频帧序列的第二预测单元划分深度相等。
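When the first computing power is less than the second computing power threshold, or when the first computing power is greater than the second computing power threshold and less than the first computing power threshold and the attribute level of the first image attribute is lower than the attribute level of the second image attribute, the first prediction unit partition depth of the i-th video frame sequence is adjusted according to the second prediction unit partition depth, where the first prediction unit partition depth is higher than the second prediction unit partition depth.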
5)调整视频编码中的其他过程。
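When the first computing power is greater than the first computing power threshold, or when the first computing power is greater than the second computing power threshold and less than the first computing power threshold and the attribute level of the first image attribute is higher than the attribute level of the second image attribute, one or more of denoising, sharpening, and temporal filtering are cancelled according to the processing cancellation message.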
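The solution provided by the embodiments of this application automatically analyzes the texture complexity and the scene classification of the video picture during the pre-analysis stage of video encoding. Based on scene-switch detection and on the evaluation of picture texture and scene classification, it adaptively trims the stages that consume the most computing power, such as coding unit partitioning, MC, ME, transform, pre-processing, and lookahead. It is therefore a solution for smoothing the computing power consumed by the encoder core: it keeps video encoding computing power balanced at the cost of some loss in video bd-rate. With that loss kept under control, server computing power is controlled relatively smoothly, server computing load is improved by 5 to 10 points, and the transcoding cost of video cloud media processing is reduced, which helps video cloud users cut media-processing and transcoding costs and improve efficiency.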
下面对本申请中的视频处理装置进行详细描述,请参阅图20。图20为本申请实施例中视频处理装置10的一个实施例示意图,视频处理装置10包括:
视频帧序列获取模块110,用于获取输入视频的N个视频帧序列。其中,每个视频帧序列包括至少一个视频帧图像,N为大于1的整数。
视频帧序列提取模块120,用于从N个视频帧序列中,获取第i个视频帧序列及第i-1个视频帧序列。其中,第i个视频帧序列与第i-1个视频帧序列在目标视频中相邻,i为大于1的整数。
视频帧图像获取模块130,用于从第i个视频帧序列中获取第一视频帧图像,从第i-1个视频帧序列中获取第二视频帧图像。其中,第一视频帧图像对应第一图像属性,第二视频帧图像对应第二图像属性。
算力获取模块140,用于获取第i-1个视频帧序列对应的第一算力。其中,第一算力用于表征对第i-1个视频帧序列进行编码和/或解码时消耗的算力。
编码参数确定模块150,用于根据第一算力、第一图像属性及第二图像属性中的至少一项,确定第i个视频帧序列的编码参数。
可选的,视频处理装置10还包括:
视频帧序列编码模块160,用于根据第i个视频帧序列的编码参数对第i个视频帧序列进行编码,得到第i个编码视频段。
本申请实施例提供的视频处理装置,对输入视频进行视频编码时,若输入视频中的场景唯一且固定,则根据该场景匹配视频编码参数,通过匹配的视频编码参数对该输入视频进行编码,使得该编码参数可以满足对当前的视频帧序列的编码需求,提高出帧稳定性,
且降低服务器部署成本。若输入视频中包括多个场景,则根据场景变化将输入视频进行切分得到N个视频帧序列,将对输入视频的编码任务分解为对输入视频包含的N个视频帧序列分别进行编码,对当前的视频帧序列进行编码时,根据编码前一个视频帧序列时消耗的服务器算力、当前的视频帧序列中第一视频帧图像的第一图像属性、以及前一个视频帧序列中第二视频帧图像的第二图像属性中的至少一项,适应性地调整当前的视频帧序列的编码参数,使得调整后的编码参数可以满足对当前的视频帧序列的编码需求,提高出帧稳定性,且降低服务器部署成本。
在本申请的图20对应的实施例提供的视频处理装置的一个可选实施例中,请参阅图21,编码参数包括编码单元划分深度。编码参数确定模块150包括编码单元划分深度调整子模块151,编码单元划分深度调整子模块151用于:
获取第i-1个视频帧序列的第二编码单元划分深度;
在第一算力大于第一算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级的情况下,根据第二编码单元划分深度,调整第i个视频帧序列的第一编码单元划分深度,至低于第二编码单元划分深度。
在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级等于第二图像属性的属性等级的情况下,保持第i个视频帧序列的第一编码单元划分深度与第i-1个视频帧序列的第二编码单元划分深度相等。
在第一算力小于第二算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级小于第二图像属性的属性等级的情况下,根据第二编码单元划分深度,调整第i个视频帧序列的第一编码单元划分深度,至高于第二编码单元划分深度。
在本申请的图20对应的实施例提供的视频处理装置的一个可选实施例中,请参阅图22,编码参数包括预测单元划分深度。编码参数确定模块150包括预测单元划分深度调整子模块152,预测单元划分深度调整子模块152用于:
获取第i-1个视频帧序列的第二预测单元划分深度。
在第一算力大于第一算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级的情况下,根据第二预测单元划分深度,调整第i个视频帧序列的第一预测单元划分深度,至低于第二预测单元划分深度。
在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级等于第二图像属性的属性等级的情况下,保持第i个视频帧序列的第一预测单元划分深度与第i-1个视频帧序列的第二预测单元划分深度相等。
在第一算力小于第二算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级小于第二图像属性的属性等级的情况下,根据第二预测单元划分深度,调整第i个视频帧序列的第一预测单元划分深度,至高于第二预测单元划分深度。
在本申请的图20对应的实施例提供的视频处理装置的一个可选实施例中,请参阅图23,编码参数包括运动估计参数及运动补偿参数。编码参数确定模块150包括运动估计参数及运动补偿参数调整子模块153,运动估计参数及运动补偿参数调整子模块153用于:
获取第i-1个视频帧序列的第二运动估计参数及第二运动补偿参数。
在第一算力大于第一算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级的情况下,根据第二运动估计参数调整第i个视频帧序列的第一运动估计参数,根据第二运动补偿参数调整第i个视频帧序列的第一运动补偿参数。
其中,第一运动估计参数通过控制运动搜索的第一最大像素范围及第一亚像素估计复杂度确定,第二运动估计参数通过控制运动搜索的第二最大像素范围及第二亚像素估计复杂度确定;第一最大像素范围小于第二最大像素范围,第一亚像素估计复杂度小于第二亚像素估计复杂度。第一运动补偿参数通过第一搜索范围确定,第二运动补偿参数通过第二搜索范围确定;第一搜索范围小于第二搜索范围。
在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级等于第二图像属性的属性等级的情况下,保持第i个视频帧序列的第一运动估计参数与第i-1个视频帧序列的第二运动估计参数相等,以及保持第i个视频帧序列的第一运动补偿参数与第i-1个视频帧序列的第二运动补偿参数相等。
其中,第一最大像素范围等于第二最大像素范围,第一亚像素估计复杂度等于第二亚像素估计复杂度。第一搜索范围等于第二搜索范围。
在第一算力小于第二算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级小于第二图像属性的属性等级的情况下,根据第二运动估计参数调整第i个视频帧序列的第一运动估计参数,以及根据第二运动补偿参数调整第i个视频帧序列的第一运动补偿参数。
其中,第一最大像素范围大于第二最大像素范围,第一亚像素估计复杂度大于第二亚像素估计复杂度。第一搜索范围大于第二搜索范围。
本申请的图20对应的实施例提供的视频处理装置的一个可选实施例中,请参阅图24,编码参数包括变换单元划分深度;编码参数确定模块150包括变换单元划分深度调整子模块154,变换单元划分深度调整子模块154用于:
获取第i-1个视频帧序列的第二变换单元划分深度。
在第一算力大于第一算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级的情况下,根据第二变换单元划分深度,调整第i个视频帧序列的第一变换单元划分深度,至低于第二变换单元划分深度。
在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级等于第二图像属性的属性等级的情况下,保持第i个视频帧序列的第一变换单元划分深度与第i-1个视频帧序列的第二变换单元划分深度相等。
在第一算力小于第二算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第
一算力阈值,且第一图像属性的属性等级小于第二图像属性的属性等级的情况下,根据第二变换单元划分深度,调整第i个视频帧序列的第一变换单元划分深度,至高于第二变换单元划分深度。
在本申请的图20对应的实施例提供的视频处理装置的一个可选实施例中,第i个视频帧序列的视频编码参数包括第一编码单元划分深度、第一预测单元划分深度、第一变换单元划分深度、第一最大像素范围、第一亚像素估计复杂度及第一搜索范围;视频帧序列编码模块160具体用于:
从第i个视频帧序列中,获取目标视频帧图像及目标视频帧图像的目标参考图像,其中,目标参考图像为目标视频帧图像的前一视频帧图像经过编码后得到的;
根据第一编码单元划分深度对目标视频帧图像进行编码单元深度划分,得到K个第一编码单元,其中,K为大于等于1的整数;
根据第一预测单元划分深度对K个第一编码单元进行预测单元深度划分,得到K×L个第一预测单元,其中,L为大于等于1的整数;
根据第一编码单元划分深度对目标参考图像进行编码单元深度划分,得到K个参考编码单元,其中,K个第一编码单元与K个参考编码单元具有对应关系;
根据第一预测单元划分深度对K个参考编码单元进行预测单元深度划分,得到K×L个参考预测单元,其中,K×L个第一预测单元与K×L个参考预测单元具有对应关系;
根据第一最大像素范围及第一亚像素估计复杂度,对K×L个第一预测单元及K×L个参考预测单元进行运动估计处理,生成K×L个第一运动估计单元;
根据第一搜索范围对K×L个第一运动估计单元及K×L个参考预测单元进行运动补偿处理,生成目标帧间预测图像;
根据目标视频帧图像及目标帧间预测图像,生成残差图像;
根据第一变换单元划分深度对残差图像进行变换单元划分,生成变换图像;
对变换图像进行量化,生成残差系数;
将残差系数进行熵编码,生成目标视频帧图像的编码值。
在本申请的图20对应的实施例提供的视频处理装置的一个可选实施例中,视频帧序列编码模块160还用于:
对残差系数进行反量化及反变换,生成重构图像残差系数;
通过重构图像残差系数及目标帧间预测图像,生成重构图像;
通过去块滤波器对重构图像进行处理,生成第一滤波图像,其中,去块滤波器用于对重构图像中的垂直边缘进行水平滤波,以及对重构图像中的水平边缘进行垂直滤波;
通过采样自适应偏移滤波器对第一滤波图像进行处理,生成目标视频帧图像对应的参考图像,其中,参考图像用于对目标视频帧图像的下一帧图像进行编码,采样自适应偏移滤波器用于对第一滤波图像进行带偏移和边缘偏移。
本申请实施例提供的装置,根据目标视频帧图像生成的参考图像作为下一帧的编码过程中的参考图像,完善了对第i个视频帧序列的视频编码过程,提高出帧稳定性。
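In an optional embodiment of the video processing apparatus provided by the embodiment corresponding to Fig. 20 of this application, referring to Fig. 25, the coding parameter determination module 150 includes a processing cancellation message sub-module 155, which is configured to: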
在第一算力大于第一算力阈值的情况下,或者在第一算力大于第二算力阈值且小于第一算力阈值,且第一图像属性的属性等级大于第二图像属性的属性等级的情况下,根据处理取消消息,取消对第i个视频帧序列的去噪处理、锐化处理及时域滤波处理中的一种或多种。
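In an optional embodiment of the video processing apparatus provided by the embodiment corresponding to Fig. 20 of this application, the video frame image acquisition module 130 is further configured to: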
通过画面场景分类模型,根据第一视频帧图像及第二视频帧图像,确定第一视频帧图像的第一场景复杂度信息及第二视频帧图像的第二场景复杂度信息;
通过画面纹理分类模型,根据第一视频帧图像及第二视频帧图像,确定第一视频帧图像的第一纹理复杂度信息及第二视频帧图像的第二纹理复杂度信息;
根据第一场景复杂度信息及第一纹理复杂度信息生成第一图像属性,根据第二场景复杂度信息及第二纹理复杂度信息生成第二图像属性。
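In an optional embodiment of the video processing apparatus provided by the embodiment corresponding to Fig. 20 of this application, the video frame sequence encoding module 160 is further configured to: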
计算对第i个视频帧序列进行编码时消耗的算力,得到第二算力;
从N个视频帧序列中,获取第i+1个视频帧序列,其中,第i个视频帧序列与第i+1个视频帧序列在目标视频中相邻;
从第i+1个视频帧序列中获取第三视频帧图像,其中,第三视频帧图像对应第三图像属性;
根据第二算力、第一图像属性及第三图像属性中的至少一项,确定第i+1个视频帧序列的编码参数。
在本申请的图20对应的实施例提供的视频处理装置的一个可选实施例中,视频帧序列获取模块110具体用于:
获取输入视频;
通过场景识别模型,对输入视频进行场景识别,得到N个场景,其中,场景识别模型用于对输入视频中出现的场景进行识别;
根据N个场景对输入视频进行分割,得到N个视频片段。
图26是本申请实施例提供的一种服务器结构示意图,该服务器300可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)322(例如,一个或一个以上处理器)和存储器332,一个或一个以上存储应用程序342或数据344的存储介质330(例如一个或一个以上海量存储设备)。其中,存储器332和存储介质330可以是短暂存储或持久存储。存储在存储介质330的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对服务器中的一系列指令操作。更进一步地,中央处理器322可以设置为与存储介质330通信,在服务器300上执行存储介质330中的一系列指令操作。
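The server 300 may further include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.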
上述实施例中由服务器所执行的步骤可以基于该图26所示的服务器结构。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
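The above embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.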
Claims (21)
- 一种视频处理方法,由计算机设备执行,包括:获取输入视频的N个视频帧序列,其中,每个所述视频帧序列包括至少一个视频帧图像,N为大于1的整数;从所述N个视频帧序列中,获取第i个视频帧序列及与其相邻的第i-1个视频帧序列,i为大于1的整数;从所述第i个视频帧序列中获取第一视频帧图像,从所述第i-1个视频帧序列中获取第二视频帧图像,其中,所述第一视频帧图像对应第一图像属性,所述第二视频帧图像对应第二图像属性;获取所述第i-1个视频帧序列对应的第一算力,其中,所述第一算力用于表征对所述第i-1个视频帧序列进行编码和/或解码时需要的算力;根据所述第一算力、所述第一图像属性及所述第二图像属性中的至少一项,确定所述第i个视频帧序列的编码参数。
- 如权利要求1所述的视频处理方法,所述编码参数包括编码单元划分深度;所述根据所述第一算力、所述第一图像属性及所述第二图像属性中的至少一项,确定所述第i个视频帧序列的编码参数,包括:获取所述第i-1个视频帧序列的第二编码单元划分深度;在所述第一算力大于第一算力阈值的情况下,或者在所述第一算力大于第二算力阈值且小于所述第一算力阈值,且所述第一图像属性的属性等级高于所述第二图像属性的属性等级的情况下,根据所述第二编码单元划分深度,调整所述第i个视频帧序列的第一编码单元划分深度,至低于所述第二编码单元划分深度。
- 如权利要求1或2所述的视频处理方法,所述编码参数包括预测单元划分深度;所述根据所述第一算力、所述第一图像属性及所述第二图像属性中的至少一项,确定所述第i个视频帧序列的编码参数,包括:获取所述第i-1个视频帧序列的第二预测单元划分深度;在所述第一算力大于第一算力阈值的情况下,或者在所述第一算力大于第二算力阈值且小于所述第一算力阈值,且所述第一图像属性的属性等级高于所述第二图像属性的属性等级的情况下,根据所述第二预测单元划分深度,调整所述第i个视频帧序列的第一预测单元划分深度,至低于所述第二预测单元划分深度。
- 如权利要求1至3任一项所述的视频处理方法,所述编码参数包括运动估计参数及运动补偿参数;所述根据所述第一算力、所述第一图像属性及所述第二图像属性中的至少一项,确定所述第i个视频帧序列的编码参数,包括:获取所述第i-1个视频帧序列的第二运动估计参数及第二运动补偿参数;在所述第一算力大于第一算力阈值的情况下,或者在所述第一算力大于第二算力阈值且小于所述第一算力阈值,且所述第一图像属性的属性等级高于所述第二图像属性的属性等级的情况下,根据所述第二运动估计参数调整所述第i个视频帧序列的第一运动估计参 数,根据所述第二运动补偿参数调整所述第i个视频帧序列的第一运动补偿参数;其中,所述第一运动估计参数通过控制运动搜索的第一最大像素范围及第一亚像素估计复杂度确定,所述第二运动估计参数通过控制运动搜索的第二最大像素范围及第二亚像素估计复杂度确定,所述第一最大像素范围小于所述第二最大像素范围,所述第一亚像素估计复杂度小于所述第二亚像素估计复杂度;所述第一运动补偿参数通过第一搜索范围确定,所述第二运动补偿参数通过第二搜索范围确定,所述第一搜索范围小于所述第二搜索范围。
- 如权利要求1至4任一项所述的视频处理方法,所述编码参数包括变换单元划分深度;所述根据所述第一算力、所述第一图像属性及所述第二图像属性中的至少一项,确定所述第i个视频帧序列的编码参数,包括:获取所述第i-1个视频帧序列的第二变换单元划分深度;在所述第一算力大于第一算力阈值的情况下,或者在所述第一算力大于第二算力阈值且小于所述第一算力阈值,且所述第一图像属性的属性等级大于所述第二图像属性的属性等级的情况下,根据所述第二变换单元划分深度,调整所述第i个视频帧序列的第一变换单元划分深度,至低于所述第二变换单元划分深度。
- 如权利要求1至5任一项所述的视频处理方法,所述根据所述第一算力、所述第一图像属性及所述第二图像属性中的至少一项,确定所述第i个视频帧序列的编码参数,包括:在所述第一算力大于第二算力阈值且小于第一算力阈值,且所述第一图像属性的属性等级等于所述第二图像属性的属性等级的情况下,保持所述第i个视频帧序列的编码参数与所述第i-1个视频帧序列的编码参数相同。
- 如权利要求1至6任一项所述的视频处理方法,所述编码参数包括编码单元划分深度;所述根据所述第一算力、所述第一图像属性及所述第二图像属性中的至少一项,确定所述第i个视频帧序列的编码参数,包括:获取所述第i-1个视频帧序列的第二编码单元划分深度;在所述第一算力小于第二算力阈值的情况下,或者在所述第一算力大于所述第二算力阈值且小于第一算力阈值,且所述第一图像属性的属性等级小于所述第二图像属性的属性等级的情况下,根据所述第二编码单元划分深度,调整所述第i个视频帧序列的第一编码单元划分深度,至高于所述第二编码单元划分深度。
- 如权利要求1至7任一项所述的视频处理方法,所述编码参数包括预测单元划分深度;所述根据所述第一算力、所述第一图像属性及所述第二图像属性中的至少一项,确定所述第i个视频帧序列的编码参数,包括:获取所述第i-1个视频帧序列的第二预测单元划分深度;在所述第一算力小于第二算力阈值的情况下,或者在所述第一算力大于所述第二算力 阈值且小于第一算力阈值,且所述第一图像属性的属性等级小于所述第二图像属性的属性等级的情况下,根据所述第二预测单元划分深度,调整所述第i个视频帧序列的第一预测单元划分深度,至高于所述第二预测单元划分深度。
- 如权利要求1至8任一项所述的视频处理方法,所述编码参数包括运动估计参数及运动补偿参数;所述根据所述第一算力、所述第一图像属性及所述第二图像属性中的至少一项,确定所述第i个视频帧序列的编码参数,包括:获取所述第i-1个视频帧序列的第二运动估计参数及第二运动补偿参数;在所述第一算力小于第二算力阈值的情况下,或者在所述第一算力大于所述第二算力阈值且小于第一算力阈值,且所述第一图像属性的属性等级小于所述第二图像属性的属性等级的情况下,根据所述第二运动估计参数调整所述第i个视频帧序列的第一运动估计参数,以及根据所述第二运动补偿参数调整所述第i个视频帧序列的第一运动补偿参数;其中,所述第一运动估计参数通过控制运动搜索的第一最大像素范围及第一亚像素估计复杂度确定,所述第二运动估计参数通过控制运动搜索的第二最大像素范围及第二亚像素估计复杂度确定,所述第一最大像素范围大于所述第二最大像素范围,所述第一亚像素估计复杂度大于所述第二亚像素估计复杂度;所述第一运动补偿参数通过第一搜索范围确定,所述第二运动补偿参数通过第二搜索范围确定,所述第一搜索范围大于所述第二搜索范围。
- 如权利要求1至9任一项所述的视频处理方法,所述编码参数包括变换单元划分深度;所述根据所述第一算力、所述第一图像属性及所述第二图像属性中的至少一项,确定所述第i个视频帧序列的编码参数,包括:获取所述第i-1个视频帧序列的第二变换单元划分深度;在所述第一算力小于第二算力阈值的情况下,或者在所述第一算力大于所述第二算力阈值且小于第一算力阈值,且所述第一图像属性的属性等级小于所述第二图像属性的属性等级的情况下,根据所述第二变换单元划分深度,调整所述第i个视频帧序列的第一变换单元划分深度,至高于所述第二变换单元划分深度。
- 如权利要求1至10任一项所述的视频处理方法,所述方法还包括:根据所述第i个视频帧序列的编码参数对所述第i个视频帧序列进行编码,得到第i个编码视频段。
- 如权利要求11所述的视频处理方法,所述第i个视频帧序列的编码参数包括第一编码单元划分深度、第一预测单元划分深度、第一变换单元划分深度、第一最大像素范围、第一亚像素估计复杂度及第一搜索范围;所述根据所述第i个视频帧序列的编码参数对所述第i个视频帧序列进行编码,包括:从所述第i个视频帧序列中,获取目标视频帧图像及所述目标视频帧图像的目标参考图像,其中,所述目标参考图像为所述目标视频帧图像的前一视频帧图像经过编码后得到的;根据所述第一编码单元划分深度对所述目标视频帧图像进行编码单元深度划分,得到K个第一编码单元,其中,K为大于等于1的整数;根据所述第一预测单元划分深度对K个所述第一编码单元进行预测单元深度划分,得到K×L个第一预测单元,其中,L为大于等于1的整数;根据所述第一编码单元划分深度对所述目标参考图像进行编码单元深度划分,得到K个参考编码单元,其中,所述K个第一编码单元与所述K个参考编码单元具有对应关系;根据所述第一预测单元划分深度对K个所述参考编码单元进行预测单元深度划分,得到K×L个参考预测单元,其中,所述K×L个第一预测单元与所述K×L个参考预测单元具有对应关系;根据所述第一最大像素范围及所述第一亚像素估计复杂度,对所述K×L个第一预测单元及所述K×L个参考预测单元进行运动估计处理,生成K×L个第一运动估计单元;根据所述第一搜索范围对所述K×L个第一运动估计单元及所述K×L个参考预测单元进行运动补偿处理,生成目标帧间预测图像;根据所述目标视频帧图像及所述目标帧间预测图像,生成残差图像;根据所述第一变换单元划分深度对所述残差图像进行变换单元划分,生成变换图像;对所述变换图像进行量化,生成残差系数;将所述残差系数进行熵编码,生成所述目标视频帧图像的编码值。
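- The video processing method according to claim 12, after the generating residual coefficients, further comprising: performing inverse quantization and inverse transformation on the residual coefficients to generate reconstructed image residual coefficients; generating a reconstructed image from the reconstructed image residual coefficients and the target inter-frame prediction image; processing the reconstructed image with a deblocking filter to generate a first filtered image, wherein the deblocking filter is used to horizontally filter vertical edges in the reconstructed image and to vertically filter horizontal edges in the reconstructed image; and processing the first filtered image with a sample adaptive offset filter to generate a reference image corresponding to the target video frame image, wherein the reference image is used to encode the frame image following the target video frame image, and the sample adaptive offset filter is used to apply band offset and edge offset to the first filtered image.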
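- The video processing method according to any one of claims 1 to 13, wherein the encoding parameters include a processing cancellation message; and the determining the encoding parameters of the i-th video frame sequence according to at least one of the first computing power, the first image attribute, and the second image attribute comprises: in a case where the first computing power is greater than a first computing power threshold, or in a case where the first computing power is greater than a second computing power threshold and less than the first computing power threshold and an attribute level of the first image attribute is greater than an attribute level of the second image attribute, canceling, according to the processing cancellation message, one or more of denoising, sharpening, and temporal filtering of the i-th video frame sequence.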
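- The video processing method according to any one of claims 1 to 14, after the obtaining a first video frame image from the i-th video frame sequence and obtaining a second video frame image from the (i-1)-th video frame sequence, further comprising: determining, through a picture scene classification model and according to the first video frame image and the second video frame image, first scene complexity information of the first video frame image and second scene complexity information of the second video frame image; determining, through a picture texture classification model and according to the first video frame image and the second video frame image, first texture complexity information of the first video frame image and second texture complexity information of the second video frame image; and generating the first image attribute according to the first scene complexity information and the first texture complexity information, and generating the second image attribute according to the second scene complexity information and the second texture complexity information.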
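- The video processing method according to any one of claims 11 to 15, after the obtaining an i-th encoded video segment, further comprising: calculating the computing power consumed in encoding the i-th video frame sequence to obtain a second computing power; obtaining an (i+1)-th video frame sequence from the N video frame sequences, wherein the i-th video frame sequence and the (i+1)-th video frame sequence are adjacent in the target video; obtaining a third video frame image from the (i+1)-th video frame sequence, wherein the third video frame image corresponds to a third image attribute; and determining encoding parameters of the (i+1)-th video frame sequence according to at least one of the second computing power, the first image attribute, and the third image attribute.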
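- The video processing method according to any one of claims 1 to 16, wherein the obtaining N video frame sequences of an input video comprises: obtaining the input video; performing scene recognition on the input video through a scene recognition model to obtain N scenes, wherein the scene recognition model is used to recognize scenes appearing in the input video; and segmenting the input video according to the N scenes to obtain the N video segments.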
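- A video processing apparatus, comprising: a video frame sequence obtaining module, configured to obtain N video frame sequences of an input video, wherein each video frame sequence includes at least one video frame image, and N is an integer greater than 1; a video frame sequence extraction module, configured to obtain, from the N video frame sequences, an i-th video frame sequence and an adjacent (i-1)-th video frame sequence, where i is an integer greater than 1; a video frame image obtaining module, configured to obtain a first video frame image from the i-th video frame sequence and a second video frame image from the (i-1)-th video frame sequence, wherein the first video frame image corresponds to a first image attribute and the second video frame image corresponds to a second image attribute; a computing power obtaining module, configured to obtain a first computing power corresponding to the (i-1)-th video frame sequence, wherein the first computing power represents the computing power consumed in encoding and/or decoding the (i-1)-th video frame sequence; and an encoding parameter determining module, configured to determine video encoding parameters of the i-th video frame sequence according to at least one of the first computing power, the first image attribute, and the second image attribute.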
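- A computer device, comprising: a memory, a transceiver, a processor, and a bus system; wherein the memory is configured to store a program; the processor is configured to execute the program in the memory, including performing the video processing method according to any one of claims 1 to 17; and the bus system is configured to connect the memory and the processor so that the memory and the processor communicate with each other.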
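- A computer-readable storage medium, comprising instructions that, when run on a computer, cause the computer to perform the video processing method according to any one of claims 1 to 17.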
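- A computer program product, comprising a computer program that, when executed by a processor, implements the video processing method according to any one of claims 1 to 17.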
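The threshold rule recited in the partition-depth and processing-cancellation claims above can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The names `EncodingParams`, `decide_params`, and `step`, the fixed increment of one depth level, and the choice to leave depths unchanged when pre-processing is cancelled are all assumptions made for illustration; the claims only require that the new depth be higher than the previous one. The sketch also assumes the first computing power threshold is the larger of the two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodingParams:
    """Hypothetical container for the per-sequence parameters named in the claims."""
    cu_depth: int                       # coding unit partition depth
    pu_depth: int                       # prediction unit partition depth
    tu_depth: int                       # transform unit partition depth
    cancel_preprocessing: bool = False  # stands in for the processing cancellation message


def decide_params(prev, first_compute, level_i, level_prev,
                  first_threshold, second_threshold, step=1):
    """Choose parameters for sequence i from the cost of sequence i-1.

    first_threshold is assumed to be larger than second_threshold.
    level_i and level_prev are the attribute levels of the first and
    second image attributes.
    """
    in_band = second_threshold < first_compute < first_threshold
    if first_compute < second_threshold or (in_band and level_i < level_prev):
        # Compute headroom: raise every partition depth above the previous one.
        return EncodingParams(cu_depth=prev.cu_depth + step,
                              pu_depth=prev.pu_depth + step,
                              tu_depth=prev.tu_depth + step,
                              cancel_preprocessing=False)
    if first_compute > first_threshold or (in_band and level_i > level_prev):
        # Compute pressure: cancel denoising, sharpening, or temporal filtering.
        # Leaving the depths unchanged here is an assumption of this sketch.
        return EncodingParams(cu_depth=prev.cu_depth,
                              pu_depth=prev.pu_depth,
                              tu_depth=prev.tu_depth,
                              cancel_preprocessing=True)
    # Neither condition holds: keep the previous depths.
    return EncodingParams(prev.cu_depth, prev.pu_depth, prev.tu_depth, False)
```

Feeding the computing power measured for each encoded sequence back into `decide_params` for the next sequence gives the closed loop described in the claim that derives the (i+1)-th parameters from the second computing power.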
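To make the role of the maximum pixel range in the motion estimation claim more concrete, the following sketch shows integer-pixel full-search block matching using the sum of absolute differences (SAD). This is a standard textbook technique chosen for illustration, not a method the claims specify; sub-pixel estimation is omitted, and the function names `sad` and `full_search` and the synthetic frames are hypothetical. The block is assumed to lie inside the reference frame at zero displacement.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block in `cur` and a displaced block in `ref`."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(size)
        for x in range(size)
    )


def full_search(cur, ref, bx, by, size, max_range):
    """Return ((dx, dy), cost) of the best match within +/- max_range pixels."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), sad(cur, ref, bx, by, 0, 0, size)
    for dy in range(-max_range, max_range + 1):
        for dx in range(-max_range, max_range + 1):
            x0, y0 = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if x0 < 0 or y0 < 0 or x0 + size > width or y0 + size > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


# Synthetic example: the current frame is the reference shifted right by 2 pixels.
ref = [[x * x + 10 * y for x in range(8)] for y in range(8)]
cur = [[ref[y][x - 2] if x >= 2 else 0 for x in range(8)] for y in range(8)]

narrow = full_search(cur, ref, bx=3, by=2, size=2, max_range=1)
wide = full_search(cur, ref, bx=3, by=2, size=2, max_range=2)
```

In this example the range-2 search recovers the displacement (-2, 0) with zero cost, while the range-1 search cannot reach it. The number of candidates grows quadratically with the range, which is why a larger range suits sequences encoded when computing power is available.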
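The residual, quantization, and reconstruction steps in the encoding claims can be sketched on a single row of samples. This is a simplified illustration under stated assumptions: the transform is treated as the identity, entropy coding and the deblocking and sample adaptive offset filters are left out, and `encode_block`, `reconstruct_block`, and the uniform quantization step `qstep` are hypothetical names not taken from the source.

```python
def encode_block(target, prediction, qstep):
    """Quantize the residual between target samples and their inter-frame prediction."""
    residual = [t - p for t, p in zip(target, prediction)]
    # Uniform scalar quantization; a real encoder would transform the residual first.
    return [round(r / qstep) for r in residual]


def reconstruct_block(coeffs, prediction, qstep):
    """Inverse-quantize the coefficients and add the prediction back, clipped to 8 bits."""
    return [max(0, min(255, p + c * qstep)) for p, c in zip(prediction, coeffs)]


target = [100, 104, 99, 120]
prediction = [100, 100, 100, 100]
coeffs = encode_block(target, prediction, qstep=4)
recon = reconstruct_block(coeffs, prediction, qstep=4)
```

The encoder keeps `recon`, not `target`, as the reference for the next frame, so that encoder and decoder predict from identical data; this is the reason the claims reconstruct the image after quantization.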
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP23878963.0A EP4447441A1 (en) | 2022-10-19 | 2023-10-08 | Video processing method and related device |
US18/652,316 US20240291995A1 (en) | 2022-10-19 | 2024-05-01 | Video processing method and related apparatus |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211282417.2 | 2022-10-19 | ||
CN202211282417.2A CN117956158A (zh) | 2022-10-19 | 2022-10-19 | Video processing method and related apparatus |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/652,316 Continuation US20240291995A1 (en) | 2022-10-19 | 2024-05-01 | Video processing method and related apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024082971A1 (zh) | 2024-04-25 |
Family
ID=90736858
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2023/123349 WO2024082971A1 (zh) | 2022-10-19 | 2023-10-08 | Video processing method and related apparatus |
Country Status (4)
Country | Link |
---|---|
US (1) | US20240291995A1 (zh) |
EP (1) | EP4447441A1 (zh) |
CN (1) | CN117956158A (zh) |
WO (1) | WO2024082971A1 (zh) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
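CN111669589A (zh) * | 2020-06-23 | 2020-09-15 | Tencent Technology (Shenzhen) Co., Ltd. | Image encoding method and apparatus, computer device, and storage medium |
CN111711859A (zh) * | 2020-06-28 | 2020-09-25 | Beijing QIYI Century Science & Technology Co., Ltd. | Video image processing method, system, and terminal device |
WO2021139418A1 (zh) * | 2020-01-09 | 2021-07-15 | Xi'an Wanxiang Electronics Technology Co., Ltd. | Image processing apparatus, remote device, and communication system |
WO2021196087A1 (zh) * | 2020-04-01 | 2021-10-07 | Huawei Technologies Co., Ltd. | Video enhancement method and apparatus |
CN114173137A (zh) * | 2020-09-10 | 2022-03-11 | Beijing Kingsoft Cloud Network Technology Co., Ltd. | Video encoding method and apparatus, and electronic device |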
- 2022-10-19: CN, application CN202211282417.2A, patent CN117956158A (zh), active, Pending
- 2023-10-08: WO, application PCT/CN2023/123349, patent WO2024082971A1 (zh), status unknown
- 2023-10-08: EP, application EP23878963.0A, patent EP4447441A1 (en), active, Pending
- 2024-05-01: US, application US18/652,316, patent US20240291995A1 (en), active, Pending
Also Published As
Publication number | Publication date |
---|---|
US20240291995A1 (en) | 2024-08-29 |
CN117956158A (zh) | 2024-04-30 |
EP4447441A1 (en) | 2024-10-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9402034B2 (en) | Adaptive auto exposure adjustment | |
KR101644208B1 (ko) | Video encoding using previously calculated motion information | |
US9071841B2 (en) | Video transcoding with dynamically modifiable spatial resolution | |
JP3133517B2 (ja) | Image region detection device and image encoding device using the image detection device | |
US10205953B2 (en) | Object detection informed encoding | |
US6097757A (en) | Real-time variable bit rate encoding of video sequence employing statistics | |
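TWI586177B (zh) | Scene-based adaptive bit rate control | |
JP2019501554A (ja) | Real-time video encoder rate control using dynamic resolution switching | |
JP2004248285A (ja) | Video encoder capable of differential encoding of a speaker's image during a video call, and video signal compression method using the same | |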
US10623744B2 (en) | Scene based rate control for video compression and video streaming | |
US20090074075A1 (en) | Efficient real-time rate control for video compression processes | |
US10277907B2 (en) | Rate-distortion optimizers and optimization techniques including joint optimization of multiple color components | |
CN113785573A (zh) | Encoder, decoder, and corresponding methods using an adaptive loop filter | |
US10812832B2 (en) | Efficient still image coding with video compression techniques | |
US9565404B2 (en) | Encoding techniques for banding reduction | |
US8731048B2 (en) | Efficient temporal search range control for video encoding processes | |
CA3039702A1 (en) | Systems and methods for compressing video | |
WO2024082971A1 (zh) | Video processing method and related apparatus | |
WO2021007702A1 (en) | Video encoding method, video decoding method, video encoding device, and video decoding device | |
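KR20130078569A (ko) | Screen content video encoding/decoding method and apparatus for region-of-interest-based picture quality improvement | |
CN117616751A (zh) | Video encoding and decoding of dynamic groups of pictures | |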
WO2024217464A1 (en) | Method, apparatus, and medium for video processing | |
WO2024217530A1 (en) | Method and apparatus for image encoding and decoding | |
WO2023130899A1 (zh) | Loop filtering method, video encoding and decoding method, apparatus, medium, and electronic device | |
US20240244229A1 (en) | Systems and methods for predictive coding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 23878963; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2023878963; Country of ref document: EP; Effective date: 20240708 |