
CN118414829A - System and method for feature-based rate-distortion optimization for object and event detection and for video coding - Google Patents

System and method for feature-based rate-distortion optimization for object and event detection and for video coding

Info

Publication number
CN118414829A
CN118414829A (application number CN202280084404.6A)
Authority
CN
China
Prior art keywords
video
picture
features
bitstream
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280084404.6A
Other languages
Chinese (zh)
Inventor
Velibor Adzic (菲力博·阿兹克)
Borivoje Furht (博里约夫·福尔特)
Hari Kalva (哈利·卡瓦)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Op Solutions Co
Original Assignee
Op Solutions Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Op Solutions Co
Priority claimed from PCT/US2022/047829 external-priority patent/WO2023081047A2/en
Publication of CN118414829A publication Critical patent/CN118414829A/en


Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A system and method for event and object detection and annotation in a video stream may include: extracting a plurality of features in a picture in a video frame, grouping at least a portion of the plurality of features into at least one object, determining a region for the at least one object, assigning an object identifier to the at least one object, and encoding the object identifier into a bitstream. Also disclosed are systems and methods for feature-based rate-distortion optimization for video coding, which include extracting a set of features from a picture in a video, generating a correlation map for the extracted features, determining a correlation score for a portion of the picture using the correlation map, and encoding the portion of the picture with a bit rate determined at least in part by the correlation score.

Description

System and method for feature-based rate-distortion optimization for object and event detection and for video coding
Cross Reference to Related Applications
The present application claims priority from U.S. provisional patent application No. 63/275,700, filed on November 4, 2021, entitled "Systems and Methods for Object and Event Detection and Annotation in Video Streams," and U.S. provisional patent application No. 63/275,740, filed on November 4, 2021, entitled "Systems and Methods for Feature-Based Rate-Distortion Optimization for Video Coding," the disclosures of both of which are incorporated herein by reference in their entirety.
Technical Field
The present invention relates generally to the field of video encoding and decoding. In particular, the present invention relates to systems and methods for feature-based rate-distortion optimization for object and event detection and for video coding.
Background
The video codec may include electronic circuitry or software that compresses or decompresses digital video. The video codec may convert uncompressed video into a compressed format and vice versa. In the context of video compression, a device that compresses video (and/or performs some function thereof) may be generally referred to as an encoder, while a device that decompresses video (and/or performs some function thereof) may be referred to as a decoder.
The format of the compressed data may conform to standard video compression specifications. Compression may be lossy because the compressed video lacks some of the information present in the original video. One consequence is that the decompressed video may have a lower quality than the original uncompressed video, as there is insufficient information to accurately reconstruct the original video.
There may be complex relationships between video quality, the amount of data used to represent the video (e.g., determined by bit rate), the complexity of the encoding and decoding algorithms, the susceptibility to data loss and errors, the ease of editing, random access, end-to-end delay (e.g., time delay), etc.
Motion compensation may include a method by which a video frame or portion thereof is predicted from a given reference frame (e.g., a previous frame and/or a future frame) by taking into account the motion of the camera and/or of objects in the video. Motion compensation may be employed in the encoding and decoding of video data for video compression, such as in encoding and decoding with the Moving Picture Experts Group (MPEG) Advanced Video Coding (AVC) standard (also known as H.264). Motion compensation may describe a picture in terms of a transformation of a reference picture to a current picture. The reference picture may be temporally previous when compared to the current picture or come from the future when compared to the current picture. Compression efficiency may be improved when pictures can be accurately synthesized from previously transmitted and/or stored pictures.
Recent trends in robotics, monitoring, surveillance, the Internet of Things, and similar fields have introduced use cases in which most of the pictures and videos recorded in the field are consumed only by machines and never reach the human eye. These machines process pictures and videos in order to accomplish tasks such as object detection, object tracking, segmentation, event detection, and the like. International standardization bodies recognize that this trend is ubiquitous and will accelerate in the future, and have therefore begun work on standardizing picture and video coding optimized primarily for machine consumption. For example, standards such as JPEG AI and Video Coding for Machines have been launched in addition to already established standards (e.g., Compact Descriptors for Visual Search and Compact Descriptors for Video Analysis).
Rate-distortion optimization may be used to improve video coding. As the name suggests, this process optimizes the trade-off between the amount of data needed to encode the video (rate) and the amount of distortion (video quality loss). The present disclosure relates in part to systems and methods for rate-distortion optimization applied in the context of video coding and hybrid video systems for machine consumption.
Disclosure of Invention
There is provided a method of encoding video, comprising: extracting a plurality of features in a picture in a video frame, grouping at least a portion of the plurality of features into at least one object, determining a region for the at least one object, assigning an object identifier to the at least one object, and encoding the object identifier into a bitstream. In some embodiments, a feature model is used to extract a plurality of features. The object identifier may include a region identifier and a tag for each object.
The region is preferably represented by a geometric representation. In some embodiments, the geometric representation is a bounding box or contour. When the geometric representation is a bounding box, the bounding box may be a rectangle identified by the coordinates of a particular corner and the width and height of the bounding box. Alternatively, the bounding box may be a rectangle identified by the coordinates of two diagonally opposite corners. If the geometric representation is a contour, the contour may be represented by a set of consecutive corners. For example, the first corner and the successive corners define the entire contour clockwise or counterclockwise. In each case, the bounding box or contour may be defined at the coding unit level, and the corners represent the corners of the coding units.
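For illustration only, a minimal sketch of these two region representations follows; the class and field names are assumptions made for exposition, not syntax defined by this disclosure:

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class BoundingBox:
    """Rectangle identified by one corner plus width and height: (x, y, w, h)."""
    x: int
    y: int
    w: int
    h: int

    @classmethod
    def from_opposite_corners(cls, x1: int, y1: int, x2: int, y2: int) -> "BoundingBox":
        # Alternative form: two diagonally opposite corners.
        return cls(min(x1, x2), min(y1, y2), abs(x2 - x1), abs(y2 - y1))


@dataclass
class Contour:
    """Closed contour given as consecutive corners (clockwise or counterclockwise);
    edges are implied between adjacent corners."""
    corners: List[Tuple[int, int]]


@dataclass
class DetectedObject:
    region: Union[BoundingBox, Contour]
    label: str        # e.g. "car" or "person"
    object_id: int    # object identifier encoded into the bitstream
```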
In some embodiments, objects may be further evaluated over a sequence of frames to determine events. An event identifier may be associated with the at least one object, and the object identifier and the event identifier may be encoded into the bitstream.
The object identifier and the event identifier may be inserted into the encoded bitstream. This information may be provided as supplemental enhancement information. Alternatively or additionally, the bitstream may include slice header information, and the slice header information may signal the presence of objects in the slice.
The video encoding method for identifying objects and events may further use the extracted features for rate-distortion optimization, including generating a correlation map for the extracted features, determining a correlation score for a portion of the picture using the correlation map, and encoding the portion of the picture using a bit rate determined at least in part by the correlation score.
A method of encoding video with rate-distortion optimization includes: extracting a set of features from a picture in the video, generating a relevance map for the extracted features, determining a relevance score for a portion of the picture using the relevance map, and encoding the portion of the picture with a bit rate determined at least in part by the relevance score. In some embodiments, the picture is represented by a plurality of coding units, and the relevance map is determined at the coding unit level, with each coding unit having a coding unit relevance score. The encoding operation preferably includes assigning a bit rate to each coding unit. In some cases, the relevance score may include a relative relevance score for each coding unit.
In some embodiments, the encoding operation includes at least one of intra-prediction, motion estimation, and transform quantization. In this case, the relative correlation score may be used in an explicit rate-distortion optimization mode to change encoding during at least one of intra prediction, motion estimation, and transform quantization processes. Alternatively, the relative correlation score may also be used in a rate-distortion function to determine an adjusted bit rate for each coding unit.
Video coding methods with rate distortion optimization may also use extracted features for object and event recognition. In some embodiments, this may include grouping at least a portion of the extracted features into at least one object, determining a region for the at least one object, assigning an object identifier to the at least one object, and encoding the object identifier into the bitstream.
An encoded video bitstream is also provided. The encoded bitstream includes encoded video content data having at least one object identified by an encoder for extracting a plurality of features of a picture in video content. The bitstream includes at least one object identifier and associated object annotation and at least one event identifier and associated event annotation. The bitstream may include Supplemental Enhancement Information (SEI) messages in which information related to at least one object and at least one event is signaled. Alternatively or additionally, the bitstream may include slice header information in which information related to at least one object and at least one event in the video slice is signaled.
These and other aspects and features of the non-limiting embodiments of the present invention will become apparent to those ordinarily skilled in the art upon review of the following description of the specific non-limiting embodiments of the invention in conjunction with the accompanying figures.
Drawings
For the purpose of illustrating the invention, the drawings show various aspects of one or more embodiments of the invention. It should be understood, however, that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, wherein:
FIG. 1 is a block diagram illustrating an exemplary embodiment of a hybrid video coding system;
FIG. 2 is a block diagram illustrating an exemplary embodiment of video encoding for a machine or hybrid system;
FIG. 3 is a block diagram further illustrating an exemplary embodiment of video encoding for a machine system;
fig. 4A-4C are schematic diagrams illustrating exemplary embodiments of an input picture, feature detection with bounding box, and object with detected outline;
Fig. 5A to 5C are schematic diagrams further illustrating examples of an input picture, feature detection with bounding box at the encoding unit level, and an object with a detected contour;
fig. 6A and 6B show examples of 8 x 8 pixel blocks containing features with contours and the resulting 8 x 8 correlation map, respectively;
FIG. 7 is a simplified flow diagram of a method of feature-based rate-distortion optimization according to the present disclosure;
FIG. 8 is a block diagram illustrating an exemplary embodiment of a machine learning module;
FIG. 9 is a schematic diagram illustrating an exemplary embodiment of a neural network;
fig. 10 is a schematic diagram illustrating an exemplary embodiment of a node of a neural network.
FIG. 11 is a block diagram illustrating an exemplary embodiment of a video decoder;
FIG. 12 is a block diagram illustrating an exemplary embodiment of a video encoder; and
FIG. 13 is a block diagram of a computing system that may be used to implement any one or more of the methods disclosed herein and any one or more portions thereof.
The drawings are not necessarily to scale and may be illustrated by broken lines, diagrammatic representations and fragmentary views. In certain instances, details that are not necessary for an understanding of the embodiments or that render other details difficult to perceive may have been omitted.
Detailed Description
In many applications (e.g., surveillance systems with multiple cameras, intelligent transportation, smart city applications, and/or intelligent industrial applications), conventional video coding may require that a large amount of video from the cameras be compressed and transmitted over a network to the machine and for human consumption. Algorithms for feature extraction (including object detection, event and action recognition, pose estimation, etc.) may then typically be applied at the machine site using convolutional neural networks or deep learning techniques. Fig. 1 shows an exemplary embodiment of a standard Versatile Video Coding (VVC) encoder applied to a machine. However, conventional approaches require a large number of video transmissions from multiple cameras, which can require a significant amount of time and makes efficient and fast real-time analysis and decisions difficult. In an embodiment, a machine video coding (VCM) method may solve this problem by coding the video and extracting some of the features at the transmitter site, and then sending the resulting coded bitstream to a VCM decoder. As used herein, the term "VCM" is not limited to the particular protocol involved, but more generally includes all systems for encoding and decoding video for machine consumption. At the decoder site, the video may be decoded for human vision and the features may be decoded for the machine. Systems that provide video for human vision and machine consumption are sometimes referred to as hybrid systems. The systems and methods disclosed herein are intended to apply to machine-based systems as well as hybrid systems.
Systems and methods are disclosed for rate-distortion optimization (RDO) for video coding based on features extracted from the input video. The method is applicable to any system that receives a video signal as input and that can perform feature extraction and video coding. Feature extraction may be any computer vision task, such as edge detection, line detection, or object detection, or newer techniques such as convolutional neural networks, provided the output of feature extraction can be spatially mapped back into the pixel space of the input video. Video coding may use any standard video encoder employing rate-distortion optimization and coding techniques such as partitioning, motion estimation, and transform/quantization, for example, Versatile Video Coding (VVC) or High Efficiency Video Coding (HEVC).
An embodiment of a system supporting the present method is the system for machine video coding shown in fig. 1 and 2 below.
FIG. 1 is a high-level block diagram of a system for encoding and decoding video in a hybrid system that includes consumption of video content by human viewers and machine consumption. The source video is received by a video encoder 105, which encoder 105 provides a compressed bit stream for transmission over a channel to a video decoder 110. The video encoder may encode video for human consumption and video for machine consumption. The video decoder 110 provides complementary processing on the compressed bitstream to extract video 115 for human vision and task analysis and feature extraction 120 for machine consumption.
Referring now to FIG. 2, an exemplary embodiment of an encoder for machine Video Coding (VCM) is illustrated. VCM encoder 200 may be implemented using any circuitry including, but not limited to, digital and/or analog circuitry; the VCM encoder 200 may be configured using a hardware configuration, a software configuration, a firmware configuration, and/or any combination thereof. VCM encoder 200 may be implemented as a computing device and/or a component of a computing device, which may include, but is not limited to, any computing device as described below. In one embodiment, VCM encoder 200 may be configured to receive input video 204 and generate output bitstream 208. The receipt of the input video 204 may be accomplished in any of the ways described below. The bitstream may include, but is not limited to, any of the bitstreams described below.
VCM encoder 200 may include, but is not limited to, a pre-processor 212, a video encoder 216, a feature extractor 220, an optimizer 224, a feature encoder 228, and/or a multiplexer 232. The preprocessor 212 may receive the input video 204 stream and parse out video, audio, and metadata substreams of the stream. The preprocessor 212 may include and/or be in communication with a decoder, as described in further detail below; in other words, the preprocessor 212 may have the ability to decode an input stream. In a non-limiting example, this may allow for decoding of the input video 204, which may facilitate downstream pixel domain analysis.
With further reference to fig. 2, VCM encoder 200 may operate in a hybrid mode and/or a video mode. When in hybrid mode, VCM encoder 200 may be configured to encode a visual signal for a human consumer and to encode a feature signal for a machine consumer; a machine consumer may include, but is not limited to, any device and/or component, including, but not limited to, a computing device described in further detail below. The input signal may be passed through the pre-processor 212, for example, in hybrid mode.
Still referring to fig. 2, video encoder 216 may include, but is not limited to, any video encoder, as described in further detail below. When VCM encoder 200 is in hybrid mode, VCM encoder 200 may send unmodified input video 204 to video encoder 216 and send copies of the same input video 204 and/or input video 204 that has been modified in some way to feature extractor 220. For example, but not limited to, the input video 204 may be resized to a smaller resolution, a certain number of pictures in the sequence of pictures in the input video 204 may be discarded (thereby reducing the frame rate of the input video 204), color information may be modified (for example, but not limited to, by converting RGB video to grayscale video), and so on.
Still referring to fig. 2, the video encoder 216 and the feature extractor 220 are connected and can exchange useful information in both directions. For example, but not limited to, video encoder 216 may communicate motion estimation information to feature extractor 220 and vice versa. The video encoder 216 may provide the quantization map and/or its descriptive data to the feature extractor 220 based on a region of interest (ROI), or vice versa, wherein the video encoder 216 and/or the feature extractor 220 may identify the region of interest. The video encoder 216 may provide data describing one or more partitioning decisions to the feature extractor 220 based on features present and/or identified in the input video 204, the input signal, and/or any frames and/or subframes thereof; feature extractor 220 may provide data describing one or more partitioning decisions to video encoder 216 based on features present and/or identified in input video 204, the input signal, and/or any frames and/or subframes thereof. The video encoder 216 and feature extractor 220 may share and/or transmit temporal information to each other for optimal group of pictures (GOP) decisions. Each of these techniques and/or processes may be performed without limitation, as described in further detail below.
With continued reference to fig. 2, the feature extractor 220 may operate in either an offline mode or an online mode. Feature extractor 220 may identify and/or otherwise act upon and/or manipulate features. As used herein, a "feature" is a particular structure and/or content attribute of data. Examples of features may include SIFT, audio features, color histograms, motion histograms, speech levels, loudness levels, and the like. Features may be time stamped. Each feature may be associated with a single frame in a group of frames. Features may include advanced content features such as time stamps, labels of people and objects in video, coordinates of objects and/or regions of interest, frame masks based on quantization of regions, and/or any other features that would occur to one skilled in the art after reviewing the entire content of the present invention. As further non-limiting examples, the features may include features describing spatial and/or temporal characteristics of a frame or group of frames. Examples of features that describe spatial and/or temporal characteristics may include motion, texture, color, brightness, edge count, blur, blockiness, and so forth. When in offline mode, all machine models as described in further detail below may be stored at and/or in memory of and/or accessible by the encoder. Examples of such models may include, but are not limited to, a convolutional neural network in whole or in part, a keypoint extractor, an edge detector, a saliency map builder, and the like. While in online mode, one or more models may be transmitted by the remote machine to feature extractor 220 in real-time or at some point prior to extraction.
Still referring to fig. 2, feature encoder 228 is configured to encode a feature signal, such as, but not limited to, that generated by feature extractor 220. In one embodiment, after extracting the features, the feature extractor 220 may pass the extracted features to the feature encoder 228. The feature encoder 228 may use entropy encoding and/or similar techniques (e.g., without limitation, as described below) to generate a feature stream, which may be passed to the multiplexer 232. The video encoder 216 and/or the feature encoder 228 may be coupled via an optimizer 224. The optimizer 224 may exchange useful information between the video encoder 216 and the feature encoder 228. For example, but not limited to, information related to the construction and/or length of entropy encoded codewords may be exchanged and reused via the optimizer 224 for optimal compression.
In one embodiment, with continued reference to fig. 2, video encoder 216 may generate a video stream; the video stream may be passed to a multiplexer 232. The multiplexer 232 may multiplex the video stream with the feature stream generated by the feature encoder 228. Alternatively or additionally, the video and feature bitstreams may be transmitted to different devices over different channels, different networks, and/or at different times or time intervals (time multiplexing). Each of the video stream and the feature stream may be implemented in any manner suitable for implementing any of the bitstreams as described herein. In one embodiment, the multiplexed video stream and feature stream may produce a mixed bit stream, which may be transmitted as described in further detail below.
Still referring to fig. 2, where VCM encoder 200 is in video mode, VCM encoder 200 may use video encoder 216 for both video and feature encoding. Feature extractor 220 may send the features to video encoder 216; the video encoder 216 may encode the features into a video stream that may be decoded by a corresponding video decoder 244. It should be noted that VCM encoder 200 may use a single video encoder 216 for both video encoding and feature encoding, in which case it may use different parameter sets for video and features; alternatively, VCM encoder 200 may have two independent video encoders 216, which may operate in parallel.
Still referring to FIG. 2, the system 200 may include and/or be in communication with a VCM decoder 236. The VCM decoder 236 and/or its elements may be implemented using any circuit and/or configuration type suitable for the configuration of the VCM encoder 200 as described above. VCM decoder 236 may include, but is not limited to, a demultiplexer 240. If multiplexed as described above, the demultiplexer 240 may operate to demultiplex the bit stream; for example, and without limitation, the demultiplexer 240 may separate a multiplexed bitstream containing one or more video bitstreams and one or more feature bitstreams into separate video and feature bitstreams.
With continued reference to fig. 2, VCM decoder 236 may include a video decoder 244. Without limitation, video decoder 244 may be implemented in any manner suitable for a decoder, as described in further detail below. In one embodiment, but not limited to, video decoder 244 may generate output video that may be viewed by a person or other creature and/or device having visual sensory capabilities.
Still referring to fig. 2, VCM decoder 236 may include a feature decoder 248. In one embodiment, but not limited to, feature decoder 248 may be configured to provide one or more decoded data to machine 260. The machine may include, but is not limited to, any computing device described below, including, but not limited to, any microcontroller, processor, embedded system, system on a chip, network node, or the like. The machine may operate, store, train, receive input from, generate output for, and/or otherwise interact with the machine model, as described in further detail below. The machine may be included in an internet of things (IoT), which is defined as a network of objects with processing and communication components, some of which may not be conventional computing devices such as desktop computers, laptop computers, and/or mobile devices. Objects in an IoT may include, but are not limited to, any device having an embedded microprocessor and/or microcontroller and one or more components for interfacing with a Local Area Network (LAN) and/or a Wide Area Network (WAN); the one or more components may include, but are not limited to, wireless transceivers that communicate, for example, in the 2.4-2.485 GHz range (e.g., Bluetooth transceivers following protocols published by the Bluetooth SIG of Kirkland, Washington), and/or network communication components that operate according to the MODBUS protocol published by Schneider Electric SE of Rueil-Malmaison, France, and/or the ZIGBEE specification of the IEEE 802.15.4 standard published by the Institute of Electrical and Electronics Engineers (IEEE). Those skilled in the art will recognize, after reviewing the entire disclosure of the present invention, various alternative or additional communication protocols and devices supporting such protocols, each of which is deemed to be within the scope of the present invention, which may be employed consistent with the present invention.
With continued reference to fig. 2, each of VCM encoder 200 and/or VCM decoder 236 may be designed and/or configured to perform any method, method step, or sequence of method steps in any of the embodiments described herein in any order and with any degree of repetition. For example, each of the VCM encoder 200 or VCM decoder 236 may be configured to repeatedly perform a single step or sequence until a desired or commanded result is achieved; the repetition of a step or sequence of steps may be performed iteratively and/or recursively, using a previously repeated output as a subsequently repeated input, aggregating the repeated inputs and/or outputs to produce an aggregate result, thereby reducing or shrinking one or more variables (e.g., global variables) and/or partitioning a larger processing task into a set of iteratively addressed smaller processing tasks. VCM encoder 200 and/or VCM decoder 236 may each perform any step or sequence of steps as described herein in parallel, e.g., performing steps two or more times simultaneously and/or substantially simultaneously using two or more parallel threads, processor cores, etc.; the division of tasks between parallel threads and/or processes may be performed according to any protocol suitable for dividing tasks between iterations. Those skilled in the art will appreciate, upon review of the entire disclosure, that steps, sequences of steps, processing tasks, and/or data may be subdivided, shared, or otherwise processed in a variety of ways using iterative, recursive, and/or parallel processing.
Fig. 3 is a block diagram further illustrating an encoder with feature and object detection and rate distortion optimization. The input video is passed to a feature extractor which calculates and extracts relevant features and sends relevant information about these features (size, position, relevance score, labels, etc.) to an encoder which uses this information to adjust the rate distortion optimization so that regions with more relevant features are encoded with higher quality (higher bit rate). The distribution of available bandwidth is modulated based on a correlation map rather than the default content independent distribution used by the encoder. The relevant components depicted in fig. 3 include rate-distortion optimization 315, intra-prediction 320, transform/quantization 325, motion estimation 330, and entropy coding 335.
There are four basic phases of operation: feature extraction 305, feature correlation mapping 310, application of correlation mapping in a rate distortion optimization process, and encoding by video encoder 300 at a rate related to feature correlation. This process is further illustrated in the flow chart of fig. 7. At a high level, an encoding method with rate distortion optimization according to the present disclosure extracts features from an input video (step 705), generates a correlation map of the features, preferably at the encoding unit level (step 710), determines CU correlation scores (step 715), and encodes each CU at a rate determined at least in part by the CU's correlation scores. These steps are further described below.
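The flow of FIG. 7 can be sketched as follows; this is a simplified, hypothetical implementation for illustration only, and the helper names, the box-based relevance map, and the small 8 x 8 CU size are assumptions, not the claimed encoder:

```python
import numpy as np

def relevance_map_from_regions(height, width, boxes, value=1.0):
    # Step 705/710: pixels inside any detected region get a high relevance value,
    # all remaining pixels get a low (here zero) value.
    rmap = np.zeros((height, width), dtype=np.float32)
    for (x, y, w, h) in boxes:
        rmap[y:y + h, x:x + w] = value
    return rmap

def cu_relevance_scores(rmap, cu_size=8):
    # Step 715: average the per-pixel relevance over each cu_size x cu_size coding unit.
    scores = {}
    for cy in range(0, rmap.shape[0], cu_size):
        for cx in range(0, rmap.shape[1], cu_size):
            scores[(cx, cy)] = float(rmap[cy:cy + cu_size, cx:cx + cu_size].mean())
    return scores

# Step 720 (sketch): each CU's score then drives its share of the bit budget,
# e.g. more relevant CUs receive a lower quantization parameter (higher quality).
rmap = relevance_map_from_regions(64, 64, boxes=[(8, 8, 16, 16)])
scores = cu_relevance_scores(rmap, cu_size=8)
print(scores[(8, 8)], scores[(40, 40)])  # 1.0 inside the detected region, 0.0 outside
```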
Fig. 3 depicts the components of the proposed system and the connections between them. Feature extractor 305 generates a correlation map that is used by the encoder in two possible modes-explicit mode (dashed line) or implicit mode (solid line). The following is a detailed description of each component and mode.
Feature extraction
As described above, the feature extractor 305 performs a process of extracting relevant features from an input video. Feature extractor 305 may implement simpler picture processing and computer vision techniques (e.g., edge detection, line detection, object detection), or more complex techniques (e.g., convolutional neural networks (CNNs)) that can detect and identify objects and actions.
Any feature extraction process whose output can be mapped to pixel locations of the input picture can be used to generate the correlation map 310. In the example of detecting edges, lines, and objects, the pixels representing/containing edges, lines, or objects are assigned appropriately high correlation values, while the remaining pixels in the picture are assigned low correlation values. Each pixel of the input picture is assigned a correlation value in the correlation map.
In the example using a CNN, the output of any convolutional layer (also referred to as a feature map) is mapped back to the input pixels and used as the corresponding correlation values.
Embodiments described herein may be implemented and/or configured to perform object and event detection and annotation using a VCM encoder. The input picture is passed to a feature extractor and a video encoder. The video encoder is connected to the feature extractor 305 and may receive additional information about the input picture. Once the picture is processed by the feature extractor 305, relevant information about the detected objects and events is sent to the video encoder 300.
The feature extractor 305 extracts relevant information about objects and events in the input picture using feature models such as convolutional neural networks, keypoint extractors, edge detectors, saliency map constructors, and the like.
The output of the feature extractor 305 is a set of [ region, label ] pairs for each picture. The region may be represented as a bounding box (FIG. 4B), a contour (FIG. 4C), or other geometric representation. Tags are represented as words, strings, or other unique identifiers. Examples of input pictures and feature detection are shown in fig. 4A and 4B, respectively.
The bounding boxes 405, 410 may be represented using the upper-left corner coordinates and the width and height (x, y, w, h). The contour may be represented using a clockwise succession of corners, such as (x1, y1, x2, y2, x3, y3, x4, y4, x5, y5); the coordinates of adjacent corners can be used to implicitly draw edges between corners. The tag for each detection may be represented as a string representing the relevant word, such as "car", "person", etc.
Detections for the input picture are sent to the video encoder as triples, for example: [(x1, y1, w1, h1), "car", id1, (x2, y2, w2, h2), "person", id2]. The third parameter of each triple is the event id, which is equal to 0 if the detection is for an object rather than an event. The video encoder may copy or convert the information into annotations in an appropriate format that are added as metadata to the video stream or explicitly signaled.
To extend the concept of detection from objects to events, feature extractor 305 may process multiple consecutive pictures and combine single-picture detections into a higher abstraction. One example of this process is when the feature extractor 305 detects a car occupying the same spatial region in n consecutive pictures, and detects a person occupying a spatial region that moves closer and closer to the car region in subsequent pictures. In this case, the entire detection sequence may be abstracted as an event labeled "person is entering a car". The event is signaled using an event id that persists over multiple consecutive pictures and is interpreted by the video encoder, and the event list is transmitted to the video encoder as a set of [id, "event"] pairs. Object detection example: [(x1, y1, w1, h1), "car", 0, (x2, y2, w2, h2), "person", 0]. Event detection example: [(x1, y1, w1, h1), "car", 1, (x2, y2, w2, h2), "person", 1], [1, "person is entering car"].
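A small sketch of the pair/triple representation described above, with illustrative data only; the exact serialization format is not specified here:

```python
# Object-only detections: the third element (event id) is 0.
object_detections = [
    ((10, 20, 64, 32), "car", 0),
    ((80, 20, 16, 40), "person", 0),
]

# Event detections: both regions carry event id 1, resolved by the event list.
event_detections = [
    ((10, 20, 64, 32), "car", 1),
    ((60, 20, 16, 40), "person", 1),
]
event_list = [(1, "person is entering a car")]

def expand_annotations(detections, events):
    """Join detections with their event descriptions, as the encoder might do
    before inserting annotations into the video stream metadata."""
    descriptions = dict(events)
    return [
        {"region": region, "label": label, "event_id": eid,
         "event": descriptions.get(eid)}   # None when eid == 0 (no event)
        for region, label, eid in detections
    ]

print(expand_annotations(event_detections, event_list))
```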
The video encoder may map the detected region to coding units using the detection information. Fig. 5 gives an example of the mapping. The coding unit may be represented as a macroblock, a coding tree unit, or a coding unit according to the video coding standard used. Any coding unit that contains the entire region or any portion thereof is considered an Annotation Coding Unit (ACU). The encoder may use ACU information to adjust parameters of the encoding process, such as quantization, partitioning, prediction type, etc. In some cases, ACUs contain information that is considered to have a higher priority than the rest of the picture and is encoded accordingly, typically using more bandwidth, which corresponds, for example, to a lower quantization level. In the case of event detection, the encoder may preserve more detail, for example, using finer resolution of motion estimation and fractional motion vector precision.
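One possible way to derive the Annotation Coding Units for a bounding-box region is sketched below; it assumes axis-aligned boxes in pixel coordinates and square CUs, and is not the normative mapping:

```python
def annotation_coding_units(box, cu_size=64):
    """Grid coordinates (col, row) of every coding unit that overlaps the region.

    Any CU containing the whole region or any part of it counts as an ACU."""
    x, y, w, h = box
    first_col, last_col = x // cu_size, (x + w - 1) // cu_size
    first_row, last_row = y // cu_size, (y + h - 1) // cu_size
    return [(col, row)
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]

# A 100x50 object starting at pixel (70, 10) overlaps CUs (1, 0) and (2, 0)
# on a 64x64 CU grid, so both are treated as ACUs.
print(annotation_coding_units((70, 10, 100, 50)))
```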
Some embodiments disclosed herein may perform and/or be configured to perform object and event annotation signaling to a video decoder using metadata. As already described, information about the detections is passed from the feature extractor to the video encoder in the form of a set of pairs or triples. This information may be passed (copied) as is or converted to a different representation, which is then inserted as metadata into the video bitstream, for example using SEI (Supplemental Enhancement Information) messages. The following table shows one example of such SEI message syntax:
The SEI message contains the elements defined within the initial payloadSize bytes, as well as an additional payload of unspecified size reserved for future use and expansion.
Any decoder that implements SEI message parsing can extract these SEI messages from the bitstream and process the information about objects and events detected in the video sequence. The parsed information may be used by the decoder to generate text reports about objects and events in the video, or it may be used to present geometry along with text information (e.g., tags) overlaid on top of the video to help human viewers identify objects and events.
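Since the syntax table itself is not reproduced here, the following is only a rough, hypothetical illustration of how object and event annotations might be packed into an SEI-style payload; the field layout is an assumption made for illustration, not the syntax defined by this disclosure or by any standard:

```python
import struct

def pack_annotation_payload(objects, events):
    """Hypothetical flat serialization of [region, label, event_id] annotations.

    Assumed layout: object count, then per object x, y, w, h and event id
    (uint16 each) plus a length-prefixed UTF-8 label; then event count and,
    per event, an id plus a length-prefixed description string."""
    out = bytearray()
    out += struct.pack(">B", len(objects))
    for (x, y, w, h), label, event_id in objects:
        out += struct.pack(">5H", x, y, w, h, event_id)
        text = label.encode("utf-8")
        out += struct.pack(">B", len(text)) + text
    out += struct.pack(">B", len(events))
    for event_id, description in events:
        text = description.encode("utf-8")
        out += struct.pack(">HB", event_id, len(text)) + text
    return bytes(out)

payload = pack_annotation_payload(
    [((10, 20, 64, 32), "car", 1), ((60, 20, 16, 40), "person", 1)],
    [(1, "person is entering a car")],
)
print(len(payload), "payload bytes to carry, e.g., in a user-data SEI message")
```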
Embodiments described herein may be implemented and/or configured to perform explicit signaling of object and event annotations to a video decoder. Information about objects and events received from the feature extractor may also be converted into coding unit syntax elements that exist at the slice level or at the level of a Coding Tree Unit (CTU).
In one implementation, slice header (SH) information is used to signal the presence of objects or events in a given slice. If a slice contains an object or event, or a part of an object or event, the proposed syntax element signals the existing object or event to the decoder. The slice header information contains a list of coding units belonging to the annotated object or event in sequential raster scan order. The following table gives examples of SH elements:
Upon receiving the slice header information, the decoder parses the information and marks all CTUs containing parts of the object and event. In this implementation, the region containing the annotation object and event is always represented as a set of contiguous CTUs.
Fig. 5A to 5C depict examples of feature extraction that detects an object and outputs an object contour at the coding unit level.
Correlation map generation
each pixel belonging to an edge, line, object or any other region containing the relevant feature is assigned a value. Each pixel that does not belong to the relevant region is assigned a zero value or some other low value. In the following example, we will assume a value range between 0 and 1, and a real-valued number is assigned to each pixel. Without limitation, the proposed method also supports other numerical ranges.
The assigned values are application dependent and may be predetermined or normalized using information obtained from the feature extraction process. For example, if only horizontal lines are detected, all pixels belonging to these lines are assigned a value of 1.0, while all other pixels are assigned a value of 0.0. If the extraction process detects lines of many orientations, horizontal and vertical lines may be assigned a higher value than lines of non-cardinal orientation; for example, all cardinally oriented lines may be assigned a value of 1.0, all ordinally (diagonally) oriented lines a value of 0.75, and all other lines 0.5. In the case of object detection, if only one object is detected, it may be assigned a value of 1.0, but if several objects are detected in the same picture, each object may be assigned a different value based on the size of the object or the predetermined importance of a given class of objects. For example, the largest object may be assigned 1.0, and each subsequent object in order of size may be assigned a lower value. On the other hand, a detected face may be assigned a higher value than a car, or the like, regardless of size.
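A sketch of one such application-dependent value assignment follows; the particular values, the "face" priority class, and the ranking rule echo the examples above but are otherwise arbitrary assumptions:

```python
def assign_object_relevance(detections, class_priority=None):
    """Assign a relevance value in [0, 1] to each detected object.

    Objects of a prioritized class (e.g. "face") keep a fixed high value
    regardless of size; the remaining objects are ranked by area, largest
    first, each subsequent one receiving a lower value (floored at 0.25)."""
    class_priority = class_priority or {"face": 1.0}   # assumed example priority
    values = {}
    rest = []
    for box, label in detections:
        if label in class_priority:
            values[(box, label)] = class_priority[label]
        else:
            rest.append((box, label))
    rest.sort(key=lambda d: d[0][2] * d[0][3], reverse=True)   # sort by w * h
    for rank, (box, label) in enumerate(rest):
        values[(box, label)] = max(1.0 - 0.25 * rank, 0.25)
    return values

print(assign_object_relevance([
    ((0, 0, 100, 80), "car"),       # largest non-prioritized object -> 1.0
    ((120, 10, 20, 30), "face"),    # prioritized class -> 1.0 regardless of size
    ((200, 40, 30, 30), "person"),  # next by size -> 0.75
]))
```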
In figure 6A we depict a simple example of an 8 x 8 pixel block containing a feature with a contour, and figure 6B shows the resulting 8 x 8 correlation map. The full correlation map has the same dimensions as the input picture and is used by the encoder to map pixel correlation to Rate-Distortion Optimization (RDO) decisions.
Feature-based RDO
Typically, a video encoder does not make decisions at a single pixel level, but rather at a so-called Coding Unit (CU) level. These are typically rectangular blocks of dimensions such as 64 x 64 pixels, 32 x 32 pixels, 16 x 16 pixels, etc. Since RDO decisions are made at the level of a single or group of CUs, the correlation map values for the pixels are averaged to obtain a CU correlation score.
As in fig. 6A and 6B, a CU that contains a feature or partial feature will be designated as more relevant than all other CUs in a given picture, as indicated by the value 1 in fig. 6B.
Thus, the video encoder will attempt to encode each CU in a given picture taking the relevance score into account, in addition to all other considerations that exist in the RDO algorithm by default. In most cases, a CU with a lower relevance score will be encoded using a lower bit rate, and vice versa.
For each pixel p with correlation value v(p) in an N×N CU, the correlation score RS(CU) of the CU is calculated as the average of the per-pixel values:
RS(CU) = (1/N^2) · Σ v(p), where the sum is taken over all pixels p in the CU.
This value is then compared to all other RS(CU) values in the given picture, and a Relative Relevance Score (RRS) is calculated from the K per-CU scores, where K is the total number of coding units in the given picture.
At the time of encoding, RRS (CU) is calculated for each unit considered by the encoder. In other words, the encoder may estimate the RD cost of one 64×64 unit and calculate its RRS (CU) value, and then estimate the cost of four 32×32 subunits and calculate its RRS (CU) value.
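A sketch of this two-step scoring is given below. The averaging for RS(CU) follows the description above; normalizing RRS by the sum of all K scores in the picture is an assumption made for illustration, since the exact normalization is not reproduced here:

```python
import numpy as np

def rs(cu_pixels):
    """RS(CU): average of the per-pixel correlation values v(p) over an N x N CU."""
    return float(np.mean(cu_pixels))

def rrs(rs_values):
    """RRS(CU): each RS(CU) compared against all K CU scores in the picture.

    Normalization by the sum of all scores is assumed here for illustration."""
    total = sum(rs_values) or 1.0
    return [score / total for score in rs_values]

cu_scores = [rs(np.full((8, 8), v)) for v in (1.0, 0.5, 0.0, 0.5)]
print(cu_scores)          # [1.0, 0.5, 0.0, 0.5]
print(rrs(cu_scores))     # [0.5, 0.25, 0.0, 0.25]
```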
RRS(CU) is used by the RDO to adjust the bit rate allocation of each coding unit. The encoder may apply RRS to the coding parameters using two modes: (1) explicit mode, in which RRS(CU) is used to modulate coding decisions in specific stages (intra prediction, motion estimation, or transform/quantization); (2) implicit mode, in which RRS(CU) is used directly in the rate-distortion function. The following is a description of each mode.
Explicit mode
In explicit mode, the encoder uses RRS(CU) to modulate decisions in the following processes: (1) Intra prediction 320: the partitioning process is adjusted based on RRS(CU). Partitioning proceeds in stages, each stage performed at a higher partitioning depth. Higher depths result in smaller CUs, thus allowing finer details to be preserved. Only lower partitioning depths are estimated if RRS(CU) is low, and higher depths are estimated if RRS(CU) is high. In this way, computational resources are saved, and bit rate and quality are allocated according to relevance. (2) Motion estimation 330: the motion estimation accuracy and search range are adjusted based on RRS(CU). A low score turns off fractional motion vector accuracy and reduces the motion vector search range; a high score does the opposite. Again, the effect of the adjustment is similar to the intra case. (3) Transform/quantization 325: the transform type and quantization level are adjusted based on RRS(CU). Lower-scoring units use a simpler transform (e.g., a Hadamard transform rather than a discrete cosine transform), while higher-scoring units still use the full-complexity transform. The quantization level is adjusted based on RRS(CU) by applying a coefficient inversely proportional to the score directly to the quantization level (quantization parameter). Furthermore, the units with the highest RRS(CU) scores may use a transform skip mode and be encoded losslessly, since they are the regions of the picture containing the features of highest relevance. In addition to transform skip mode, this can be achieved using tools available for transform, scaling, and quantization in the VVC standard: disabling sub-block transform (SBT), disabling intra sub-partitioning (ISP), disabling multiple transform selection (MTS), disabling the low-frequency non-separable transform (LFNST), disabling joint coding of chroma residuals (JCCR), and disabling dependent quantization (DQ), as well as VVC tools for in-loop filtering: disabling the deblocking filter (DF), disabling sample adaptive offset (SAO), disabling the adaptive loop filter (ALF), and disabling luma mapping with chroma scaling (LMCS).
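As one simple realization of the quantization adjustment described above, the sketch below lowers the quantization parameter for highly relevant units and raises it for others; using an additive QP offset rather than a multiplicative coefficient, and the offset range, are assumptions made for illustration:

```python
def adjusted_qp(base_qp, rrs_cu, max_offset=6):
    """Lower the quantization parameter (finer quantization, more bits) for
    coding units with a high relative relevance score, and raise it for
    coding units with a low score."""
    offset = round(max_offset * (0.5 - rrs_cu) * 2)   # +max_offset .. -max_offset
    return max(0, min(51, base_qp + offset))

for score in (0.0, 0.5, 1.0):
    print(score, adjusted_qp(base_qp=32, rrs_cu=score))
# 0.0 -> 38 (coarser), 0.5 -> 32 (unchanged), 1.0 -> 26 (finer)
```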
Implicit mode
In implicit mode, RRS (CU) is directly used for the rate-distortion function. Since this function determines the cost of the coding decision, this adjustment implicitly affects all other aspects of the coding (segmentation, motion estimation, transform/quantization, etc.).
The standard RD function has the form J = D + λR, where J is the cost function, D is a distortion measure, R is the bit rate, and λ is the Lagrangian multiplier used for unconstrained optimization. The goal of the encoder is to find the set of encoding parameters that minimizes the cost function J (find min(J)). The adjusted RD function is J = D + λR·R, where λR is calculated as follows:
λR = λ(1 + c(d – RRS(CU)))
Here we assume that RRS(CU) is normalized to the range (0.0, 1.0), c is an adjustment coefficient from the range (0.0, 1.0), and d is an offset coefficient. For example, if c = 0.2 and d = 0.5, the formula becomes:
λR = λ(1 + 0.2(0.5 – RRS(CU)))
Thus, λR = 1.1λ when RRS(CU) = 0.0, λR = λ when RRS(CU) = 0.5, and λR = 0.9λ when RRS(CU) = 1.0.
The value of λR decreases with higher RRS(CU), resulting in higher bit rates for those coding units, and vice versa for coding units with lower relevance. The correct coefficients may be calculated based on the application and use case. They may also be trained using neural networks to achieve the desired rate-distortion cost for a given feature set.
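A sketch of the implicit-mode adjustment with the example coefficients above, assuming RRS(CU) is already normalized to (0.0, 1.0):

```python
def adjusted_lambda(lmbda, rrs_cu, c=0.2, d=0.5):
    """lambda_R = lambda * (1 + c * (d - RRS(CU)))"""
    return lmbda * (1.0 + c * (d - rrs_cu))

def rd_cost(distortion, rate, lmbda, rrs_cu):
    """Adjusted cost J = D + lambda_R * R used to compare coding decisions."""
    return distortion + adjusted_lambda(lmbda, rrs_cu) * rate

for score in (0.0, 0.5, 1.0):
    print(score, adjusted_lambda(1.0, score), rd_cost(100.0, 50.0, 1.0, score))
# lambda_R: 1.1 at RRS=0.0 (rate penalized more, fewer bits spent),
#           1.0 at RRS=0.5, 0.9 at RRS=1.0 (rate penalized less, more bits spent)
```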
Referring now to FIG. 8, an exemplary embodiment of a machine learning module 800 that may perform one or more machine learning processes as described herein is illustrated. The machine learning module may use a machine learning process to perform the determining, classifying, and/or analyzing steps, methods, processes, etc., as described herein. As used herein, a "machine learning process" is a process that automatically uses training data 804 to generate an algorithm to be executed by a computing device/module to produce an output 808 with given data as input 812; this is in contrast to non-machine-learning software programs in which commands to be executed are predetermined by a user and written in a programming language.
Still referring to fig. 8, "training data" as used herein is data that contains correlations that may be used by a machine learning process to model relationships between two or more data element categories. For example, and without limitation, training data 804 may include a plurality of data entries, each entry representing a set of data elements that are recorded, received, and/or generated together; the data elements may be related by shared presence in a given data entry, by proximity in a given data entry, and so forth. The plurality of data entries in the training data 804 may represent one or more trends in correlations between categories of data elements; for example, but not limited to, a higher value of a first data element belonging to a first data element category may tend to correlate with a higher value of a second data element belonging to a second data element category, thereby indicating a possible proportional or other mathematical relationship linking value belonging to the two categories. In training data 804, multiple categories of data elements may be correlated according to various correlations; the correlation may indicate causal and/or predictive links between categories of data elements, which may be modeled as relationships, e.g., mathematical relationships, through a machine learning process, as described in further detail below. The training data 804 may be formatted and/or organized by category of data elements, for example by associating the data elements with one or more descriptors corresponding to the category of data elements. As a non-limiting example, training data 804 may include data entered by a person or process in a standardized form such that the input of a given data element in a given field in a table may be mapped to one or more descriptors of a category. The elements in the training data 804 may be linked to the descriptors of the categories by tags, tokens, or other data elements; for example, but not limited to, training data 804 may be provided in a fixed length format, a format that links the location of the data to categories (e.g., comma Separated Value (CSV) format), and/or a self-describing format (e.g., extensible markup language (XML), javaScript object notation (JSON), etc.) so that a process or device can detect a category of data.
Alternatively or additionally, and with continued reference to fig. 8, the training data 804 may include one or more elements that are unclassified; that is, the training data 804 may not be formatted or contain descriptors for some elements of the data. Machine learning algorithms and/or other processes may use, for example, natural language processing algorithms, tokenization, detection of correlation values in raw data, etc., to classify training data 804 according to one or more classifications; the categories may be generated using correlation and/or other processing algorithms. As a non-limiting example, in a corpus of text, phrases that make up a number "n" compound words (e.g., nouns modified by other nouns) may be identified according to a statistically significant popularity of an n-gram that contains such words in a particular order; like a single word, such an n-gram may be classified as a linguistic element, e.g., a "word", in order to be tracked, thereby generating a new category as a result of statistical analysis. Similarly, in data entries that include some text data, the person's name may be identified by reference to a list, dictionary, or other term schema, allowing for special classification by machine learning algorithms and/or automatic association of the data in the data entry with descriptors or to a given format. The ability to automatically classify data items may enable the same training data 804 to be applied to two or more different machine learning algorithms, as described in further detail below. Training data 804 used by machine learning module 800 may correlate any input data as described herein with any output data as described herein.
With further reference to fig. 8, the training data may be filtered, ranked, and/or selected using one or more supervised and/or unsupervised machine learning processes and/or models, as described in further detail below; such models may include, but are not limited to, a training data classifier 816. Training data classifier 816 may include a "classifier" as used in the present invention, which is a machine learning model defined, for example, as a mathematical model, a neural network, or a program generated by a machine learning algorithm known as a "classification algorithm" that classifies an input into a class or bin of data, thereby outputting the class or bin of data and/or a tag associated therewith, as described in further detail below. The classifier may be configured to output at least one data that marks or otherwise identifies data sets that are clustered together, found to be close under a distance metric as described below, and the like. The machine learning module 800 may generate the classifier using a classification algorithm defined as the process by which the computing device and/or any modules and/or components operating thereon derive the classifier from the training data 804. Classification may be performed using, but is not limited to, a linear classifier (e.g., without limitation, a logistic regression and/or a naive bayes classifier), a nearest neighbor classifier (e.g., a k nearest neighbor classifier), a support vector machine, a least squares support vector machine, fischer linear discriminant, a quadratic classifier, a decision tree, a boosting tree, a random forest classifier, a learning vector quantization, and/or a neural network-based classifier.
Still referring to fig. 8, the machine learning module 800 may be configured to perform an lazy learning process 820 and/or protocol, which may alternatively be referred to as a "lazy load" or "call-on-demand" process and/or protocol, by combining the inputs and training set to derive an algorithm to be used to produce the output on demand when the inputs to be converted to the output are received. For example, an initial set of simulations may be performed to cover an initial heuristic and/or "first guess" at the output and/or relationship. As a non-limiting example, the initial heuristic may include a ranking of associations between inputs and elements of training data 804. The heuristics may include selecting a certain number of highest ranked associations and/or training data 804 elements. The lazy learning may implement any suitable lazy learning algorithm including, but not limited to, K nearest neighbor algorithm, lazy naive bayes algorithm, etc.; those skilled in the art will appreciate, after reviewing the entire disclosure of the present invention, various lazy learning algorithms that may be applied to generate an output as described herein, including, but not limited to, lazy learning applications of machine learning algorithms as described in further detail below.
Alternatively or additionally, and with continued reference to FIG. 8, a machine learning process as described herein may be used to generate a machine learning model 824. As used herein, a "machine learning model" is a mathematical and/or algorithmic representation of the relationship between input and output, as generated using any machine learning process including, but not limited to, any of the processes described above, and stored in memory; once the input is created, it is submitted to a machine learning model 824, which generates an output based on the derived relationships. For example, but not limited to, a linear regression model generated using a linear regression algorithm may calculate a linear combination of input data using coefficients derived during a machine learning process to calculate output data. As a further non-limiting example, the machine learning model 824 may be generated by creating an artificial neural network (e.g., a convolutional neural network including an input layer of nodes, one or more middle layers, and an output layer of nodes). Connections between nodes may be created via a process of "training" the network in which elements from the set of training data 804 are applied to input nodes, and then appropriate training algorithms (e.g., levenberg-marquardt, conjugate gradients, simulated annealing, or other algorithms) are used to adjust the connections and weights between nodes in adjacent layers of the neural network to produce desired values at output nodes. This process is sometimes referred to as deep learning.
Still referring to fig. 8, the machine learning algorithm may include at least a supervised machine learning process 828. As defined herein, at least the supervised machine learning process 828 includes an algorithm that receives a training set correlating a plurality of inputs with a plurality of outputs and seeks to find one or more mathematical relationships correlating the inputs with the outputs, where each of the one or more mathematical relationships is optimal according to a certain criterion assigned to the algorithm using a certain scoring function. For example, a supervised learning algorithm may include inputs and outputs as described above in the present invention, as well as scoring functions representing desired forms of relationships to be detected between the inputs and outputs; for example, the scoring function may seek to maximize the probability that a given input and/or combination of element inputs is associated with a given output to minimize the probability that the given input is not associated with the given output. The scoring function may be expressed as a risk function representing an "expected loss" of an algorithm that relates input to output, where the loss is calculated as an error function representing the degree to which predictions generated by the relationship are incorrect when compared to a given input-output pair provided in training data 804. Those skilled in the art will appreciate, after reviewing the entire disclosure, the various possible variations of at least the supervised machine learning process 828 that may be used to determine the relationship between inputs and outputs. The supervised machine learning process may include a classification algorithm as defined above.
With further reference to fig. 8, the machine learning process may include at least an unsupervised machine learning process 832. As used herein, an unsupervised machine learning process is a process that derives inferences from a dataset without regard to labels; as a result, the unsupervised machine learning process may be free to discover any structure, relationships, and/or correlations present in the data. An unsupervised process may not require a response variable; an unsupervised process may be used, for instance, to find degrees of correlation between two or more variables, to determine patterns and/or inferences of interest, and the like.
Still referring to fig. 8, the machine learning module 800 may be designed and configured to create a machine learning model 824 using techniques for developing a linear regression model. The linear regression model may include an ordinary least squares regression, which aims to minimize the square of the difference between predicted and actual outcomes according to an appropriate norm (e.g., a vector-space distance norm) for measuring such differences; the coefficients of the resulting linear equation may be modified to improve the minimization. The linear regression model may include a ridge regression method, in which the function to be minimized includes a least squares function plus a term that multiplies the square of each coefficient by a scalar in order to penalize large coefficients. The linear regression model may include a Least Absolute Shrinkage and Selection Operator (LASSO) model, in which ridge regression is combined with multiplying the least squares term by a factor of 1 divided by twice the number of samples. The linear regression model may include a multi-task LASSO model, in which the norm applied in the least squares term of the LASSO model is the Frobenius norm, equal to the square root of the sum of squares of all terms. The linear regression model may include an elastic net model, a multi-task elastic net model, a least-angle regression model, a LARS LASSO model, an orthogonal matching pursuit model, a Bayesian regression model, a logistic regression model, a stochastic gradient descent model, a perceptron model, a passive-aggressive algorithm, a robust regression model, a Huber regression model, or any other suitable model that will occur to those of skill in the art upon review of the entire disclosure of the present invention. In one embodiment, the linear regression model may be generalized to a polynomial regression model, whereby a polynomial equation (e.g., a quadratic, cubic, or higher-order equation) providing the best fit between predicted outputs and actual outputs is sought; as will be apparent to those skilled in the art after reviewing the entire disclosure of the present invention, methods similar to those described above may be applied to minimize the error function.
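Off-the-shelf solvers exist for several of the penalized regressions listed above. The following sketch assumes the scikit-learn library purely as an example (any equivalent solver could be substituted) and fits ordinary least squares, ridge, and LASSO models to toy data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Toy data: outputs are a noisy linear combination of two input features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)    # ordinary least squares
ridge = Ridge(alpha=1.0).fit(X, y)    # least squares plus squared-coefficient penalty
lasso = Lasso(alpha=0.1).fit(X, y)    # least squares plus L1 penalty on coefficients

for name, model in (("OLS", ols), ("Ridge", ridge), ("LASSO", lasso)):
    print(name, model.coef_)
```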
With continued reference to fig. 8, the machine learning algorithm may include, but is not limited to, linear discriminant analysis. The machine learning algorithm may include quadratic discriminant analysis. The machine learning algorithm may include kernel ridge regression. The machine learning algorithm may include a support vector machine, including, but not limited to, regression processes based on support vector classification. The machine learning algorithm may include a stochastic gradient descent algorithm, including classification and regression algorithms based on stochastic gradient descent. The machine learning algorithm may include a nearest neighbor algorithm. The machine learning algorithm may include various forms of latent space regularization, such as variational regularization. The machine learning algorithm may include a Gaussian process, such as Gaussian process regression. The machine learning algorithm may include a cross-decomposition algorithm, including partial least squares and/or canonical correlation analysis. The machine learning algorithm may include a naive Bayes method. The machine learning algorithm may include a decision tree-based algorithm, such as a decision tree classification or regression algorithm. The machine learning algorithm may include ensemble methods, such as bagging meta-estimators, forests of randomized trees, AdaBoost, gradient tree boosting, and/or voting classifier methods. The machine learning algorithm may include a neural network algorithm, including a convolutional neural network process.
Referring now to fig. 9, an exemplary embodiment of a neural network 900 is illustrated. The neural network 900, also referred to as an artificial neural network, is a network of "nodes," i.e., data structures each having one or more inputs, one or more outputs, and a function that determines the outputs based on the inputs. Such nodes may be organized in a network such as, but not limited to, a convolutional neural network that includes an input layer 904 of nodes, one or more middle layers 908, and an output layer 912 of nodes. Connections between nodes may be created via a process of "training" the network, in which elements from a training data set are applied to the input nodes, and a suitable training algorithm (e.g., Levenberg-Marquardt, conjugate gradient, simulated annealing, or other algorithms) is then used to adjust the connections and weights between nodes in adjacent layers of the neural network to produce desired values at the output nodes. This process is sometimes referred to as deep learning. Connections may run solely from input nodes toward output nodes in a "feed-forward" network, or may feed the output of one layer back to the inputs of the same or a different layer in a "recurrent network."
Referring now to fig. 10, an exemplary embodiment of a node of a neural network is illustrated. A node may include, but is not limited to, a plurality of inputs xi that may receive values from inputs of the neural network containing the node and/or from other nodes. The node may perform a weighted sum of the inputs using weights wi multiplied by the respective inputs xi. Additionally or alternatively, a bias b may be added to the weighted sum of the inputs, such that the bias is added to an individual unit of the neural network layer independently of the inputs to that layer. The weighted sum may then be input into a function φ, which may generate one or more outputs y. The weight wi applied to an input xi may indicate whether the input is "excitatory," meaning that it has a strong influence on the one or more outputs y, for example by a corresponding weight having a large value, or "inhibitory," meaning that it has a weak influence on the one or more outputs y, for example by a corresponding weight having a small value. The values of the weights wi may be determined by training the neural network using training data, which may be performed using any suitable process as described above.
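The per-node computation just described can be written out directly; in the minimal sketch below, the inputs xi, weights wi, bias b, and the choice of tanh as the activation function φ are illustrative assumptions:

```python
import numpy as np

def node_output(x, w, b, phi=np.tanh):
    # One node: weighted sum of inputs plus bias, passed through an
    # activation function phi to produce the output y.
    return phi(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs x_i
w = np.array([0.8, 0.1, -0.4])   # weights w_i; large magnitudes act as excitatory/inhibitory
b = 0.2                          # bias added independently of the layer inputs
print(node_output(x, w, b))
```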
Still referring to fig. 10, a "convolutional neural network," as used in the present invention, is a neural network in which at least one hidden layer is a convolutional layer that convolves the inputs to that layer with a subset of inputs known as a "kernel," along with one or more additional layers such as pooling layers, fully connected layers, and the like. A CNN may include, but is not limited to, an extension of a deep neural network (DNN), where a DNN is defined as a neural network having two or more hidden layers.
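For illustration only, the core operation of such a convolutional layer, convolving a single-channel input with one kernel, may be sketched as follows; the edge-detecting kernel and the frame contents are arbitrary example values:

```python
import numpy as np

def conv2d(picture, kernel):
    # Valid-mode 2-D convolution of a single-channel input with one kernel:
    # the sliding elementwise multiply-and-sum performed by a convolutional layer.
    H, W = picture.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(picture[i:i + kh, j:j + kw] * kernel)
    return out

frame = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)  # simple vertical-edge detector
print(conv2d(frame, edge_kernel))
```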
Fig. 11 is a system block diagram illustrating an exemplary decoder 1100. Decoder 1100 may include an entropy decoding processor 1104, an inverse quantization and inverse transform processor 1108, a deblocking filter 1112, a frame buffer 1116, a motion compensation processor 1120, and/or an intra prediction processor 1124.
In operation, still referring to fig. 11, the bitstream 1128 may be received by the decoder 1100 and input to the entropy decoding processor 1104, which may entropy decode a portion of the bitstream into quantized coefficients. The quantized coefficients may be provided to an inverse quantization and inverse transform processor 1108, which may perform inverse quantization and inverse transformation to create a residual signal, which may be added to the output of the motion compensation processor 1120 or the intra prediction processor 1124 depending on the processing mode. The outputs of the motion compensation processor 1120 and the intra prediction processor 1124 may include block prediction based on previously decoded blocks. The sum of the prediction and the residual may be processed by a deblocking filter 1112 and stored in a frame buffer 1116.
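The decoder data flow just described may be summarized in the following toy sketch; the stand-in inverse transform, the simplistic deblocking filter, and all numeric values are assumptions made for illustration and do not reflect an actual conforming decoder:

```python
import numpy as np

def inverse_quantize(qcoeffs, qstep):
    # Scale quantized coefficients back toward their original magnitudes.
    return qcoeffs * qstep

def inverse_transform(coeffs):
    # Stand-in for the inverse transform; a real decoder would apply an IDCT here.
    return coeffs

def deblock(block):
    # Toy in-loop filter: light smoothing across the block.
    return (block + np.roll(block, 1, axis=1)) / 2.0

def decode_block(qcoeffs, prediction, qstep):
    # Dequantize and inverse-transform the residual, add the prediction
    # (intra or motion-compensated), then filter the reconstruction.
    residual = inverse_transform(inverse_quantize(qcoeffs, qstep))
    return deblock(prediction + residual)

frame_buffer = []
qcoeffs = np.full((4, 4), 2.0)        # entropy-decoded, quantized coefficients
prediction = np.full((4, 4), 100.0)   # block prediction from previously decoded samples
frame_buffer.append(decode_block(qcoeffs, prediction, qstep=5.0))
print(frame_buffer[0])
```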
In one embodiment, still referring to fig. 11, decoder 1100 may include circuitry configured to implement any of the operations described above in any of the embodiments described above in any order and with any degree of repetition. For example, decoder 1100 may be configured to repeatedly perform a single step or sequence until the desired or commanded result is achieved; the repetition of a step or sequence of steps may be performed iteratively and/or recursively, using a previously repeated output as a subsequently repeated input, aggregating the repeated inputs and/or outputs to produce an aggregate result, thereby reducing or shrinking one or more variables (e.g., global variables) and/or partitioning a larger processing task into a set of iteratively addressed smaller processing tasks. A decoder may perform any step or sequence of steps as described herein in parallel, e.g., performing the step two or more times simultaneously and/or substantially simultaneously using two or more parallel threads, processor cores, etc.; the division of tasks between parallel threads and/or processes may be performed according to any protocol suitable for dividing tasks between iterations. Those skilled in the art will appreciate, upon review of the entire disclosure, that steps, sequences of steps, processing tasks, and/or data may be subdivided, shared, or otherwise processed in a variety of ways using iterative, recursive, and/or parallel processing.
Fig. 12 is a system block diagram illustrating an example video encoder 1200. The example video encoder 1200 may receive an input video 1204 that may be initially partitioned or divided according to a processing scheme such as a tree-structured macroblock partitioning scheme (e.g., a quadtree plus a binary tree). An example of a tree structure macroblock partitioning scheme may include partitioning a picture frame into large block elements called Coding Tree Units (CTUs). In some implementations, each CTU may be further partitioned one or more times into multiple sub-blocks called Coding Units (CUs). The end result of this partitioning may include a set of sub-blocks, which may be referred to as Prediction Units (PUs). A Transform Unit (TU) may also be utilized.
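A simplified, purely illustrative quadtree split of a CTU into CUs is sketched below; the variance-based split criterion, the threshold, and the block sizes are assumptions chosen for the example rather than part of any standardized partitioning decision:

```python
import numpy as np

def split_ctu(block, top, left, min_size=16, threshold=50.0):
    # Recursively split a coding tree unit into coding units: a block whose
    # pixel variance exceeds the threshold is quartered, quadtree-style.
    size = block.shape[0]
    if size <= min_size or np.var(block) < threshold:
        return [(top, left, size)]  # leaf coding unit
    half = size // 2
    cus = []
    for dy in (0, half):
        for dx in (0, half):
            cus += split_ctu(block[dy:dy + half, dx:dx + half],
                             top + dy, left + dx, min_size, threshold)
    return cus

rng = np.random.default_rng(1)
ctu = rng.integers(0, 255, size=(64, 64)).astype(float)  # one 64x64 CTU
print(split_ctu(ctu, 0, 0))  # list of (top, left, size) coding units
```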
Still referring to fig. 12, an example video encoder 1200 may include: an intra prediction processor 1208; the motion estimation/compensation processor 1212, which may also be referred to as an inter prediction processor, is capable of constructing a motion vector candidate list, including adding global motion vector candidates to the motion vector candidate list; a transform/quantization processor 1216; an inverse quantization/inverse transform processor 1220; an in-loop filter 1224; a decoded picture buffer 1228; and/or an entropy encoding processor 1232. The bitstream parameters may be input to the entropy encoding processor 1232 for inclusion in the output bitstream 1236.
In operation, with continued reference to fig. 12, for each block of a frame of input video, it may be determined whether to process the block via intra-picture prediction or using motion estimation/compensation. The block may be provided to an intra prediction processor 1208 or a motion estimation/compensation processor 1212. If the block is to be processed via intra prediction, the intra prediction processor 1208 may perform processing to output a prediction value. If the block is to be processed via motion estimation/compensation, the motion estimation/compensation processor 1212 may perform processing that includes building a motion vector candidate list, including adding global motion vector candidates to the motion vector candidate list, if applicable.
With further reference to fig. 12, a residual may be formed by subtracting a predicted value from the input video. The residual may be received by a transform/quantization processor 1216, which may perform a transform process, such as a Discrete Cosine Transform (DCT), to generate coefficients that may be quantized. The quantized coefficients and any associated signaling information may be provided to an entropy encoding processor 1232 for entropy encoding and inclusion in the output bitstream 1236. The entropy encoding processor 1232 may support encoding of signaling information related to encoding the current block. In addition, the quantized coefficients may be provided to an inverse quantization/inverse transform processor 1220 that may reproduce the pixels, which may be combined with the predicted values and processed by an in-loop filter 1224, the output of which may be stored in a decoded picture buffer 1228 for use by a motion estimation/compensation processor 1212, the motion estimation/compensation processor 1212 being capable of constructing a motion vector candidate list, including adding global motion vector candidates to the motion vector candidate list.
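The residual-transform-quantization path just described may be illustrated by the short sketch below; the 4x4 block size, the orthonormal DCT-II construction, and the quantization step are example assumptions, and entropy encoding of the resulting coefficients is omitted:

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis used to transform residual blocks.
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] = np.sqrt(1.0 / n)
    return m

def encode_block(block, prediction, qstep):
    # Residual = input - prediction, 2-D DCT, then quantization; the quantized
    # coefficients would be passed on to entropy coding.
    D = dct_matrix(block.shape[0])
    residual = block - prediction
    coeffs = D @ residual @ D.T
    return np.round(coeffs / qstep)

block = np.full((4, 4), 120.0)
block[1:3, 1:3] = 140.0                # some detail in the middle of the block
prediction = np.full((4, 4), 118.0)    # intra or motion-compensated prediction
print(encode_block(block, prediction, qstep=4.0))
```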
With continued reference to fig. 12, although some variations have been described in detail above, other modifications or additions are possible. For example, in some implementations, the current block may include any symmetric block (8×8, 16×16, 32×32, 64×64, 128×128, etc.) as well as any asymmetric block (8×4, 16×8, etc.).
In some implementations, still referring to fig. 12, a quadtree plus binary decision tree (QTBT) may be implemented. In QTBT, at the coding tree unit level, partition parameters of QTBT may be dynamically derived to adapt to local characteristics without sending any overhead. Subsequently, at the coding unit level, the joint classifier decision tree structure can eliminate unnecessary iterations and control the risk of false predictions. In some implementations, the LTR frame block update mode may be used as an additional option available at each leaf node of QTBT.
In some implementations, still referring to fig. 12, additional syntax elements may be signaled at different levels of the bitstream. For example, a coding tool may be enabled for an entire sequence by including an enable flag coded in a Sequence Parameter Set (SPS). Further, a Coding Tree Unit (CTU) flag may be coded at the CTU level.
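Purely as an illustration of level-dependent signaling, the sketch below writes one enable flag at the sequence level and one flag per CTU; the syntax element names and the toy bit writer are invented for this example and do not correspond to any actual bitstream syntax:

```python
class BitWriter:
    """Toy bitstream writer used only to illustrate flag signaling at
    different levels of a bitstream."""
    def __init__(self):
        self.bits = []

    def put_flag(self, name, value):
        self.bits.append((name, 1 if value else 0))

bs = BitWriter()
# Sequence level: one enable flag carried in the sequence parameter set.
bs.put_flag("sps_feature_tool_enabled_flag", True)
# CTU level: a per-CTU flag, sent only because the tool is enabled for the sequence.
for ctu_index in range(4):
    bs.put_flag(f"ctu_tool_flag[{ctu_index}]", ctu_index % 2 == 0)
print(bs.bits)
```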
Some embodiments may include a non-transitory computer program product (i.e., a physically embodied computer program product) storing instructions that, when executed by one or more data processors of one or more computing systems, cause at least one data processor to perform the operations described herein.
Still referring to fig. 12, encoder 1200 may include circuitry configured to implement any of the operations described above in any of the embodiments in any order and with any degree of repetition. For example, the encoder 1200 may be configured to repeatedly perform a single step or sequence until the desired or commanded result is achieved; the repetition of a step or sequence of steps may be performed iteratively and/or recursively, using a previously repeated output as a subsequently repeated input, aggregating the repeated inputs and/or outputs to produce an aggregate result, thereby reducing or shrinking one or more variables (e.g., global variables) and/or partitioning a larger processing task into a set of iteratively addressed smaller processing tasks. Encoder 1200 may perform any step or sequence of steps described herein in parallel, e.g., performing the step two or more times simultaneously and/or substantially simultaneously using two or more parallel threads, processor cores, etc.; the division of tasks between parallel threads and/or processes may be performed according to any protocol suitable for dividing tasks between iterations. Those skilled in the art will appreciate, upon review of the entire disclosure, that steps, sequences of steps, processing tasks, and/or data may be subdivided, shared, or otherwise processed in a variety of ways using iterative, recursive, and/or parallel processing.
With continued reference to fig. 12, a non-transitory computer program product (i.e., a physically embodied computer program product) may store instructions that, when executed by one or more data processors of one or more computing systems, cause at least one data processor to perform operations described in the present invention and/or steps thereof, including, but not limited to, any of the operations described above and/or any operations that the decoder 1100 and/or the encoder 1200 may be configured to perform. Similarly, computer systems are also described that may include one or more data processors and memory coupled to the one or more data processors. The memory may temporarily or permanently store instructions that cause the at least one processor to perform one or more of the operations described herein. In addition, the methods may be implemented by one or more data processors within a single computing system or distributed among two or more computing systems. Such computing systems may be connected via one or more connections, including connections over a network (e.g., the internet, a wireless wide area network, a local area network, a wide area network, a wired network, etc.), via direct connections between one or more of the plurality of computing systems, etc., and may exchange data and/or commands or other instructions, etc.
It should be noted that any one or more aspects and embodiments described herein may be conveniently implemented using one or more machines programmed according to the teachings of the present specification (e.g., one or more computing devices acting as a user computing device for an electronic document, one or more server devices such as a document server, etc.), as will be apparent to those of ordinary skill in the computer arts. As will be apparent to those of ordinary skill in the software arts, a skilled programmer may readily prepare appropriate software code based on the teachings of the present invention. Aspects and implementations of the software and/or software modules discussed above may also include suitable hardware for facilitating implementation of the software and/or the machine-executable instructions of the software modules.
Such software may be a computer program product employing a machine-readable storage medium. A machine-readable storage medium may be any medium that can store and/or encode a sequence of instructions for execution by a machine (e.g., a computing device) and that causes the machine to perform any one of the methods and/or embodiments described herein. Examples of machine-readable storage media include, but are not limited to, magnetic disks, optical discs (e.g., CD-R, DVD, DVD-R, etc.), magneto-optical disks, read-only memory "ROM" devices, random access memory "RAM" devices, magnetic cards, optical cards, solid-state memory devices, EPROM, EEPROM, and any combination thereof. A machine-readable medium, as used herein, is intended to include a single medium as well as a collection of physically separate media, such as a collection of compact discs or one or more hard disk drives in combination with computer memory. As used herein, a machine-readable storage medium does not include transitory forms of signal transmission.
Such software may also include information (e.g., data) carried as data signals on a data carrier such as a carrier wave. For example, machine-executable information may be included as data-bearing signals embodied in a data carrier in which the signals encode sequences of instructions, or portions thereof, executed by a machine (e.g., a computing device), and any related information (e.g., data structures and data) which cause the machine to perform any one of the methods and/or embodiments described herein.
Examples of computing devices include, but are not limited to, electronic book reading devices, computer workstations, terminal computers, server computers, handheld devices (e.g., tablet computers, smartphones, etc.), network appliances, network routers, network switches, bridges, any machine capable of executing a sequence of instructions specifying an action to be taken by the machine, and any combination thereof. In one example, the computing device may include and/or be included in a kiosk.
FIG. 13 shows a diagrammatic representation of one embodiment of a computing device in the exemplary form of a computer system 1300 within which a set of instructions, for causing a control system to perform any one or more of the aspects and/or methodologies of the present invention, may be executed. It is also contemplated that a specially configured set of instructions for causing one or more devices to perform any one or more of the aspects and/or methods of the present invention may be implemented with multiple computing devices. Computer system 1300 includes a processor 1304 and memory 1308 that communicate with each other and with other components via bus 1312. The bus 1312 may comprise any of several types of bus structures including, but not limited to, a memory bus, a memory controller, a peripheral bus, a local bus, and any combination thereof using any of a variety of bus architectures.
The processor 1304 may include any suitable processor, such as, but not limited to, a processor incorporating logic circuitry (e.g., an Arithmetic and Logic Unit (ALU)) for performing arithmetic and logic operations, which may be regulated by a state machine and directed by operational inputs from memory and/or sensors; as a non-limiting example, the processor 1304 may be organized according to a von Neumann and/or Harvard architecture. The processor 1304 may include, incorporate, and/or be incorporated in a microcontroller, a microprocessor, a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Complex Programmable Logic Device (CPLD), a Graphics Processing Unit (GPU), a general purpose GPU, a Tensor Processing Unit (TPU), an analog or mixed-signal processor, a Trusted Platform Module (TPM), a Floating Point Unit (FPU), and/or a system on a chip (SoC).
Memory 1308 may include various components (e.g., machine-readable media) including, but not limited to, random access memory components, read-only components, and any combination thereof. In one example, a basic input/output system 1316 (BIOS), containing the basic routines to transfer information between elements within the computer system 1300, such as during start-up, may be stored in memory 1308. The memory 1308 may also include instructions (e.g., software) 1320 that embody any one or more of the aspects and/or methods of the present invention (e.g., stored on one or more machine-readable media). In another example, memory 1308 may also include any number of program modules, including, but not limited to, an operating system, one or more application programs, other program modules, program data, and any combination thereof.
Computer system 1300 may also include a storage device 1324. Examples of a storage device (e.g., storage device 1324) include, but are not limited to, a hard disk drive, a magnetic disk drive, an optical disc drive in combination with an optical medium, a solid-state memory device, and any combination thereof. The storage device 1324 may be connected to the bus 1312 by a suitable interface (not shown). Example interfaces include, but are not limited to, SCSI, Advanced Technology Attachment (ATA), serial ATA, Universal Serial Bus (USB), IEEE 1394 (FIREWIRE), and any combination thereof. In one example, the storage device 1324 (or one or more components thereof) may be removably interfaced with the computer system 1300 (e.g., via an external port connector (not shown)). In particular, the storage device 1324 and an associated machine-readable medium 1328 may provide non-volatile and/or volatile storage of machine-readable instructions, data structures, program modules, and/or other data for the computer system 1300. In one example, the software 1320 may reside, completely or partially, within the machine-readable medium 1328. In another example, the software 1320 may reside, completely or partially, within the processor 1304.
Computer system 1300 may also include an input device 1332. In one example, a user of computer system 1300 can enter commands and/or other information into computer system 1300 via input device 1332. Examples of input devices 1332 include, but are not limited to, an alphanumeric input device (e.g., keyboard), a pointing device, a joystick, a game pad, an audio input device (e.g., microphone, voice response system, etc.), a cursor control device (e.g., mouse), a touchpad, an optical scanner, a video capture device (e.g., still camera, video camera), a touch screen, and any combination thereof. The input device 1332 may interface to the bus 1312 via any of a variety of interfaces (not shown), including, but not limited to, a serial interface, a parallel interface, a game port, a USB interface, a FIREWIRE interface, a direct interface to the bus 1312, and any combination thereof. The input device 1332 may include a touch screen interface, which may be part of or separate from the display 1336, discussed further below. The input device 1332 may be used as a user selection device for selecting one or more graphical representations in a graphical interface as described above.
A user may also enter commands and/or other information into the computer system 1300 via storage device 1324 (e.g., removable disk drive, flash memory drive, etc.) and/or network interface device 1340. Network interface devices, such as network interface device 1340, may be used to connect computer system 1300 to one or more of a variety of networks, such as network 1344, and one or more remote devices 1348 connected thereto. Examples of network interface devices include, but are not limited to, network interface cards (e.g., mobile network interface cards, LAN cards), modems, and any combination thereof. Examples of networks include, but are not limited to, wide area networks (e.g., the internet, enterprise networks), local area networks (e.g., networks associated with offices, buildings, campuses, or other relatively small geographic spaces), telephony networks, data networks associated with telephony/voice providers (e.g., mobile communications provider data and/or voice networks), direct connections between two computing devices, and any combination thereof. Networks, such as network 1344, may employ wired and/or wireless modes of communication. In general, any network topology may be used. Information (e.g., data, software 1320, etc.) may be transferred to and/or from computer system 1300 via network interface device 1340.
Computer system 1300 may also include a video display adapter 1352 for transferring displayable pictures to a display device, such as display device 1336. Examples of display devices include, but are not limited to, liquid Crystal Displays (LCDs), cathode Ray Tubes (CRTs), plasma displays, light Emitting Diode (LED) displays, and any combination thereof. The display adapter 1352 and the display device 1336 may be used in combination with the processor 1304 to provide a graphical representation of aspects of the present invention. In addition to the display device, computer system 1300 may also include one or more other peripheral output devices including, but not limited to, audio speakers, a printer, and any combination thereof. Such peripheral output devices can be connected to bus 1312 via peripheral interface 1356. Examples of peripheral interfaces include, but are not limited to, serial ports, USB connections, FIREWIRE connections, parallel connections, and any combination thereof.
The foregoing is a detailed description of illustrative embodiments of the invention. Various modifications and additions may be made without departing from the spirit and scope of the invention. The features of each of the various embodiments described above may be combined with the features of the other described embodiments as appropriate to provide various feature combinations in the associated new embodiments. Furthermore, while the above describes a number of individual embodiments, what is described herein is merely illustrative of the application of the principles of the invention. Additionally, although particular methods herein may be illustrated and/or described as being performed in a particular order, the ordering is highly variable within the ordinary skill of implementing the methods, systems, and software in accordance with the invention. Accordingly, the description is intended to be illustrative only and not to be in any way limiting of the scope of the invention.
Exemplary embodiments have been disclosed above and illustrated in the accompanying drawings. It will be appreciated by those skilled in the art that various changes, omissions and additions may be made to the details disclosed herein without departing from the spirit and scope of the invention.

Claims (22)

1. A method of encoding video, comprising:
extracting a plurality of features in a picture in a video frame;
grouping at least a portion of the plurality of features into at least one object;
determining a region for the at least one object;
assigning an object identifier to the at least one object; and
encoding the object identifier into a bitstream.
2. The method of claim 1, wherein the plurality of features are extracted using a feature model.
3. The method of claim 1, wherein the region is represented by a geometric representation.
4. A method according to claim 3, wherein the geometric representation is one of a bounding box or a contour.
5. The method of claim 4, wherein the object identifier comprises a region identifier and a tag.
6. The method of claim 5, wherein when the geometric representation is a bounding box, the bounding box is identified by coordinates of a particular corner of the bounding box together with a width and a height.
7. The method of claim 5, wherein when the geometric representation is a bounding box, the bounding box is identified by coordinates of two diagonally opposite corners.
8. The method of claim 5, wherein the geometric representation is a contour and the contour is represented by a set of consecutive corner points.
9. The method of claim 1, wherein objects are further evaluated over a sequence of frames to determine events, event identifiers are associated with objects, and the event identifiers are encoded into the bitstream.
10. The method of claim 1, wherein the object identifier is inserted into the bitstream as supplemental enhancement information.
11. The method of claim 1, wherein the bitstream includes slice header information, and the slice header information is used to signal the presence of objects in a given slice.
12. The method of claim 1, further comprising:
generating a correlation map for the extracted features;
determining a correlation score for a portion of the picture using the correlation map; and
encoding the portion of the picture with a bit rate determined at least in part by the correlation score.
13. A method of encoding video, comprising:
extracting a set of features from a picture in a video;
generating a correlation map for the extracted features;
determining a correlation score for a portion of the picture using the correlation map; and
encoding the portion of the picture with a bit rate determined at least in part by the correlation score.
14. The method of claim 12, wherein the picture is represented by a plurality of coding units and the correlation map is determined at a coding unit level, wherein each coding unit has a coding unit correlation score.
15. The method of claim 13, wherein the encoding comprises assigning a bit rate to each coding unit.
16. The method of claim 14, wherein the correlation score comprises a relative correlation score for each coding unit.
17. The method of claim 15, wherein the encoding comprises at least one of intra-prediction, motion estimation, and transform quantization, and wherein the relative correlation score is used in an explicit rate-distortion optimization mode to alter the encoding during processing of the at least one of intra-prediction, motion estimation, and transform quantization.
18. The method of claim 15, wherein the relative correlation score is used for a rate distortion function to determine an adjusted bit rate for each coding unit.
19. The method of claim 13, further comprising:
grouping at least a portion of the extracted features into at least one object;
determining a region for the at least one object;
assigning an object identifier to the at least one object; and
encoding the object identifier into the bitstream.
20. An encoded video bitstream comprising:
encoded video content data, the video content comprising at least one object identified by an encoder, the encoder extracting a plurality of features in a picture in the video content;
at least one object identifier and associated object annotation; and
at least one event identifier and associated event annotation.
21. The bitstream of claim 20, further comprising a Supplemental Enhancement Information (SEI) message, wherein information related to the at least one object and at least one event is signaled in the SEI message.
22. The bitstream of claim 20, further comprising slice header information, wherein information related to at least one object and at least one event in a video slice is signaled in the slice header information.

Applications Claiming Priority (4)

Application Number | Priority Date | Filing Date | Title
US202163275740P | 2021-11-04 | 2021-11-04 |
US 63/275,740 | 2021-11-04 | |
US 63/275,700 | 2021-11-04 | |
PCT/US2022/047829 (WO2023081047A2) | 2021-11-04 | 2022-10-26 | Systems and methods for object and event detection and feature-based rate-distortion optimization for video coding

Publications (1)

Publication Number | Publication Date
CN118414829A | 2024-07-30

Family

ID=91983581

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202280084404.6A (pending) | System and method for feature-based rate-distortion optimization for object and event detection and for video coding | 2021-11-04 | 2022-10-26

Country Status (1)

Country Link
CN (1) CN118414829A (en)

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination