GB2624947A - Enhancement decoding implementation and method
- Publication number: GB2624947A
- Application number: GB2304371.4 (GB202304371A)
- Authority: GB (United Kingdom)
- Legal status: Pending
Classifications
All classifications fall under H04N 19/00 (H: Electricity; H04: Electric communication technique; H04N: Pictorial communication, e.g. television; H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals):
- H04N 19/423: characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation, characterised by memory arrangements
- H04N 19/30: using hierarchical techniques, e.g. scalability
- H04N 19/187: using adaptive coding characterised by the coding unit, the unit being a scalable video layer
- H04N 19/44: Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
- H04N 19/503: using predictive coding involving temporal prediction
- H04N 19/587: using predictive coding involving temporal sub-sampling or interpolation, e.g. decimation or subsequent interpolation of pictures in a video sequence
- H04N 19/61: using transform coding in combination with predictive coding
Abstract
Implementing an enhancement decoding (for video coding) comprising: obtaining 23 a preliminary set of residuals from an encoded enhancement signal representative of one or more layers of residual data, adding the preliminary set of residuals to a temporal buffer 24, the temporal buffer storing a set of temporal residual values; and, storing a representation 37 of the set of temporal residual values, the representation indicating whether one or more values of the set of temporal residual values is non-zero and wherein each element in the representation corresponds to a portion of the set of temporal residual values. The representation may be a collection of flags, each corresponding to a transform block or collection of pixels in a tile or larger block, indicating whether any of the pixels in the transform block is non-zero (or whether the transform block is just a zero block). This allows selective memory access commands to retrieve only non-zero data when needing to apply 31 the residual data to an upscaled base frame 28 for output of enhanced video data.
Description
ENHANCEMENT DECODING IMPLEMENTATION AND METHOD
BACKGROUND
A hybrid backward-compatible coding technology has been previously proposed, for example in WO 2013/171173, WO 2014/170819, WO 2019/141987, and WO 2018/046940, the contents of which are incorporated herein by reference. Further examples of tier-based coding formats include ISO/IEC MPEG-5 Part 2 LCEVC (hereafter 'LCEVC'). LCEVC has been described in WO 2020/188273A1, GB 2018723.3, WO 2020/188242, and the associated standard specification documents including ISO/IEC 23094-2 MPEG Part 2 Low Complexity Enhancement Video Coding (LCEVC), First edition October 2021, all of these documents being incorporated by reference herein in their entirety.
In these coding formats a signal is decomposed into multiple "echelons" (also known as "hierarchical tiers") of data, each corresponding to a "Level of Quality", from the highest echelon at the sampling rate of the original signal to a lowest echelon. The lowest echelon is typically a low-quality rendition of the original signal and other echelons contain information on corrections to apply to a reconstructed rendition in order to produce the final output.
LCEVC adopts this multi-layer approach where any base codec (for example Advanced Video Coding (AVC), also known as H.264, or High Efficiency Video Coding (HEVC), also known as H.265) can be enhanced via an additional low bitrate stream. LCEVC is defined by two component streams, a base stream typically decodable by a hardware decoder and an enhancement stream consisting of one or more enhancement layers suitable for software processing implementation with sustainable power consumption.
In the specific LCEVC example of these tiered formats, the process works by encoding a lower resolution version of a source image using any existing codec (the base codec) and encoding the difference between the reconstructed lower resolution image and the source using a different compression method (the enhancement).
The remaining details that make up the difference with the source are efficiently and rapidly compressed with LCEVC, which uses specific tools designed to compress residual data. The LCEVC enhancement compresses residual information on at least two layers, one at the resolution of the base to correct artefacts caused by the base encoding process and one at the source resolution that adds details to reconstruct the output frames. Between the two reconstructions the picture is optionally upscaled using either a normative up-sampler or a custom one specified by the encoder in the bitstream. In addition, LCEVC also performs some non-linear operations called residual prediction, which further improve the reconstruction process preceding residual addition, collectively producing a low-complexity smart content-adaptive (i.e., encoder driven) upscaling.
Since LCEVC and similar coding formats leverage existing decoders and are inherently backwards-compatible, there exists a need for efficient and effective integration with existing video coding implementations without complete re-design. While the LCEVC standard is published and well-known, there is no public information about how to implement a decoding, for example in a chipset.
The approach of LCEVC being a codec agnostic enhancer based on a software-driven implementation, which leverages available hardware acceleration, also shows in the wider variety of implementation options on the decoding side. While existing decoders are typically implemented in hardware at the bottom of the stack, LCEVC basically allows for implementation on a variety of levels, i.e. from Scripting and Application to the OS and Driver level and all the way to the SoC and ASIC. In other words, there is more than one solution to implement LCEVC on the decoder side. Generally speaking, the lower in the stack the implementation takes place, the more device specific the approach becomes. Except for an implementation on ASIC level, no new hardware is needed.
Encoding and decoding may be used to compress and/or secure content communicated over a network, such as in a streaming service. Alternatively, encoding and decoding may be used in other data transportation/transmission contexts such as physical media (e.g. DVDs, portable flash memory).
The encoder may for example be implemented at a content creation, content distribution or content streaming service.
The decoder may for example be implemented in consumer hardware for viewing decoded content, such as in a display device (e.g. a television), or in a separate device for receiving encoded content and supplying decoded content to the display device (e.g. a set-top box or a DVD player).
It is desirable, at least in the short term, to implement LCEVC in a simple manner using existing architectures and designs, and there are a variety of hardware limitations which need to be addressed, such as low memory bandwidth.
Innovations and optimisations are sought which address the limitations of video decoder chipsets and facilitate and improve the introduction and implementation of enhancement decoders, such as LCEVC, into the wider video decoder ecosystems.
Retrofitting old chipsets with the ability to decode a coding scheme that was released after the production of the chipset is completely unprecedented. This means that even though there is the 'potential' to retrofit LCEVC to 'old' or legacy chipsets, it still poses many engineering challenges.
In a particular example, an LCEVC decoding process involves the process of decoding a set of values from a received bytestream, updating a temporal buffer using those values and then combining the updated temporal buffer values with those picture elements decoded using a base codec (and optionally upscaled). Each of these operations involves cumbersome memory operations. Moreover, using the LCEVC standard as an instruction manual would lead to each step of the processing pipeline being implemented as a separate module with communications to and from memory by each module, exacerbating the issues with entire frame-level memory operations. Reducing the type and volume of memory operations has the potential to have a hugely beneficial impact on the implementation.
SUMMARY OF INVENTION
According to an aspect of the invention there may be provided a method of implementing an enhancement decoding, comprising: obtaining a preliminary set of residuals from an encoded enhancement signal representative of one or more layers of residual data, adding the preliminary set of residuals to a temporal buffer, the temporal buffer storing a set of temporal residual values; and, storing a representation of the set of temporal residual values, the representation indicating whether one or more values of the set of temporal residual values is non-zero and wherein each element in the representation corresponds to a portion of the set of temporal residual values.
The representation provides implementation efficiencies by reducing memory bandwidth and computational load.
Moreover, the representation may be smaller and easier to move around than the temporal buffer. This yields savings because it means the process does not have to use up the memory bandwidth or perform the computation to do the operation of applying residuals in that particular area, however large that area is.
In preferred examples, a plane of residuals is not created from the temporal buffer for combination with a plane of (upscaled) base data. This is facilitated by the representation and thus reduces memory bandwidth. In further implementations, modules of the pipeline may be combined to reduce the passing of data between them and simply refer to the representation to understand if data should be applied or not, i.e. used.
The step of obtaining a preliminary set of residuals may be thought of as receiving or retrieving residuals from a stream or data store and generally obtaining decoded residuals in any form suitable for updating the temporal buffer.
The one or more layers of residual data may be in the form of a residual map.
By adding, we mean updating, storing, writing or overwriting a value stored in the temporal buffer. This does not necessarily include all values in the buffer but at least one value in the buffer, as indicated by the decoded enhancement signal. Each temporal residual value may correspond to a picture element of a frame of an original input video signal, that is, each temporal residual has an inherent location in a map. By temporal residual values, we may not mean any particular limitation or function but simply use a label to refer to these residuals to functionally differentiate them from other residuals, such as the residuals plane which may no longer be created, and the preliminary set of residuals decoded from the enhancement signal.
The indicating may refer to positively indicating where the value is zero and therefore implicitly indicating non-zero, or vice versa. The representation may be in the form of flags. The flags may be used to tell the modules of a decoder where data needs to be applied. This is particularly beneficial in terms of reducing memory bandwidth and computational load.
The representation may be thought of as leveraging the sparsity of the temporal buffer, that is, the temporal buffer is a sparse matrix and typically includes many logical zero values which do not necessarily need to be written or applied in practice.
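To make this concrete, the following minimal C sketch shows one way such a representation might be built from a rasterised temporal buffer. The function name, the 16-bit residual type and the one-flag-per-4 x 4-block granularity are illustrative assumptions for this sketch, not anything mandated by this disclosure (width and height are assumed to be multiples of 4):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: derive one flag per 4x4 block of a rasterised temporal buffer.
 * A flag is true if any residual in the block is (logically) non-zero. */
void build_block_flags(const int16_t *buf, int width, int height,
                       bool *flags /* (width/4) * (height/4) entries */)
{
    int blocks_x = width / 4;
    for (int by = 0; by < height / 4; by++) {
        for (int bx = 0; bx < blocks_x; bx++) {
            bool non_zero = false;
            for (int y = 0; y < 4 && !non_zero; y++)
                for (int x = 0; x < 4; x++)
                    if (buf[(by * 4 + y) * width + (bx * 4 + x)] != 0) {
                        non_zero = true;
                        break;
                    }
            flags[by * blocks_x + bx] = non_zero;
        }
    }
}
```

Because the buffer is typically sparse, most flags come out false, and downstream stages can skip those blocks without touching the buffer at all.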
Preferably the method comprises: combining a reconstructed version of an original input video signal with the temporal residual values based on whether a corresponding element in the representation indicates there is non-zero data in a respective portion of the set of temporal residual values.
Examples of the concepts disclosed herein combine multiple stages of the LCEVC processing pipeline in a way that saves memory bandwidth and other resources. In particular, there may be no need to generate a residual plane, saving an entire resolution sized image worth of memory bandwidth. This is particularly useful when retrofitting this scheme to old hardware.
In a typical step-by-step implementation of an enhancement coding, the pipeline generates and then writes a residual plane out in the module which updates the temporal buffer, and then reads it (shortly after) again in the module which combines the data with the (upscaled) base. The present disclosure reduces this inefficiency.
Moreover, a large proportion of values in the residual plane are logical zero, perhaps as much as 90%, thus 90% of the reading and writing of the residual plane is essentially pointless as it will not be changing the base. This is addressed by the concepts herein.
The format of the temporal buffer can be arbitrary. Typically, in implementations it is in the same format as the image. This introduces a problem in that the blocks of data are square, i.e. they are not horizontal lines. Therefore, if the buffer is stored in a rasterised way and the image is large, the distance in memory from one line to the next is going to be quite large, causing cache issues. This is mitigated by the utilisation of the representation indicating where data in the buffer should be used or not read.
In other words, the temporal buffer in typical implementations may need to have a residual stored for each pixel of the final image. The inputs to the temporal buffer are command buffers describing changes to transform units (2 x 2 or 4 x 4 in size). The output may then be the final image, which is a rasterised image. That is, all the pixels of one row, then all the pixels in the next row, etc. This means that when copying in 4 x 4 blocks, an implementation process reads small amounts of data from four different sections of memory, as different rows are separated by a lot in memory (for HD images the pixel below will be 1920 pixels away in memory).
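For illustration only, the following small C program works through the stride arithmetic just described, assuming 16-bit residuals and a 1920-element-wide rasterised plane (the transform unit position is an arbitrary example):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Rasterised HD plane: the element directly below any pixel is a
     * full row (1920 elements) away in memory. */
    const int width = 1920;               /* elements per row */
    const size_t elem = sizeof(int16_t);  /* assumed residual size */

    /* Byte offset of each row of a 4x4 transform unit at (tu_x, tu_y): */
    int tu_x = 512, tu_y = 256;
    for (int r = 0; r < 4; r++) {
        size_t off = ((size_t)(tu_y + r) * width + tu_x) * elem;
        printf("TU row %d starts at byte offset %zu\n", r, off);
    }
    /* Consecutive rows differ by width * elem = 3840 bytes, so copying a
     * single 4x4 block touches four widely separated cache lines. */
    return 0;
}
```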
In some implementations a process may store the temporal buffer as a rasterised image, so it becomes convenient to apply it to the output. However, if there are areas that are logical zero, the control process needs to know where they are, i.e. from the representation, so that it only needs to use the pixels that carry residual data.
Therefore, the proposed process involves storing the temporal buffer in such a way so that it is easy to apply piecemeal, i.e. one 4x4 block at a time.
The step of combining may comprise reading a value in the temporal buffer in response to determining that the representation indicates the value is non-zero.
In this way, where the representation indicates there is no value to be added, the reading of the temporal buffer can be skipped, reducing memory operations and computational load.
The reconstructed version may be an upscaled rendition of a decoded rendition of the original input video signal encoded according to a base codec.
Where the representation indicates a value in the set of temporal residual values is non-zero, the method may comprise: reading the value in the set of temporal residual values from the temporal buffer; reading a corresponding picture element in the reconstructed version of an input video; and, combining the value and the picture element.
Optionally, the method may comprise writing the combination to the memory location of the picture element, the reconstructed version being a plane of video. Alternatively the method may comprise writing the combination to a plane stored in a further memory location.
Where the representation indicates a value in the set of temporal residual values is zero, the method may further comprise: reading a corresponding picture element in the reconstructed version of an input video; writing the corresponding picture element to a plane stored in a new memory location.
Alternatively, where the representation indicates a value in the set of temporal residual values is zero, the method may further comprise: continuing to process a next value in the set of temporal residual values and not modifying a corresponding picture element in the reconstructed version of an input video.
Embodiments may depend on the capabilities available in the chipset on which they are implemented. The implementation in which the base data is overwritten is particularly beneficial, that is, if the temporal flag is false, the pipeline does not need to do anything (rather, it moves on to the next area) because it is overwriting/updating the 'original' or received base, rather than generating a new plane.
The method may further comprise, after processing each value in the set of temporal residual values based on whether the representation indicates a value in the set of temporal residual values is zero or non-zero, outputting the plane to an output path. Thus, the final output plane may be sent for rendering.
Preferably, the method may further comprise: generating a set of commands for updating the temporal residual values based on the preliminary set of residuals; and, adding the preliminary set of residuals to the temporal buffer based on the commands. The commands may be an efficient way of updating the temporal buffer. In preferred examples the representation may be stored in a corresponding order to the commands, thus simplifying operations and memory storage.
The one or more layers of residual data may be generated based on a comparison of data derived from a decoded video signal and data derived from an original input video signal. The residual data may be structured according to the LCEVC standard.
The portion of the set of temporal residual values may comprise a plurality of the set of temporal residual values. Thus the representation may be smaller and easier to move around than the temporal buffer. The portion may be a block of the set of temporal residual values. The block may be a 2 x 2 or 4 x 4 block and optionally may correspond to a transform unit, or the portion may be a tile, such as a 32 x 32 tile corresponding to multiple blocks or multiple transform units. In this way, efficient processing of the temporal buffer may be performed. Optionally, the size of the representation may be variable and may be set based on the nature of the data, i.e. its sparseness.
In preferred examples, the elements of the representation are stored in an order of transform units received in the enhancement signal.
According to further aspect of the invention there may be provided a video decoder configured to decode an encoded enhancement signal, the video decoder comprising: a residuals decoding module configured to obtain a preliminary set of residuals from the encoded enhancement signal representative of one or more layers of residual data; and, a residuals processing module configured to: add the preliminary set of residuals to a temporal buffer, the temporal buffer storing a set of temporal residual values; and, store a representation of the set of temporal residual values, the representation indicating whether one or more values of the set of temporal residual values is non-zero and wherein each element in the representation corresponds to a portion of the set of temporal residual values.
The video decoder may further comprise: an apply residuals module configured to combine a reconstructed version of an original input video signal with the temporal residual values based on whether a corresponding element in the representation indicates there is non-zero data in a respective portion of the set of temporal residual values. The apply residuals module may be configured to: read a value in the temporal buffer in response to determining that the representation indicates the value is non-zero.
The reconstructed version may be an upscaled rendition of a decoded rendition of the original input video signal encoded according to a base codec.
Where the representation indicates a value in the set of temporal residual values is non-zero, the apply residuals module may be configured to: read the value in the set of temporal residual values from the temporal buffer; read a corresponding picture element in the reconstructed version of an input video; and, combine the value and the picture element.
The apply residuals module may be configured to: write the combination to the memory location of the picture element, the reconstructed version being a plane of video. Alternatively, the apply residuals module may be configured to: write the combination to a plane stored in a further memory location.
Where the representation indicates a value in the set of temporal residual values is zero, the apply residuals module may be configured to: read a corresponding picture element in the reconstructed version of an input video; write the corresponding picture element to a plane stored in a new memory location.
Where the representation indicates a value in the set of temporal residual values is zero, the apply residuals module may be configured to: continue to process a next value in the set of temporal residual values and not modify a corresponding picture element in the reconstructed version of an input video.
After processing each value in the set of temporal residual values based on whether the representation indicates a value in the set of temporal residual values is zero or non-zero, the apply residuals module may be configured to output the plane to an output path.
The residuals decoder module may be configured to: generate a set of commands for updating the temporal residual values based on the preliminary set of residuals; and, add the preliminary set of residuals to the temporal buffer based on the commands.
The one or more layers of residual data may be generated based on a comparison of data derived from a decoded video signal and data derived from an original input video signal.
The portion of the set of temporal residual values may comprise a plurality of the set of temporal residual values. The portion may be a block of the set of temporal residual values. The elements of the representation may be stored in an order of transform units received in the enhancement signal.
According to further aspects of the invention there may be provided a chipset for decoding an encoded enhancement video signal, the chipset configured to perform the above method.
Aspects of the invention may be embodied by a graphics processing unit (GPU), central processing unit (CPU), chipset or other hardware decoder configured to perform the above method. For example, a processor may be configured to execute a set of instructions to perform the method of any of the above aspects. As another example, a chipset may comprise at least one ASIC adapted to perform all or part of the above method. The chipset may further comprise a memory storing instructions which, when executed by one or more processors, cause the processors to perform a further part of the above method.
According to further aspects of the invention there may be provided a set-top box comprising a decoder as described above.
According to further aspects of the invention there may be provided a computer readable medium comprising instructions which, when executed by a processor, perform the above method.
According to further aspects of the invention there may be provided a method of retrofitting a decoder, chipset or set-top box to provide a decoder, chipset or set-top box configured to perform the above method. For example, after a customer purchases a device, the device may be retrofitted by means of the manufacturer providing an update for the device, wherein performing the update constitutes making a device having the features of one of the preceding aspects.
BRIEF DESCRIPTION OF DRAWINGS
Examples of systems and methods in accordance with the invention will now be described with reference to the accompanying drawings, in which:
Figure 1 shows a known, high-level schematic of an LCEVC decoding process;
Figure 2 shows an implementation pipeline for processing an enhancement signal;
Figure 3 shows an implementation pipeline for processing an enhancement signal according to examples of the present disclosure;
Figure 4 shows a further implementation pipeline for processing an enhancement signal according to examples of the present disclosure;
Figure 5 shows a flow diagram of an example of the present disclosure;
Figure 6 shows a further implementation pipeline for processing an enhancement signal according to examples of the present disclosure;
Figure 7 shows a flow diagram of an example of the present disclosure; and,
Figure 8 shows a flow diagram of an example of the present disclosure.
DETAILED DESCRIPTION
This disclosure describes an implementation for integration of a hybrid backward-compatible coding technology with existing decoders, optionally via a software update. In a non-limiting example, the disclosure relates to an implementation and integration of MPEG-5 Part 2 Low Complexity Enhancement Video Coding (LCEVC). LCEVC is a hybrid backward-compatible coding technology which is a flexible, adaptable, highly efficient and computationally inexpensive coding format combining a different video coding format, a base codec (i.e. an encoder-decoder pair such as AVC/H.264, HEVC/H.265, or any other present or future codec, as well as non-standard algorithms such as VP9, AV1 and others) with one or more enhancement levels of coded data.
Implementations described herein may also be suitable for other hierarchical coding schemes such as VC-6 or SHEVC.
Although one or more embodiments have been described in relation to LCEVC, aspects of the invention may also be implemented in other hierarchical coding schemes. In such embodiments, the 'base encoder' may correspond to an encoder configured to encode a low layer (corresponding to a low quality) of a hierarchical coding scheme. In such embodiments, the 'enhancement encoder' may correspond to an encoder configured to encode a high layer (corresponding to a high quality, i.e. a higher quality than the low layer) of the hierarchical coding scheme. For example, the base encoder may correspond to an encoder configured to encode a lowest layer of the hierarchical coding scheme and the enhancement encoder may correspond to an encoder configured to encode a (e.g. first) enhancement layer of the hierarchical coding scheme. More generally, the base encoder may correspond to an encoder configured to encode an nth layer of a hierarchical coding scheme and the enhancement encoder may correspond to an encoder configured to encode an (n+1)th layer of the hierarchical coding scheme.
A base encoder may itself be a multi-layer encoder. In particular, a base encoder may comprise a base encoder and one or more enhancement encoders.
An enhancement encoder may comprise multiple encoders. In particular, an enhancement encoder may comprise multiple enhancement encoders.
A base encoding may be the output of one or more layers of encoding. For example, a base encoding may be a first coding layer combined with one or more further (e.g. enhancement) coding layers.
An enhancement encoding may comprise one or more layers of enhancement.
For example, a base encoding may be a base layer (i.e. output by a single layer codec such as HEVC, VVC, and so forth) combined with a first layer of LCEVC residuals, whilst the enhancement encoding may be a second layer of LCEVC residuals. In a further example, a base encoding may be a lowest layer encoded in accordance with the SMPTE VC-6 standard combined with one or more VC-6 enhancement layers, whilst the enhancement encoding may be one or more 'higher' layer enhancement layers of the VC-6 standard.
Example hybrid backward-compatible coding technologies use a down-sampled source signal encoded using a base codec to form a base stream. An enhancement stream is formed using an encoded set of residuals which correct or enhance the base stream, for example by increasing resolution or by increasing frame rate. There may be multiple levels of enhancement data in a hierarchical structure. In certain arrangements, the base stream may be decoded by a hardware decoder while the enhancement stream may be suitable for being processed using a software implementation. Thus, streams are considered to be a base stream and one or more enhancement streams, where there are typically two enhancement streams possible but often one enhancement stream used. It is worth noting that typically the base stream may be decodable by a hardware decoder while the enhancement stream(s) may be suitable for software processing implementation with suitable power consumption. Streams can also be considered as layers.
The video frame is encoded hierarchically as opposed to using block-based approaches as done in the MPEG family of algorithms. Hierarchically encoding a frame includes generating residuals for the full frame, and then a reduced or decimated frame and so on. In the examples described herein, residuals may be considered to be errors or differences at a particular level of quality or resolution.
For context purposes only, as the detailed structure of LCEVC is known and set out in the approved draft standards specification, Figure 1 illustrates in a logical flow how LCEVC operates on the decoding side assuming H.264 as the base codec. Those skilled in the art will understand how the examples described herein are also applicable to other multi-layer coding schemes (e.g., those that use a base layer and an enhancement layer) based on the general description of LCEVC that is presented with reference to Figure 1. Turning to Figure 1, the LCEVC decoder 10 works at individual video frame level. It takes as an input a decoded low-resolution picture from a base (H.264 or other) video decoder 11 and the LCEVC enhancement data to produce a decoded full-resolution picture ready for rendering on the display view. The LCEVC enhancement data is typically received either in Supplemental Enhancement Information (SEI) of the H.264 Network Abstraction Layer (NAL), or in an additional data Packet Identifier (PID) and is separated from the base encoded video by a demultiplexer 12. Hence, the base video decoder 11 receives a demultiplexed encoded base stream and the LCEVC decoder 10 receives a demultiplexed encoded enhancement stream, which is decoded by the LCEVC decoder 10 to generate a set of residuals for combination with the decoded low-resolution picture from the base video decoder 11.
LCEVC can be rapidly implemented in existing decoders with a software update and is inherently backwards-compatible since devices that have not yet been updated to decode LCEVC are able to play the video using the underlying base codec, which further simplifies deployment.
In this context, there is proposed herein a decoder implementation to integrate decoding and rendering with existing systems and devices that perform base decoding. The integration is easy to deploy. It also enables the support of a broad range of encoding and player vendors, and can be updated easily to support future systems. Embodiments of the invention specifically relate to how to implement LCEVC in such a way as to provide for decoding of protected content in a secure manner. In specific embodiments, the invention may be manifested as chipset driver software. This could thus be used to build next generation chips or could be used to retrofit old chips to enable them to perform better than they can do now (without need for new hardware).
The proposed decoder implementation may be provided through an optimised software library for decoding MPEG-5 LCEVC enhanced streams, providing a simple yet powerful control interface or API. This allows developers flexibility and the ability to deploy LCEVC at any level of a software stack, e.g. from low-level command-line tools to integrations with commonly used open-source encoders and players. In particular, embodiments of the present invention generally relate to driver-level implementations and a System on a Chip (SoC) level implementation.
The terms LCEVC and enhancement may be used herein interchangeably, for example, the enhancement layer may comprise one or more enhancement streams, that is, the residuals data of the LCEVC enhancement data.
Figure 2 illustrates an implementation of enhancement decoding in which each step of the processing pipeline is implemented in a separate module. That is, by analogy, using the standard as an instruction manual.
The figure illustrates as input LCEVC data 22, that is, enhancement data. Consistent with figure 1, such enhancement data is encoded residual data usable to be combined with a decoded rendition of a base stream to reconstruct an original input video.
As indicated in the above-described figures, LCEVC streams are split into enhancement data and base data. In the illustration of figure 2, schematically it is shown that the process receives as inputs LCEVC compressed data 22 and a base uncompressed frame 27. With cross-reference to figure 1 we can see that the LCEVC decoder 10 receives the base uncompressed data from the base decoder 11 and the LCEVC compressed data from the demultiplexer 12. We focus here on the enhancement layer process pipeline, rather than the base layer pipeline. The base layer receives base data, decodes that data using a base codec, and optionally upscales that data to generate a rendition of the original input, stored in memory. We say upscaling is optional since the method may still correct for errors in the base even if the resolution, or other scaling such as bit depth, is not changed through upscaling. The LCEVC data 22 is typically binary data.
The LCEVC data is first parsed. In this illustrated example this is in the form of a residual decoder module 23 which functions to decode a set of values from the enhancement stream. In examples the parsing function may also control operation of the upscaler. The parsing function may be in the form of a decoder plug-in under control of a decoder integration layer, as set out in WO 2022/023747, the contents of which are incorporated herein by reference.
As set out in the LCEVC standard, the LCEVC residuals data is generated in the form of a temporal buffer 24 and a set of preliminary residuals to be applied to the buffer. The preliminary set of residuals stored in the stream may be referred to as 'deltas' when the temporal function is being used in LCEVC, i.e. the stream comprises the 'deltas'. A delta is the difference between the residual in the temporal buffer and the 'true/calculated' residual for that frame. In some cases they will be residuals and in others they will be modifying residuals. Throughout the present description we may refer to the preliminary set of residuals or 'deltas' interchangeably. Deltas can be seen as second order residuals, as they are a difference between a residual in the temporal buffer and a 'true' residual calculated for that frame.
That is, the residuals from the previous frame are stored in the temporal buffer 24 and a difference between the elements in the buffer and the elements of the frame are typically received in the stream (i.e. in entropy coded, transformed and quantised form).
To implement the decoder, the temporal buffer stores the residuals of the previous frame and the deltas decoded from the stream are applied to the temporal buffer to create the frame of residuals that are applied to the base decoder frame to generate the surface.
More information on temporal signalling in LCEVC can be found in the LCEVC standard specification, ISO/IEC 23094-2:2021(en) Low Complexity Enhancement Video Coding, and WO 2020/089618, which are both incorporated herein by reference in their entirety.
Conceptually, this process is illustrated in figure 2 as the LCEVC compressed data 22 is parsed by a decoding module 23 to derive the preliminary set of temporal residuals which are then combined with the temporal buffer 24 to create the frame of residuals data 25 for rendering.
In most implementations, LCEVC data is used to create a plane (or image, frame or surface, see clarification of terminology below) that contains all of the residuals and all of the places where there are no residuals as well. This may be referred to as a residual map. In this respect, often the plane, i.e. the map, has many logical zeros. Throughout the present description when we refer to zero we mean a logical zero that does not affect the image even if it may not be numerical zero.
That is, when the map is passed as an image to some sort of image processing block, whether that be a hardware block, a GPU shader, or a CPU function, that plane is added to an optionally upscaled plane, to get the final image.
Throughout the present application we will refer to the terms frame, image, plane and surface as interchangeable terms for a map or 2D array of elements or pixel elements representing a part of a video, i.e. the 2D set of pixel elements that combine to create the video. In typical implementations there is only one plane, such as Luma, but multiple planes may be supported. Generally, the terms map and surface are naming conventions.
In the present description, when we refer to passing data or sending data, in practice the data may not be sent; instead a pointer to memory may be passed, or another logical component of the decoder may simply be informed about that data.
In optional implementations, the residual decoder provides a list of operations or commands 26 needed to change the last frame's residual image (or residual map) to the current residual image. By map, again we mean the values of a frame or surface are stored in a manner in which the location of the values in the frame or surface can be identified.
Examples of configured operations (i.e. commands) include SET, CLEAR and APPLY. A SET operation instructs the pipeline to set or change the values of a block to specific values. A CLEAR operation instructs the pipeline to clear or delete the values of a 32 x 32 tile containing multiple blocks, and an APPLY operation instructs the pipeline to combine the values of a block with the values already in the buffer at that location. Commands may be sent individually or grouped together.
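A minimal sketch of how such commands might be represented and dispatched against a rasterised temporal buffer is given below. The struct layout, names and 16-bit residual type are illustrative assumptions for this sketch, not a normative LCEVC structure:

```c
#include <stdint.h>
#include <string.h>

enum cmd_op { CMD_SET, CMD_CLEAR, CMD_APPLY };

struct cmd {
    enum cmd_op op;
    int x, y;              /* top-left of the block or tile, in pixels */
    int size;              /* 2 or 4 for a block; 32 for a CLEAR tile  */
    const int16_t *vals;   /* size*size residual values (SET/APPLY)    */
};

/* Apply one command to a rasterised temporal buffer of given width. */
static void run_cmd(int16_t *buf, int width, const struct cmd *c)
{
    for (int r = 0; r < c->size; r++) {
        int16_t *row = buf + (size_t)(c->y + r) * width + c->x;
        switch (c->op) {
        case CMD_SET:   /* overwrite the block with specific values */
            memcpy(row, c->vals + r * c->size, c->size * sizeof(int16_t));
            break;
        case CMD_CLEAR: /* zero the whole 32x32 tile */
            memset(row, 0, c->size * sizeof(int16_t));
            break;
        case CMD_APPLY: /* accumulate deltas onto the stored residuals */
            for (int i = 0; i < c->size; i++)
                row[i] += c->vals[r * c->size + i];
            break;
        }
    }
}
```

In such a scheme, the same loop that runs the commands could also update the per-portion flags of the representation, as discussed below.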
In sum, in a first step, the bytestream is decoded to generate preliminary residual data. In a second, optional step, the bytestream data is converted into a set of commands 26 to be applied to a temporal buffer 24, the temporal buffer 24 storing the last frame's data. Conceptually, the next module functions to read the temporal buffer and perform commands to update the temporal buffer 24 to store the updated or modified residual values. We may refer to this module 29 as the residual extractor or processing module.
The processing module 29 generates a residual plane 25 by writing updated residual values from the updated temporal buffer 24. In practice this may involve storing the plane from the temporal buffer in a separate memory location.
The residual plane 25 may be sent to an 'apply residuals' module 21 which functions to apply the residual plane or map to the (upscaled) base frame.
The 'apply residuals' module 21 reads the residual plane from memory, reads the values of the (upscaled) base from memory, applies the residuals to the base frame and stores that in memory as a new plane, i.e. an output plane. That output may then be sent to the output path.
As is evident, such a pipeline involves many operations in and out of memory. This is exacerbated by the transfer of data between discrete modules but is not solely dependent on it. The read of the temporal buffer 201, write of a residual plane representing the residual map 202 and the application of that plane to an (upscaled) base plane and writing 203 to an output plane 21a involves operations in and out of memory.
In detail, the modularisation and memory reads are inefficient because the pipeline is generating and then writing the residual plane out in the 'residual extractor' or 'processing' module 29, and then reading it shortly after again in the 'apply residuals' module 21. Moreover, since a large percentage of values in the residual plane are zero, a large proportion of the reading and writing of the residual plane is unnecessary since there will be no change to the decoded (upscaled) base plane by the application of the zero value stored in the temporal buffer.
That is, conventionally, the pipeline would be performed in steps. With a temporal map, one would normally expect to effectively copy that to the to-be-applied map/plane/surface, and then add the new residuals to it. Then it gets applied: the plane is passed to something else, which reads it, reads the image that has been upscaled, applies the residuals to it and writes it back.
In examples of the present disclosure, a representation of the temporal buffer is stored which can be used on a granular basis to identify if a portion of the temporal buffer should be used or read, i.e. because it contains non-zero values. In this way the need to read and write a residual plane may be obviated since the values of the temporal buffer can be combined with the (upscaled) base without a residual plane being generated and skipping zero areas, not only reducing memory operations but also reducing the time taken overall for those operations since fewer are needed. Moreover, there is a synergistic effect when combined with the LCEVC structure. That is, a tile or a block can be set to have temporal coding turned on or off, leading to processing improvements when compared to whole-frame processing. The presence of data appearing on a TU may be enough to mark residuals as present. It may then be marked as not present if the tile it is on has been cleared, or the whole residual plane has been cleared.
Figure 3 illustrates the concepts of the invention at a high-level in schematic form. The steps of obtaining the preliminary set of residuals and, optionally, generating a set of commands for application to the temporal buffer remain unchanged. The processing module receives these commands and updates the temporal buffer 24. In accordance with examples described herein the processing module generates a representation 37 of the temporal buffer 24, the representation indicating which areas of the buffer are non-zero.
In implementations, the representation may be in the form of a set of flags, with each flag corresponding to a region of the temporal buffer, or temporal map. Each flag may indicate whether the region contains zeros or non-zeros. For example, a non-zero region may correspond to TRUE and a zero region to FALSE, or vice versa. That is, by indicating which values are zero, the representation also inherently indicates which values are non-zero.
Typically, the temporal buffer contains values corresponding to pixel elements of the video signal, a 2D array, and can be thought of as a map. Such an arrangement is non-limiting.
The temporal flags may each correspond to a region of that map; for example there may be a flag for each element or for a block of elements, e.g. 2 x 2, 4 x 4 or 32 x 32. In preferred examples a flag corresponds to a coding unit or a transform block of the temporal map.
Residuals data in LCEVC is typically coded in one of two formats. As stated in the LCEVC standard, a residuals plane is divided into coding units whose size depends on the size of the transform used. Coding units have dimension 2 x 2 if a 2 x 2 directional decomposition transform is used (DD) or dimension 4 x 4 if a 4 x 4 directional decomposition is used (DDS). The specifics of the decompositions are not important, but further details may be found in the LCEVC standard specification, ISO/IEC 23094-2 MPEG Part 2 Low Complexity Enhancement Video Coding (LCEVC), First edition October 2021, WO 2020/089618 and in WO 2020/205957, each of which is incorporated by reference in its entirety.
In preferred examples, each DD or DDS unit has a corresponding flag. Consequently, a 1920 x 1080 temporal map coded using DD units may be represented by 960 x 540 flags. Most preferably, a 32 x 32 tile has a corresponding flag; logically this is an important size in LCEVC as it is the size that is cleared, but it is also the unit that fits all transform units.
The flags may be arranged or sorted in any order suitable for identifying location in the map or 2D array, such as Z-order, row-wise or any typical 2D matrix encoding. For example, the flags may be indexed by x,y location.
In another example, the flags may be stored in the order of the commands received or in the order of the transform units decoded from the stream. In other words, and in a detailed implementation, the order in which the tiles are sent in an LCEVC stream is predefined and the order in which the transform units within those tiles are processed is also predefined. So if those are sent in that order, and the process also applies the temporal buffer in that order, and puts that in memory in that order as well, the flags may also be stored in that order, i.e. a 64-bit unsigned integer may be stored per 32 x 32 tile, 64 being the number of transform units in a tile when utilising a DDS transform.
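To illustrate the arithmetic: for a DDS (4 x 4) transform, a 32 x 32 tile holds 8 x 8 = 64 transform units, so one bit per transform unit fits exactly in a 64-bit word. A hedged C sketch of such per-tile masks follows (the names are hypothetical and the bit ordering is an assumption matching the predefined transform-unit order described above):

```c
#include <stdbool.h>
#include <stdint.h>

/* One 64-bit mask per 32x32 tile; bit n covers the nth 4x4 DDS transform
 * unit in the tile, in the same predefined order the TUs are processed. */

static inline void mark_tu_nonzero(uint64_t *tile_mask, int tu_index)
{
    *tile_mask |= (uint64_t)1 << tu_index;      /* 0 <= tu_index < 64 */
}

static inline bool tu_has_residuals(uint64_t tile_mask, int tu_index)
{
    return (tile_mask >> tu_index) & 1;
}

/* A CLEAR of the tile resets all 64 flags in one write: */
static inline void clear_tile(uint64_t *tile_mask) { *tile_mask = 0; }
```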
This may be beneficial in practice since, with a CPU implementation that is doing 'write-in-place' on the data as set out in detail below, it can read the LCEVC commands, go through those, update the temporal buffer and update the temporal flags, whilst also reading the data. In that particular case, the process can bypass generating a residual plane completely, because the processing and apply residuals modules can be merged together. Since the process is writing in place, it means that the steps can be performed on the data in a particular order.
Based on the representation of the temporal map, the pipeline may read only regions of the temporal buffer that are non-zero when combining the temporal buffer with the (upscaled) base. That is, instead of generating a residual plane, the 'apply residuals' module reads the flag for each region and, if it indicates the presence of non-zero data, the temporal buffer is read for that region and combined with the (upscaled) base. If not, that region is not read from the temporal buffer, nor from the (upscaled) base. Memory access is expensive.
In preferred implementations the 'processing' and 'apply residuals' modules may be combined as one module implementing the read, update and combine functionality. This further reduces reads from memory.
More detailed example implementations will now be described in the context of figures 4 to 7. In the illustration of figures 4 to 7, the 'apply residuals' and 'processing' modules are combined.
In a first example implementation, referred to here as the 'write-in-place' embodiment, the (upscaled) base is updated directly to where it is stored in memory. Note, not all chipsets support or allow such direct update of memory locations.
As shown in figure 4, the temporal buffer is read and updated. Temporal flags are then read for each area. If that area indicates non-zero values the temporal buffer is read for that area of the map and combined with the corresponding elements of the (upscaled) base plane and written back to memory. The (upscaled) base is then sent to the output path.
The steps may be summarised as:

WRITE IN PLACE (TEMP FLAG = TRUE)
1. Read base
2. Read residuals
3. Add + clamp base & residual
4. Write back to base

WRITE IN PLACE (TEMP FLAG = FALSE)
1. Nothing to do

This example may be particularly beneficial. A first efficiency is that there may be no need to generate a residuals plane. This saves an entire resolution sized image worth of memory bandwidth and is particularly useful when retrofitting the scheme to old hardware. A second efficiency is that, if the flag indicates zero values, the processing can take no action and move to the next area. That is, since the process overwrites or otherwise updates the original/received base, rather than generating a new image/plane, there are fewer memory reads and writes.
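A possible shape of the write-in-place inner loop is sketched below in C, assuming 8-bit base samples, 16-bit residuals, one flag per 4 x 4 block and dimensions that are multiples of 4; none of these choices is mandated by the disclosure:

```c
#include <stdbool.h>
#include <stdint.h>

static inline uint8_t clamp_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* Write-in-place: combine residuals into the (upscaled) base where the
 * per-block temporal flag is true; skip the block entirely otherwise. */
void apply_in_place(uint8_t *base, const int16_t *residuals,
                    const bool *flags, int width, int height)
{
    int blocks_x = width / 4;
    for (int by = 0; by < height / 4; by++)
        for (int bx = 0; bx < blocks_x; bx++) {
            if (!flags[by * blocks_x + bx])
                continue;                     /* TEMP FLAG = FALSE: no-op */
            for (int y = by * 4; y < by * 4 + 4; y++)
                for (int x = bx * 4; x < bx * 4 + 4; x++) {
                    size_t i = (size_t)y * width + x;
                    base[i] = clamp_u8(base[i] + residuals[i]); /* add+clamp */
                }
        }
}
```

Note how the false-flag path touches neither the residual buffer nor the base, which is the source of the memory bandwidth saving described above.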
Figure 5 illustrates the flow in detail.
First the process decodes the bytestream, step 501. Next, at step 502, the LCEVC commands are generated (as described above, this is an optional step). This may be thought of as turning the decoded bytestream into a set of commands. The temporal buffer is then read and updated, step 503. This may comprise reading the temporal buffer and performing commands to update the temporal buffer with the modified residual value.
At step 504, for an area, or each area, for example a block or tile, the process generates a temporal flag, where a flag being true indicates a non-zero residual for that area. On subsequent passes through the process this may comprise a temporal flag change or tag instead of a write from scratch. By tag, we mean marking the temporal flag for that area, that is a retained state of the decode.
The process continues on an area-by-area basis, e.g. a coding block by coding block, checking if the temporal flag is true for that area. This is represented at step 505. If the temporal flag is true, the process continues reading the (upscaled) base, step 506, reading the updated temporal buffer to obtain the residual value, step 507, combining the base and residual values to obtain a combined value, step 508, and then writing, or overwriting, that value to the (upscaled) base memory location so that it has the 'combined value', step 509. If the temporal flag is false, i.e. the temporal flag indicates that the corresponding area of the temporal map contains zero values, the process does nothing and moves on to the next area/block. In the flowchart of figure 5, the iterative nature of the area processing is indicated by checks, step 510, after each area has been processed. The steps of reading the (upscaled) base, reading the temporal buffer, combining the base and residual value and writing to base, or in the case of a false flag doing nothing, are repeated until the entire frame is completed or done. After this, the updated memory location storing the now updated (upscaled) base plane is output to the output path, at step 511.
Figure 6 illustrates a further example, referred to here as the 'copy-style-apply' example. Again, here we show the 'processing' module and the 'apply' residuals module as combined. In this example, the (upscaled) base is read and combined, being written to a destination at a further memory location before being output to the output path.
This example may be particularly suited to where the chipset does not support 'write-in-place'.
As shown in figure 6, the temporal buffer is read and updated. The temporal flags are then read for each area. If the flag for an area indicates non-zero values, the temporal buffer is read for that area of the map. The (upscaled) base is read and combined with those values. The combination is written to a new destination different from where the (upscaled) base was read from. Note, in this example, since the process is writing to a new destination, the value of the base is written to the new destination if the flag indicates zeros; the temporal buffer is not read for that area.
The steps may be summarised as:

COPY STYLE APPLY (TEMP FLAG = TRUE)
1. Read base
2. Read residuals
3. Add + clamp base & residuals
4. Write to destination

COPY STYLE APPLY (TEMP FLAG = FALSE)
1. Read base
2. Write to destination

An efficiency in this example is that there is no need to generate the residual plane. This saves an entire resolution-sized image worth of memory bandwidth and is particularly useful when retrofitting the scheme to older hardware.
Figure 7 illustrates this example process in detail.
The process begins by decoding the bytestream, step 701. Optionally, the bytestream is turned into 'commands'. That is, commands are generated, step 702. Next, the temporal buffer is read and updated, step 703, i.e. the temporal buffer is read and the process performs the commands to update the temporal buffer with the modified residual values.
As above, for an area (e.g. a block or a tile), a temporal flag is generated, step 704. The temporal flag is true if there are non-zero residuals for that area.
Iterating for each area, i.e. area by area or block by block, the process continues to check if the temporal flag is true, step 705. If the temporal flag is true, the process continues to read the (upscaled) base, step 706, read the residual value in the temporal buffer, step 707, and combine the base picture element with the residual value, step 708. The process continues, step 709, in this example by writing (or copying) the combined value to a (new) destination. That is, an 'updated base plane destination', which is different to the destination where the (upscaled) base was read from. As above, the process continues, at step 710, until the entire frame of the temporal buffer and base are done.
If the temporal flag indicates the area of the temporal buffer contains zero values, then the process reads the (upscaled) base, step 712, and writes that base value to the destination, step 713. At step 711, the destination plane, i.e. the 'final output plane', is output to the output path.
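A sketch of the copy-style-apply loop of figure 7 follows, under the same illustrative assumptions as the earlier write-in-place sketch (flat 8-bit plane, square areas, hypothetical names). The difference is that every area is written to a separate destination plane: flagged areas are combined, unflagged areas are copied from the base without touching the temporal buffer.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdbool.h>

static uint8_t clamp_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

/* Steps 705-713 for one frame: every area is written to the destination
 * plane. A true flag combines base and residuals (steps 706-709); a false
 * flag copies the base row-by-row (steps 712-713) and never reads the
 * temporal buffer for that area. */
static void copy_style_apply(const uint8_t *base, uint8_t *dest,
                             const int16_t *temporal, const bool *flags,
                             size_t width, size_t height, size_t area)
{
    size_t areas_per_row = (width + area - 1) / area;
    for (size_t ay = 0; ay * area < height; ay++) {
        for (size_t ax = 0; ax * area < width; ax++) {
            bool combine = flags[ay * areas_per_row + ax];
            for (size_t y = ay * area; y < (ay + 1) * area && y < height; y++) {
                size_t x0 = ax * area;
                size_t x1 = (ax + 1) * area < width ? (ax + 1) * area : width;
                if (!combine) {  /* TEMP FLAG = FALSE: straight copy */
                    memcpy(&dest[y * width + x0], &base[y * width + x0], x1 - x0);
                } else {         /* TEMP FLAG = TRUE: add + clamp */
                    for (size_t x = x0; x < x1; x++)
                        dest[y * width + x] = clamp_u8(base[y * width + x] +
                                                       temporal[y * width + x]);
                }
            }
        }
    }
}
```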
Figure 8 illustrates a flow of a general implementation of the above-described concept. At step 801, the process first obtains a first set of residuals. At step 802, the process adds the first residuals to the temporal buffer. The temporal buffer stores a set of temporal residual values, each temporal residual value corresponding to a pixel element of a frame of the original input signal. At step 803, the process comprises storing a representation of the set of temporal residual values, the representation indicating whether one or more values of the set of temporal residual values is non-zero and wherein each element in the representation corresponds to a portion of the set of temporal residual values.
Preferably, at step 804, the process comprises combining a reconstructed version of an input video with the temporal residual values based on whether a corresponding element in the representation indicates there is non-zero data in a respective portion of the set of temporal residual values. In preferred examples, the one or more layers of residual data are generated based on a comparison of data derived from a decoded video signal and data derived from an input video signal.
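Steps 801 to 803 can be folded into a single pass, as in this sketch. The flat-array layout and the fixed portion size are assumptions for illustration; the point is only that each element of the representation summarises whether one portion of the temporal residual values is non-zero.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Steps 801-803 in one pass: fold the obtained set of residuals into the
 * temporal buffer, then refresh the representation so each element records
 * whether its portion of the buffer now holds any non-zero value. */
static void update_buffer_and_representation(int16_t *temporal,
                                             const int16_t *residuals,
                                             bool *representation,
                                             size_t count, size_t portion)
{
    for (size_t p = 0; p * portion < count; p++) {
        bool nonzero = false;
        for (size_t i = p * portion; i < (p + 1) * portion && i < count; i++) {
            temporal[i] += residuals[i];  /* step 802: accumulate residuals */
            if (temporal[i] != 0)
                nonzero = true;
        }
        representation[p] = nonzero;      /* step 803: one element per portion */
    }
}
```

The combination of step 804 then consults representation[p] before touching the corresponding portion of the buffer, exactly as in the two apply variants sketched above.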
Embodiments of the disclosure may be performed at a decoder or in a module of a decoder, for example implemented in a client device or client device decoding from a data store. Methods and processes described herein can be embodied as code (e.g. software code) and/or data. The decoder may be implemented in hardware or software as is well-known in the art of data compression; for example, hardware acceleration using a specifically programmed graphical processing unit (GPU) or a specifically designed field programmable gate array (FPGA) may provide certain efficiencies. For completeness, such code and data can be stored on one or more computer-readable media, which may include any device or medium that can store code and/or data for use by a computer system. Where a computer system reads and executes the code and/or data stored on a computer-readable medium, the computer system performs the methods and processes embodied as data structures and code stored within the computer-readable storage medium. In certain embodiments, one or more of the steps of the methods and processes described herein can be performed by a processor (e.g. a processor of a computer system or data storage system). Generally, any of the functionalities described in this text or illustrated in the figures can be implemented using software, firmware (e.g. fixed logic circuitry), programmable or non-programmable hardware, or a combination of these implementations. The terms 'component' or 'function' as used herein generally represent software, firmware, hardware or a combination of these. For instance, in the case of a software implementation, the terms 'component' or 'function' may refer to program code that performs specific tasks when executed on a processing device or devices. The illustrated separation of components and functions into distinct units may reflect any actual or conceptual physical grouping or allocation of such software and/or hardware and tasks.
Claims (32)
CLAIMS
- 1. A method of implementing an enhancement decoding, comprising: obtaining a preliminary set of residuals from an encoded enhancement signal representative of one or more layers of residual data; adding the preliminary set of residuals to a temporal buffer, the temporal buffer storing a set of temporal residual values; and, storing a representation of the set of temporal residual values, the representation indicating whether one or more values of the set of temporal residual values is non-zero and wherein each element in the representation corresponds to a portion of the set of temporal residual values.
- 2. A method according to claim 1, further comprising: combining a reconstructed version of an original input video signal with the temporal residual values based on whether a corresponding element in the representation indicates there is non-zero data in a respective portion of the set of temporal residual values.
- 3. A method according to claim 2, wherein the step of combining comprises: reading a value in the temporal buffer in response to determining that the representation indicates the value is non-zero.
- 4. A method according to claim 2 or 3, wherein the reconstructed version is an upscaled rendition of a decoded rendition of the original input video signal encoded according to a base codec.
- 5. A method according to any of claims 2 to 4, further comprising, where the representation indicates a value in the set of temporal residual values is non-zero: reading the value in the set of temporal residual values from the temporal buffer; reading a corresponding picture element in the reconstructed version of an input video; and, combining the value and the picture element.
- 6. A method according to claim 5, further comprising: writing the combination to the memory location of the picture element, the reconstructed version being a plane of video.
- 7. A method according to claim 5, further comprising: writing the combination to a plane stored in a further memory location.
- 8. A method according to any of claims 2 to 7, further comprising, where the representation indicates a value in the set of temporal residual values is zero: reading a corresponding picture element in the reconstructed version of an input video; writing the corresponding picture element to a plane stored in a new memory location.
- 9. A method according to any of claims 2 to 7, further comprising, where the representation indicates a value in the set of temporal residual values is zero: continuing to process a next value in the set of temporal residual values and not modifying a corresponding picture element in the reconstructed version of an input video.
- 10. A method according to any of claims 6 or 7 and 8 or 9, further comprising, after processing each value in the set of temporal residual values based on whether the representation indicates a value in the set of temporal residual values is zero or non-zero, outputting the plane to an output path.
- 11. A method according to any preceding claim, further comprising: generating a set of commands for updating the temporal residual values based on the preliminary set of residuals; and, adding the preliminary set of residuals to the temporal buffer based on the commands.
- 12. A method according to any preceding claim, wherein the one or more layers of residual data are generated based on a comparison of data derived from a decoded video signal and data derived from an original input video signal.
- 13. A method according to any preceding claim, wherein the portion of the set of temporal residual values comprises a plurality of the set of temporal residual values.
- 14. A method according to any preceding claim, wherein the portion is a block of the set of temporal residual values.
- 15. A method according to any preceding claim, wherein elements of the representation are stored in an order of transform units received in the enhancement signal.
- 16. A video decoder configured to decode an encoded enhancement signal, the video decoder comprising: a residuals decoding module configured to obtain a preliminary set of residuals from the encoded enhancement signal representative of one or more layers of residual data; and, a residuals processing module configured to: add the preliminary set of residuals to a temporal buffer, the temporal buffer storing a set of temporal residual values; and, store a representation of the set of temporal residual values, the representation indicating whether one or more values of the set of temporal residual values is non-zero and wherein each element in the representation corresponds to a portion of the set of temporal residual values.
- 17. A video decoder according to claim 16 further comprising: an apply residuals module configured to combine a reconstructed version of an original input video signal with the temporal residual values based on whether a corresponding element in the representation indicates there is non-zero data in a respective portion of the set of temporal residual values.
- 18. A video decoder according to claim 17, wherein the apply residuals module is configured to: read a value in the temporal buffer in response to determining that the representation indicates the value is non-zero.
- 19. A video decoder according to claim 17 or 18, wherein the reconstructed version is an upscaled rendition of a decoded rendition of the original input video signal encoded according to a base codec.
- 20. A video decoder according to any of claims 17 to 19, wherein, where the representation indicates a value in the set of temporal residual values is non-zero, the apply residuals module is configured to: read the value in the set of temporal residual values from the temporal buffer; read a corresponding picture element in the reconstructed version of an input video; and, combine the value and the picture element.
- 21. A video decoder according to claim 20, wherein the apply residuals module is configured to: write the combination to the memory location of the picture element, the reconstructed version being a plane of video.
- 22. A video decoder according to claim 20, wherein the apply residuals module is configured to: write the combination to a plane stored in a further memory location.
- 23. A video decoder according to any of claims 17 to 22, wherein, where the representation indicates a value in the set of temporal residual values is zero, the apply residuals module is configured to: read a corresponding picture element in the reconstructed version of an input video; write the corresponding picture element to a plane stored in a new memory location.
- 24. A video decoder according to any of claims 17 to 22, wherein, where the representation indicates a value in the set of temporal residual values is zero, the apply residuals module is configured to: continue to process a next value in the set of temporal residual values and not modify a corresponding picture element in the reconstructed version of an input video.
- 25. A video decoder according to claims 21 or 22 and 23 or 24, wherein, after processing each value in the set of temporal residual values based on whether the representation indicates a value in the set of temporal residual values is zero or non-zero, the apply residuals module is configured to: output the plane to an output path.
- 26. A video decoder according to any of claims 16 to 25, wherein the residuals decoder module is configured to: generate a set of commands for updating the temporal residual values based on the preliminary set of residuals; and, add the preliminary set of residuals to the temporal buffer based on the commands.
- 27. A video decoder according to any of claims 16 to 26, wherein the one or more layers of residual data are generated based on a comparison of data derived from a decoded video signal and data derived from an original input video signal.
- 28. A video decoder according to any of claims 16 to 27, wherein the portion of the set of temporal residual values comprises a plurality of the set of temporal residual values.
- 29. A video decoder according to any of claims 16 to 28, wherein the portion is a block of the set of temporal residual values.
- 30. A video decoder according to any of claims 16 to 29, wherein elements of the representation are stored in an order of transform units received in the enhancement signal.
- 31. A chipset for decoding an encoded enhancement video signal, the chipset configured to perform the method of any of claims 1 to 15.
- 32. A computer readable medium comprising instructions which when executed by a processor perform the method of any of claims 1 to 15.
Priority Applications (2)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB2304371.4A (GB2624947A) | 2023-03-24 | 2023-03-24 | Enhancement decoding implementation and method |
| PCT/GB2024/050787 (WO2024201008A1) | 2023-03-24 | 2024-03-22 | Enhancement decoding implementation and method |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| GB202304371D0 | 2023-05-10 |
| GB2624947A | 2024-06-05 |
Family ID: 86228202
Country Status (2)

| Country | Link |
|---|---|
| GB | GB2624947A (en) |
| WO | WO2024201008A1 (en) |
Also Published As

| Publication Number | Publication Date |
|---|---|
| GB202304371D0 | 2023-05-10 |
| WO2024201008A1 | 2024-10-03 |