WO2024158376A1 - System and method for image enhancement - Google Patents
- Publication number: WO2024158376A1 (PCT/US2023/011361)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- frames
- attention
- matrix
- frame
- feature maps
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
Definitions
- Embodiments of the present disclosure relate to a system and a method for operating an image signal processor (ISP).
- ISP image signal processor
- An image/video capturing device, such as a camera or camera array, can be used to capture an image/video or a picture of a scene.
- Cameras or camera arrays have been included in many handheld devices, especially since the advent of social media that allows users to upload pictures and videos of themselves, friends, family, pets, or landscapes on the internet with ease and in real-time.
- Examples of camera components that operate together to capture an image/video include lens(es), image sensor(s), ISP(s), and/or encoders, just to name a few components thereof.
- the lens, for example, may receive and focus light onto one or more image sensors that are configured to detect photons. When photons impinge on the image sensor, an image signal corresponding to the scene is generated and sent to the ISP.
- the ISP performs various operations associated with the image signal to generate one or more processed images of the scene that can then be output to a user, stored in memory, or output to the cloud.
- an image enhancement system including a memory and at least one processor coupled to the memory.
- the processor is configured to extract a plurality of feature maps from a plurality of frames; generate one or more attention matrices corresponding to the plurality of feature maps; obtain one or more temporal attentions corresponding to the one or more attention matrices; and enhance quality of the frames corresponding to the one or more temporal attentions.
- a method of image enhancement includes extracting a plurality of feature maps from a plurality of frames; generating one or more attention matrices corresponding to the plurality of feature maps; obtaining one or more temporal attentions corresponding to the one or more attention matrices; and enhancing quality of the frames corresponding to the one or more temporal attentions.
- a non-transitory computer readable medium is provided.
- Computer readable instructions are stored on the non-transitory computer readable medium; when the instructions are executed by a computer system, the computer system performs a method of image enhancement.
- the method includes extracting a plurality of feature maps from a plurality of frames; generating one or more attention matrices corresponding to the plurality of feature maps; obtaining one or more temporal attentions corresponding to the one or more attention matrices; and enhancing quality of the frames corresponding to the one or more temporal attentions.
- FIG. 1 illustrates a block diagram of an exemplary image enhancement system, according to some embodiments of the present disclosure.
- FIG. 2 illustrates a block diagram of exemplary implementation of the image enhancement system of FIG. 1, according to some embodiments of the present disclosure.
- FIG. 3 illustrates a block diagram of exemplary temporal attention unit of the image enhancement system of FIG. 2, according to some embodiments of the present disclosure.
- FIG. 4 illustrates a block diagram of another exemplary implementation of the image enhancement system of FIG. 1, according to some embodiments of the present disclosure.
- FIG. 5A illustrates a block diagram of exemplary temporal attention unit of the image enhancement system of FIG. 4, according to some embodiments of the present disclosure.
- FIG. 5B illustrates a block diagram of exemplary temporal attention unit of the image enhancement system of FIG. 4, according to some embodiments of the present disclosure.
- FIG. 5C illustrates a block diagram of exemplary temporal attention unit of the image enhancement system of FIG. 4, according to some embodiments of the present disclosure.
- FIG. 6 illustrates a block diagram of exemplary motion compensation unit, according to some embodiments of the present disclosure.
- FIG. 7 illustrates a block diagram of exemplary image enhancement unit, according to some embodiments of the present disclosure.
- FIGs. 8A and 8B illustrate two exemplary input frames for image enhancement, according to some embodiments of the present disclosure.
- FIG. 8C illustrates an exemplary frame enhanced by image enhancement system, according to the prior art.
- FIG. 8D illustrates an exemplary frame enhanced by image enhancement system, according to some embodiments of the present disclosure.
- FIG. 9 illustrates a flowchart of an exemplary method of generating an enhanced image, according to some embodiments of the present disclosure.
- FIG. 10 is a block diagram illustrating an example of a computer system useful for implementing various embodiments set forth in the disclosure.
- terminology may be understood at least in part from usage in context.
- the term “one or more” as used herein, depending at least in part upon context may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense.
- terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context.
- the term “corresponding to” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.
- the term “camera” is used herein to refer to an image capture device or other data acquisition device.
- a data acquisition device can be any device or system for acquiring, recording, measuring, estimating, determining and/or computing data representative of a scene, including but not limited to two-dimensional image data, three-dimensional image data, and/or light field data.
- Such a data acquisition device may include optics, sensors, and image processing electronics for acquiring data representative of a scene, using techniques that are well known in the art.
- One skilled in the art will recognize that many types of data acquisition devices can be used in connection with the present disclosure, and that the present disclosure is not limited to cameras.
- the use of the term “camera” herein is intended to be illustrative and exemplary, but should not be considered to limit the scope of the present disclosure. Specifically, any use of such term herein should be considered to refer to any suitable data acquisition device.
- Image enhancement is one of the most important computer vision applications.
- Image enhancement units are generally deployed in all cameras, such as mobile phones or digital cameras.
- Image enhancement is a challenging problem since an image enhancement pipeline consists of multiple units that perform various operations.
- These conventional units may include, e.g., a super resolution unit, a denoising unit, and/or a high dynamic range (HDR) unit.
- HDR high dynamic range
- DNNs deep neural networks
- the present disclosure provides an image enhancement system that employs a learning-based image enhancement system with temporal attention unit to check the consistency between different input frames.
- a temporal attention unit to evaluate the consistency between different input frames (e.g., whether the first input frame is consistent with the second frame after motion compensation).
- Two different designs of the temporal attention unit, a sequential architecture and a parallel architecture, are provided in the present disclosure.
- the sequential architecture calculates the attention matrix within each frame and then checks the consistency between the attention matrices of different frames in temporal order.
- the parallel architecture directly calculates a temporal attention matrix among different frames, where the size of the attention matrix is related to the parameters of the input frames, for example, the number of input frames, the number of channels in each frame, and the width and height of each frame.
- the present disclosure can reduce the motion blur efficiently.
- image enhancement system 100 is provided in an implementation of the present disclosure.
- Image enhancement is the process of adjusting digital images so that the results are more suitable for display or further image analysis.
- Image enhancement improves the interpretability or perception of information in images for human viewers and provides “better” input for other automated image processing techniques. For example, noise can be removed, or an image can be sharpened or brightened, making it easier to identify key features.
- Image enhancement system 100 is configured to enhance the quality of input images by, for example, denoising, super resolution, color enhancement, demosaicing, de-raining, defogging, etc.
- Image enhancement system 100 as shown in FIG. 1, includes a motion compensation unit 120, a temporal attention unit 140, and an image enhancement unit 160.
- Motion compensation unit 120 is configured to align original input frames. Most imaging systems are subject to mechanical disturbances: they move because they are held by a person or mounted on a moving vehicle, and the structures they are mounted to are subject to mechanical vibration. If the angular motion of the camera over an integration time is comparable to the instantaneous field-of-view of a pixel, the image will be smeared. In general, image motion compensation refers to active control of something (optical element position, focal plane position, index of refraction, surface curvatures, etc.) to stabilize the object space line-of-sight (LOS) of the focal plane array (FPA). The goal is to compensate for unwanted motions of the camera, which is referred to as “aligning” the input frames.
- LOS object space line-of-sight
- Image enhancement system 200 is an example of image enhancement system 100 shown in FIG. 1, in which a temporal attention unit 142 employing the sequential architecture is described in detail.
- a plurality of aligned images processed by motion compensation unit 120 are input into temporal attention unit 142.
- the plurality of aligned images corresponds to a plurality of frames input into image enhancement system 200.
- the plurality of aligned images will be further processed by image enhancement unit 160 to improve the quality of the image as described above.
- a plurality of feature maps are extracted from corresponding frames.
- the number of the plurality of the input frames is N, and the input frames are numbered from 1 to N to ease description, where N is a positive integer and greater than one, which means at least two frames are processed by image enhancement system 200 at one time.
- the feature maps are numbered accordingly.
- Each of the N feature maps is generated by applying filters or feature extractors to the corresponding aligned frame.
- N attention matrices are then generated corresponding to the N feature maps, and the N attention matrices correspond to the N feature maps one by one.
- Attention matrices are scalar matrices representing the relative importance of layer activations at different two-dimensional (2D) spatial locations with respect to a target task, i.e., an attention matrix is a grid of numbers that indicates which 2D locations are important for a task. Important locations corresponding to big numbers are usually depicted in red in a heat map.
- the N frames are continuous, and a first attention matrix corresponding to a first aligned input frame is used to generate a second attention matrix corresponding to a second aligned input frame.
- FIG. 3 shows a detailed process for generating a temporal attention.
- a frame 1 of the N input frames as an example.
- filters or feature extractors on frame 1
- a feature map 1 is extracted according to frame 1.
- Feature extractors or filters help identify different features present in an image like edges, vertical lines, horizontal lines, bends, etc.
- Attention matrix 1 is generated corresponding to feature map 1.
- Attention matrix 1 may be a soft attention matrix.
- Soft attention uses “soft shading” to focus on regions. Soft attention can be learned using good old backpropagation/gradient descent (the same methods that are used to learn the weights of a neural network model).
- Soft attention maps typically contain decimals between 0 and 1.
- Hard attention uses image cropping to focus on regions. It cannot be trained using gradient descent because there is no derivative for the procedure “crop the image here.” Techniques like reinforcement learning can be used to train hard attention mechanisms.
- Hard attention maps consist entirely of 0s and 1s, with nothing in between; 1 corresponds to a pixel that is kept, and 0 corresponds to a pixel that is cropped out.
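The distinction between soft and hard attention described above can be sketched as follows. This is an illustrative example only, not part of the claimed system; the `soft_attention` and `hard_attention` helper names are hypothetical:

```python
import numpy as np

def soft_attention(scores):
    # soft attention: a differentiable SoftMax over raw scores;
    # every entry is a decimal in (0, 1) and the entries sum to 1
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

def hard_attention(scores):
    # hard attention: a non-differentiable 0/1 mask keeping only the
    # highest-scoring location (analogous to cropping the image there)
    mask = np.zeros_like(scores)
    mask[np.unravel_index(np.argmax(scores), scores.shape)] = 1.0
    return mask

scores = np.array([[0.1, 2.0],
                   [0.3, 0.5]])
soft = soft_attention(scores)   # all values strictly between 0 and 1
hard = hard_attention(scores)   # a single 1 at the top-scoring location
```

Because `soft_attention` is built from differentiable operations, its weights can be learned by gradient descent, whereas the argmax in `hard_attention` has no useful gradient.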
- attention matrix 1 is a soft attention matrix generated from feature map 1 by SoftMax.
- the number of channels in each frame of the N frames is C
- each attention matrix of the plurality of attention matrices is a C X C matrix.
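One plausible way to obtain such a C x C soft attention matrix from a single frame's feature map is a SoftMax over channel-similarity scores. The sketch below is an illustrative assumption, not the patent's exact network; the `frame_attention` helper name is hypothetical:

```python
import numpy as np

def frame_attention(feature_map):
    # feature_map: (C, H, W) activations for one aligned frame.
    # Entry (i, j) weighs how strongly channel i attends to channel j.
    C = feature_map.shape[0]
    flat = feature_map.reshape(C, -1)            # (C, H*W)
    scores = flat @ flat.T                       # (C, C) channel similarities
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilize SoftMax
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)      # each row sums to 1

rng = np.random.default_rng(0)
A1 = frame_attention(rng.normal(size=(8, 16, 16)))  # C = 8 channels
```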
- a first attention matrix corresponding to a first frame is used to generate a second attention matrix corresponding to a second aligned frame, and the second frame is immediately after the first aligned frame. That is, attention matrix i is used to form attention matrix i + 1, as shown in FIG. 3.
- the input frames to be compensated are sorted in a sequential order.
- the first attention matrix of the previous frame is utilized as an input when calculating the second attention matrix of the next frame. Consistency between the first attention matrix of the previous frame and the second attention matrix of the next frame is checked herein, as shown in FIG. 3.
- the consistency check can be implemented by, but is not limited to, feature map correlation, feature map concatenation, convolutional layers, and so on.
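As one illustration of the correlation option above, a consistency score between the attention matrices of consecutive frames might be computed as a normalized cross-correlation. The `consistency` function below is an assumption for illustration, not the claimed implementation:

```python
import numpy as np

def consistency(prev_attn, curr_attn):
    # normalized cross-correlation in [-1, 1]; values near 1 mean the two
    # frames attend to their channels in a similar way
    a = prev_attn - prev_attn.mean()
    b = curr_attn - curr_attn.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / denom) if denom > 0 else 0.0

rng = np.random.default_rng(0)
A_prev = rng.random((8, 8))
score_same = consistency(A_prev, A_prev)          # identical frames -> 1.0
score_other = consistency(A_prev, rng.random((8, 8)))
```

A low score would flag frames whose attention disagrees after motion compensation, which could then be down-weighted before enhancement.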
- the N attention matrices are fed, together with the N motion-compensated frames, into the standard image enhancement network for quality improvement.
- Image enhancement system 400 is an example of image enhancement system 100 shown in FIG. 1, in which a temporal attention unit 144 employing the parallel architecture is described in detail.
- a plurality of aligned images processed by motion compensation unit 120 are input into temporal attention unit 144.
- the plurality of aligned images corresponds to a plurality of frames input into image enhancement system 400.
- the plurality of aligned images will be further processed by image enhancement unit 160 to improve the quality of the image as described above.
- a plurality of feature maps are extracted from corresponding frames.
- the number of the plurality of the input frames is N, and the input frames are numbered from 1 to N to ease description, where N is a positive integer and greater than one. That is, at least two frames are processed by image enhancement system 400 at one time.
- the feature maps are numbered accordingly.
- Each of the N feature maps is generated by applying filters or feature extractors to the corresponding aligned frame.
- a one-shot attention matrix is then generated corresponding to the N feature maps, and the one-shot attention matrix corresponds to the N feature maps.
- the one-shot attention matrix is a scalar matrix representing the relative importance of layer activations at different two-dimensional (2D) spatial locations with respect to a target task, i.e., the one-shot attention matrix is a grid of numbers that indicates which hybrid spatial-temporal information is important for a task.
- the one-shot attention matrix can provide information like “which spatial location” plus “which temporal time step” are more important, for example, the top-left of the third frame is more important than the bottom-right of the second frame.
- N frames are input into and are compensated by image enhancement system 400 at one time. Referring to FIG. 4, feature extractors or filters are used to identify different features present in an image, like edges, vertical lines, horizontal lines, bends, etc. As shown in FIG. 4, by reshaping and transposing the frames, N feature maps are extracted correspondingly.
- the number of channels in each frame of the plurality of frames is C
- the number of the plurality of frames is N
- the width of each frame of the plurality of frames is W
- a height of each frame of the plurality of frames is H
- each of the feature maps has a dimension of C x H x W, where C, N, H, and W are positive integers and greater than 1.
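The three one-shot attention matrix sizes discussed below (N x N, CN x CN, and WH x WH) can all be obtained from the same stacked feature tensor by reshaping and transposing before the SoftMax. The sketch below assumes simple dot-product similarities and is illustrative only; `one_shot_attention` and its mode names are hypothetical:

```python
import numpy as np

def softmax_rows(s):
    # row-wise SoftMax, stabilized by subtracting the row maximum
    s = s - s.max(axis=1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=1, keepdims=True)

def one_shot_attention(features, mode):
    # features: (N, C, H, W) feature maps from N aligned frames;
    # reshaping/transposing selects which axis attention is computed over
    N, C, H, W = features.shape
    if mode == "frame":            # N x N: which frame is important
        flat = features.reshape(N, -1)
    elif mode == "frame_channel":  # CN x CN: which (frame, channel) pair
        flat = features.reshape(N * C, -1)
    elif mode == "spatial":        # WH x WH: which spatial location
        flat = features.transpose(2, 3, 0, 1).reshape(H * W, -1)
    else:
        raise ValueError(mode)
    return softmax_rows(flat @ flat.T)

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4, 5, 6))              # N=3, C=4, H=5, W=6
A_frame = one_shot_attention(feats, "frame")          # (3, 3)
A_fc = one_shot_attention(feats, "frame_channel")     # (12, 12)
A_sp = one_shot_attention(feats, "spatial")           # (30, 30)
```

The shapes make the trade-off concrete: the N x N matrix is tiny and cheap, while CN x CN and WH x WH grow with channels or spatial resolution.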
- three types of the one-shot attention matrix can be generated.
- the one-shot attention matrix may be a soft attention matrix.
- the one- shot attention matrix is a soft attention matrix generated from the N feature maps by SoftMax.
- the one-shot attention matrix is an N X N matrix.
- N is usually a small number, for example, 3, 5, 7, 8, 10, 15, etc.
- the N X N matrix is a relatively small matrix for image enhancement system 400.
- the efficiency of image enhancement system 400 in the present implementation is relatively high.
- the present implementation is suitable for a situation in which the quality of the original frames is acceptable, and it is easy to improve the quality of the input frames by image enhancement system 400.
- the quality of the frames can be enhanced by improving the resolution of the frames.
- the resolution of the original input frames is 720P
- the resolution of the target output video is 4K or Blu-ray. Since the size of the input frames is big, it is necessary to generate a small attention matrix to balance the efficiency and the speed of image enhancement.
- the one-shot attention matrix may be a soft attention matrix.
- the one-shot attention matrix is a soft attention matrix generated from the N feature maps by SoftMax.
- the one-shot attention matrix is a CN X CN matrix.
- N is usually a small number, for example, 3, 5, 7, 8, 10, 15, etc., while a common value of C is 32.
- the CN X CN matrix is a relatively big matrix for image enhancement system 400 compared with the N X N matrix described above.
- Employing the CN X CN matrix costs a longer time for image enhancement system 400 compared with employing the N X N matrix.
- the present implementation is suitable for a situation in which the quality of the original frames is bad, and it is complex to improve the quality of the input frames by image enhancement system 400.
- the quality of the frames can be enhanced by denoising the frames through complex algorithms.
- the priority of high quality of the enhanced frames is higher than the priority of the efficiency of image enhancement system 400.
- the resolution of the original input frames is 360P
- the resolution of the target output video is 1080P or higher. Since the size of the input frames is small and the quality of the input frames is bad, it is necessary to generate a big attention matrix, trading some efficiency and speed for enhancement quality.
- the one-shot attention matrix may be a soft attention matrix.
- the one-shot attention matrix is a soft attention matrix generated from the N feature maps by SoftMax.
- the one-shot attention matrix is a WH X WH matrix.
- the WH X WH matrix is a relatively big matrix for image enhancement system 400 compared with the N X N matrix and the CN X CN matrix described above.
- the WH x WH matrix is not relevant to the sequence of the frames, which means the WH x WH matrix does not focus on the motion of the elements in the background of each frame. For example, in an extreme situation, such as frames with a rainy or foggy background, the noise of the background is too strong, and it is meaningless to focus on the motion of the elements in the background.
- Employing the WH x WH matrix helps image enhancement to focus on the subjects of the frames and avoid investing too many resources in the background, balancing the efficiency and the speed of image enhancement.
- the quality of the frames can be enhanced by de-raining and/or defogging the frames through complex algorithms.
- Motion compensation unit 120 is configured to generate the frame for the temporal attention unit by compensating and aligning input frames.
- a schematic of the design of motion compensation unit 120 is shown in FIG. 6, and flow estimation modules are detailed in Table 1.
- Motion compensation unit 120 is implemented by a spatial transformer as shown in FIG. 6. Assume two frames with different exposure are provided as F_t and F_t+1; a compensated frame F'_t+1 can be obtained through the architecture shown in FIG. 6, where two flow estimation modules are used in the two-stage image warping process.
- a ×4 coarse estimate of the flow is obtained by early-fusing the two input frames and downscaling the spatial dimensions with ×2 strided convolutions.
- the estimated flow is upscaled with sub-pixel convolution, and the resulting coarse flow Δc_t+1 is applied to warp the target frame, producing F'c_t+1.
- the warped image is then processed together with the coarse flow and the original images through a fine flow estimation module. This module uses a single ×2 strided convolution and a final ×2 upscaling stage to obtain a finer flow map Δf_t+1.
- Output activations use tanh to represent pixel displacement in normalized space, such that a displacement of ±1 means maximum displacement from the center to the border of the image, and ±1 means the next/previous frame.
- the spatial transformer module is advantageous relative to other motion compensation mechanisms as it is straightforward to combine with an SR network to perform joint motion compensation and video SR.
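A minimal sketch of warping a frame with a flow field expressed in this normalized (tanh) space follows. It assumes nearest-neighbor sampling for brevity, where the actual spatial transformer module would use differentiable bilinear sampling; the `warp` helper is hypothetical:

```python
import numpy as np

def warp(frame, flow):
    # frame: (H, W); flow: (2, H, W) with flow[0] = vertical and
    # flow[1] = horizontal displacement in normalized [-1, 1] space,
    # where +/-1 spans the distance from the center to the border
    H, W = frame.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y = np.clip(np.round(ys + flow[0] * (H - 1) / 2), 0, H - 1).astype(int)
    src_x = np.clip(np.round(xs + flow[1] * (W - 1) / 2), 0, W - 1).astype(int)
    return frame[src_y, src_x]

frame = np.arange(16.0).reshape(4, 4)
identity = warp(frame, np.zeros((2, 4, 4)))  # zero flow leaves frame unchanged
```

In a trained module the flow would come from the coarse and fine estimation stages, and bilinear sampling would keep the whole pipeline differentiable end to end.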
- FIG. 7 illustrates a block diagram of exemplary image enhancement unit 160 according to an embodiment of the present disclosure.
- Residual in residual dense block (RRDB) is used by image enhancement unit 160 in the present implementation.
- RRDB blocks without batch normalization (BN) layers are set as the basic network building unit, which combines multi-level residual network and dense connections as depicted in FIG. 7. Removing BN layers has proven to increase performance and reduce computational complexity in different tasks including SR and deblurring.
- the RRDB blocks, i.e., the dense blocks in FIG. 7, are chained together to output the final enhanced frames.
- the proposed RRDB employs a deeper and more complex structure than the original residual block (RB) in super resolution generative adversarial networks (SRGAN). Specifically, as shown in FIG. 7, the proposed RRDB has a residual-in-residual structure, where residual learning is used at different levels. A similar network structure may also apply, such as a multilevel residual network.
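The residual-in-residual dense structure can be sketched as follows, modeling each convolution as a simple channel-mixing matrix for brevity. The layer counts, the residual scaling factor beta = 0.2, and all helper names are illustrative assumptions, not the patent's exact architecture:

```python
import numpy as np

def dense_block(x, weights, beta=0.2):
    # each "layer" (a 1x1 convolution modeled as a channel-mixing matrix)
    # sees the concatenation of all previous outputs (dense connections);
    # the last output is residually scaled back onto the block input
    feats = [x]
    for Wm in weights:
        inp = np.concatenate(feats, axis=0)                        # dense connection
        out = np.maximum(Wm @ inp.reshape(inp.shape[0], -1), 0.0)  # conv + ReLU
        feats.append(out.reshape(-1, *x.shape[1:]))
    return x + beta * feats[-1]

def rrdb(x, blocks, beta=0.2):
    # residual-in-residual: chain dense blocks (each with its own local
    # residual, no batch normalization), then add the outer residual
    out = x
    for w in blocks:
        out = dense_block(out, w, beta)
    return x + beta * out

C, H, W = 4, 8, 8
rng = np.random.default_rng(1)
def make_block(L=3):
    # layer i consumes the i+1 concatenated feature maps seen so far
    return [rng.normal(scale=0.1, size=(C, C * (i + 1))) for i in range(L)]

x = rng.normal(size=(C, H, W))
y = rrdb(x, [make_block(), make_block()])   # output keeps the input shape
```

The residual scaling keeps each block close to an identity mapping, which is what makes very deep chains of such blocks trainable.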
- the present disclosure can be used in a plurality of image enhancement systems, including but not limited to super resolution (SR) system, denoising system, and/or high dynamic ranging (HDR) system.
- SR super resolution
- HDR high dynamic ranging
- the temporal attention unit has good compatibility with the system by considering and checking the consistency between temporal attention of the previous frame and temporal attention of the next frame.
- the present disclosure improves the quality of the output frames enhanced by the image enhancement system.
- The advantages of the present disclosure are clearly illustrated in FIGs. 8A to 8D.
- FIG. 8A and 8B are two exemplary input frames for image enhancement, according to some embodiments of the present disclosure.
- FIG. 8C illustrates an exemplary frame enhanced by an image enhancement system according to the prior art.
- FIG. 8D illustrates an exemplary frame enhanced by the image enhancement system of the present disclosure.
- the enhanced frame in FIG. 8D has sharper borders and clearer color contrast than the enhanced frame in FIG. 8C; the quality of the frame enhanced by the image enhancement system of the present disclosure is thus greatly improved compared to the prior art.
- an image enhancement system includes a memory and at least one processor coupled to the memory.
- the processor is configured to extract a plurality of feature maps from a plurality of frames after motion compensation; generate one or more attention matrices corresponding to the plurality of feature maps; obtain one or more temporal attentions corresponding to the one or more attention matrices; and enhance quality of the frames corresponding to the one or more temporal attentions.
- Detailed implementations of the image enhancement system are similar to the implementations described above and will not be repeated here.
- FIG. 9 illustrates a flowchart of an exemplary method 900 of generating an enhanced image according to some embodiments of the present disclosure.
- Exemplary method 900 may be performed by an apparatus of image signal processing, e.g., such as image enhancement system 200 or 400.
- Method 900 may include operations 902-908 as described below. It is to be appreciated that some of the operations may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 9.
- a plurality of feature maps may be extracted from a plurality of frames after motion compensation.
- Motion compensation aims to align original input frames.
- Most imaging systems are subject to mechanical disturbances: they move because they are held by a person or mounted on a moving vehicle, and the structures they are mounted to are subject to mechanical vibration. If the angular motion of the camera over an integration time is comparable to the instantaneous field-of-view of a pixel, the image will be smeared. The goal is to compensate for unwanted motions of the camera, which is referred to herein as “aligning” the input frames.
- a plurality of images are aligned by motion compensation to generate the frames.
- a number of the plurality of the input frames is N, and the input frames are numbered from 1 to N to ease description, where N is a positive integer and greater than one. That is, at least two frames are processed at one time.
- the feature maps are numbered accordingly.
- Each of the N feature maps is generated by applying filters or feature extractors to the corresponding aligned frame.
- attention matrices are generated corresponding to the plurality of feature maps.
- Attention matrices are scalar matrices representing the relative importance of layer activations at different 2D spatial locations with respect to a target task, i.e., an attention matrix is a grid of numbers that indicates what 2D locations are important for a task. Important locations corresponding to big numbers are usually depicted in red in a heat map.
- N attention matrices are then generated corresponding to the N feature maps, and the N attention matrices correspond to the N feature maps one by one.
- the N frames are continuous, and a first attention matrix corresponding to a first aligned input frame is used to generate a second attention matrix corresponding to a second aligned input frame, where the second frame is immediately after the first aligned frame.
- the number of channels in each frame of the plurality of frames is C
- each attention matrix of the plurality of attention matrices is a C X C matrix, where C is a positive integer and greater than 1.
- a one-shot attention matrix is then generated corresponding to the N feature maps, and the one-shot attention matrix corresponds to the N feature maps.
- the one-shot attention matrix is a scalar matrix representing the relative importance of layer activations at different 2D spatial locations with respect to a target task, i.e., the one-shot attention matrix is a grid of numbers that indicates what 2D locations are important for a task.
- N frames are compensated at one time.
- the number of channels in each frame of the plurality of frames is C
- the number of the plurality of frames is N
- the width of each frame of the plurality of frames is W
- a height of each frame of the plurality of frames is H
- each of the feature maps has a dimension of C x H x W, where C, N, H, and W are positive integers and greater than 1.
- the one-shot attention matrix may be a soft attention matrix.
- the one-shot attention matrix is a soft attention matrix generated from the N feature maps by SoftMax.
- the one-shot attention matrix is an N X N matrix.
- N is usually a small number, for example, 5, 8, 10, 15, 20, 25, etc.
- the N X N matrix is a relatively small matrix for image enhancement system 400.
- the efficiency of image enhancement system 400 in the present implementation is relatively high.
- the present implementation is suitable for a situation in which the quality of the original frames is acceptable, and the quality of the input frames can be improved by image enhancement system 400 with relative ease.
- the quality of the frames can be enhanced by improving the resolution of the frames.
- the resolution of the original input frames is 1080P
- the resolution of the target output video is 4K or Blu-ray. Since the input frames are large, it is necessary to generate a small attention matrix to balance the enhancement quality and the speed of image enhancement.
- the one-shot attention matrix may be a soft attention matrix.
- the one-shot attention matrix is a soft attention matrix generated from the N feature maps by SoftMax.
- the one-shot attention matrix is a CN X CN matrix.
- N is usually a small number, for example, 5, 8, 10, 15, 20, 25, etc., and a common value of C is 32.
- the CN X CN matrix is a relatively big matrix for image enhancement system 400 compared with the N X N matrix described above.
- Employing the CN X CN matrix costs a longer time for image enhancement system 400 compared with employing the N X N matrix.
- the present implementation is suitable for a situation in which the quality of the original frames is poor, and it is complex to improve the quality of the input frames by image enhancement system 400.
- the quality of the frames can be enhanced by denoising the frames through complex algorithms.
- the priority of high quality of the enhanced frames is higher than the priority of the efficiency of image enhancement system 400.
- the resolution of the original input frames is 360P
- the resolution of the target output video is 1080P or higher. Since the input frames are small and their quality is poor, it is necessary to generate a large attention matrix, trading some efficiency and speed for enhancement quality.
- the one-shot attention matrix may be a soft attention matrix.
- the one-shot attention matrix is a soft attention matrix generated from the N feature maps by SoftMax.
- the one-shot attention matrix is a WH X WH matrix.
- the WH X WH matrix is a relatively big matrix for image enhancement system 400 compared with the N X N matrix and the CN X CN matrix described above.
- the WH x WH matrix is independent of the sequence of the frames, which means the WH x WH matrix does not focus on the motion of the elements in the background of each frame. For example, in an extreme situation, such as frames with a rainy or foggy background, the background noise is too strong, and it is not meaningful to focus on the motion of the elements in the background.
- Employing the WH X WH matrix helps image enhancement to focus on the subjects of the frames and avoid investing too many resources in the background, balancing the efficiency and the speed of image enhancement.
- the quality of the frames can be enhanced by de-raining and/or defogging the frames through complex algorithms.
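The three one-shot attention granularities described above (N X N across frames, CN X CN across frame-channel pairs, WH X WH across spatial positions) can be illustrated with a minimal NumPy sketch. This is not the disclosed network: the dot-product similarity, the scaling, and the frame-averaging in the spatial case are illustrative assumptions; only the matrix shapes follow the text.

```python
import numpy as np

def one_shot_attention(feature_maps, mode):
    """Soft one-shot attention over N feature maps of shape (N, C, H, W).

    mode 'n'  -> rows index frames              -> N x N matrix
    mode 'nc' -> rows index frame-channel pairs -> CN x CN matrix
    mode 'wh' -> rows index spatial positions   -> WH x WH matrix
    """
    n, c, h, w = feature_maps.shape
    if mode == 'n':
        rows = feature_maps.reshape(n, -1)
    elif mode == 'nc':
        rows = feature_maps.reshape(n * c, -1)
    else:  # 'wh': average over frames so rows index the H*W positions
        rows = feature_maps.mean(axis=0).reshape(c, -1).T
    scores = rows @ rows.T / np.sqrt(rows.shape[1])   # pairwise similarity
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)           # row-wise SoftMax

maps = np.random.default_rng(0).standard_normal((4, 3, 6, 6))  # N=4, C=3, H=W=6
print(one_shot_attention(maps, 'n').shape)    # (4, 4)
print(one_shot_attention(maps, 'nc').shape)   # (12, 12)
print(one_shot_attention(maps, 'wh').shape)   # (36, 36)
```

The three modes only differ in how the feature maps are reshaped before the SoftMax, which is why the variants trade matrix size (and thus cost) against how finely attention is resolved.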
- Various embodiments can be implemented, for example, using one or more computer systems, such as computer system 1000 shown in FIG. 10.
- One or more computer systems 1000 can be used, for example, to implement method 900 of FIG. 9.
- computer system 1000 can enhance images and/or train an artificial neural network model for image enhancement, according to various embodiments.
- Computer system 1000 can be any computer capable of performing the functions described herein.
- Computer system 1000 includes one or more processors (also called central processing units, or CPUs), such as a processor 1004.
- Processor 1004 is connected to a communication infrastructure 1006 (e.g., a bus).
- One or more processors 1004 may each be a graphics processing unit (GPU).
- a GPU is a processor that is a specialized electronic circuit designed to process mathematically intensive applications.
- the GPU may have a parallel structure that is efficient for parallel processing of large blocks of data, such as mathematically intensive data common to computer graphics applications, images, videos, etc.
- Computer system 1000 also includes user input/output device(s) 1003, such as monitors, keyboards, pointing devices, etc., that communicate with communication infrastructure 1006 through user input/output interface(s) 1002.
- Computer system 1000 also includes a main or primary memory 1008, such as random-access memory (RAM).
- Main memory 1008 may include one or more levels of cache.
- Main memory 1008 has stored therein control logic (i.e., computer software) and/or data.
- Computer system 1000 may also include one or more secondary storage devices or memory 1010.
- Secondary memory 1010 may include, for example, a hard disk drive 1012 and/or a removable storage device or a removable storage drive 1014.
- Removable storage drive 1014 may be a floppy disk drive, a magnetic tape drive, a compact disk drive, an optical storage device, a tape backup device, and/or any other storage device/drive.
- Removable storage drive 1014 may interact with a removable storage unit 1018.
- Removable storage unit 1018 includes a computer usable or readable storage device having stored thereon computer software (control logic) and/or data.
- Removable storage unit 1018 may be a floppy disk, magnetic tape, compact disk, DVD, optical storage disk, and/or any other computer data storage device.
- Removable storage drive 1014 reads from and/or writes to removable storage unit 1018 in a well-known manner.
- secondary memory 1010 may include other means, instrumentalities, or other approaches for allowing computer programs and/or other instructions and/or data to be accessed by computer system 1000.
- Such means, instrumentalities or other approaches may include, for example, a removable storage unit 1022 and an interface 1020.
- the removable storage unit 1022 and the interface 1020 may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM or PROM) and associated socket, a memory stick and USB port, a memory card and associated memory card slot, and/or any other removable storage unit and associated interface.
- Computer system 1000 may further include a communication or network interface 1024.
- Communication interface 1024 enables computer system 1000 to communicate and interact with any combination of remote devices, remote networks, remote entities, etc. (individually and collectively referenced as 1028).
- communication interface 1024 may allow computer system 1000 to communicate with remote devices 1028 over communication path 1026, which may be wired and/or wireless, and which may include any combination of LANs, WANs, Internet, etc. Control logic and/or data may be transmitted to and from computer system 1000 via communication path 1026.
- a tangible apparatus or article of manufacture comprising a tangible computer useable or readable medium having control logic (software) stored thereon is also referred to herein as a computer program product or program storage device.
- control logic when executed by one or more data processing devices (such as computer system 1000), causes such data processing devices to operate as described herein.
- the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or encoded as instructions or code on a non- transitory computer-readable medium.
- Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computing device, such as image enhancement system 100 in FIG. 1.
- such computer-readable media can include random-access memory (RAM), read-only memory (ROM), EEPROM, compact disc read-only memory (CD-ROM) or other optical disk storage, hard disk drive (HDD), such as magnetic disk storage or other magnetic storage devices, Flash drive, solid state drive (SSD), or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a processing system, such as a mobile device or a computer.
- Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), and floppy disk, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
- a system for image enhancement includes a memory storing instructions and at least one processor coupled to the memory.
- the at least one processor is configured to, upon executing the instructions: extract a plurality of feature maps from a plurality of frames after motion compensation; generate one or more attention matrices corresponding to the plurality of feature maps; obtain one or more temporal attentions corresponding to the one or more attention matrices; and enhance quality of the frames corresponding to the one or more temporal attentions.
- the plurality of frames include at least a plurality of aligned frames, and the plurality of aligned frames correspond to at least a portion of the frames that are motion compensated; a plurality of feature maps are extracted corresponding to the plurality of frames, and the one or more attention matrices are generated corresponding to the plurality of feature maps.
- the plurality of frames are continuous, and a first attention matrix of the one or more attention matrices corresponding to a first frame of the plurality of frames is used to generate a second attention matrix of the one or more attention matrices corresponding to a second frame of the plurality of frames.
- the second frame immediately follows the first frame.
- a number of channels in each frame of the plurality of frames is C
- each attention matrix of the one or more attention matrices is a C X C matrix, where C is a positive integer and greater than 1.
- a plurality of feature maps are extracted corresponding to the plurality of frames, and one attention matrix is generated corresponding to the plurality of feature maps.
- a number of the plurality of frames is N
- the attention matrix is an N X N matrix, where N is a positive integer and greater than 1.
- the quality of the frames is enhanced by improving resolution of the frames.
- a number of channels in each frame of the plurality of frames is C, a number of the plurality of frames is N, and the attention matrix is an NC X NC matrix, where C and N are positive integers and greater than 1.
- the quality of the plurality of frames is enhanced by denoising the plurality of frames.
- a width of each frame of the plurality of frames is W
- a height of each frame of the plurality of frames is H
- W and H are positive integers and greater than 1.
- the attention matrix is a WH X WH matrix.
- the quality of the plurality of frames is enhanced by de-raining and/or defogging the plurality of frames.
- the system further includes a motion compensation unit configured to compensate and align input frames.
- the system further includes an image enhancement unit configured to improve the quality of the frames based on the obtained one or more temporal attentions.
- a method of image enhancement includes extracting a plurality of feature maps from a plurality of frames after motion compensation; generating one or more attention matrices corresponding to the plurality of feature maps; obtaining one or more temporal attentions corresponding to the one or more attention matrices; and enhancing quality of the frames corresponding to the one or more temporal attentions.
- the plurality of frames include at least a plurality of aligned frames, and the plurality of aligned frames correspond to at least a portion of the frames that are motion compensated, extracting a plurality of feature maps and generating one or more attention matrices includes extracting a plurality of feature maps corresponding to the plurality of frames, and generating the one or more attention matrices corresponding to the plurality of feature maps.
- extracting a plurality of feature maps and generating one or more attention matrices further includes generating a second attention matrix corresponding to a second frame based on a first attention matrix corresponding to a first aligned frame.
- the plurality of frames are continuous, and the second frame is after and next to the first aligned frame.
- a number of channels in each frame of the plurality of frames is C
- each attention matrix of the one or more attention matrices is a C X C matrix, where C is a positive integer and greater than 1.
- extracting a plurality of feature maps and generating one or more attention matrices includes extracting a plurality of feature maps corresponding to the plurality of frames and generating one attention matrix corresponding to the plurality of feature maps.
- a number of the plurality of frames is N
- a number of channels in each frame of the plurality of frames is C
- a width of each frame of the plurality of frames is W
- a height of each frame of the plurality of frames is H
- C, N, H, and W are positive integers and greater than 1
- the attention matrix is an N X N matrix, an NC X NC matrix, or a WH X WH matrix.
- a non-transitory computer-readable medium is provided.
- Computer-readable instructions are stored on the non-transitory computer-readable medium, and a computer system will perform a method of image enhancement when the instructions are executed by the computer system.
- the method includes extracting a plurality of feature maps from a plurality of frames after motion compensation; generating one or more attention matrices corresponding to the plurality of feature maps; obtaining one or more temporal attentions corresponding to the one or more attention matrices; and enhancing quality of the frames corresponding to the one or more temporal attentions.
Abstract
A system and a method for image enhancement are provided. The system includes a memory storing instructions and at least one processor coupled to the memory. The at least one processor is configured to, upon executing the instructions: extract a plurality of feature maps from a plurality of frames after motion compensation; generate one or more attention matrices corresponding to the plurality of feature maps; obtain one or more temporal attentions corresponding to the one or more attention matrices; and enhance quality of the frames corresponding to the one or more temporal attentions.
Description
SYSTEM AND METHOD FOR IMAGE ENHANCEMENT
BACKGROUND
[0001] Embodiments of the present disclosure relate to a system and a method for operating an image signal processor (ISP).
[0002] An image/video capturing device, such as a camera or camera array, can be used to capture an image/video or a picture of a scene. Cameras or camera arrays have been included in many handheld devices, especially since the advent of social media that allows users to upload pictures and videos of themselves, friends, family, pets, or landscapes on the internet with ease and in real-time. Examples of camera components that operate together to capture an image/video include lens(es), image sensor(s), ISP(s), and/or encoders, just to name a few components thereof. The lens, for example, may receive and focus light onto one or more image sensors that are configured to detect photons. When photons impinge on the image sensor, an image signal corresponding to the scene is generated and sent to the ISP. The ISP performs various operations associated with the image signal to generate one or more processed images of the scene that can then be output to a user, stored in memory, or output to the cloud.
SUMMARY
[0003] According to an aspect of the present disclosure, an image enhancement system including a memory and at least one processor coupled to the memory is provided. The processor is configured to extract a plurality of feature maps from a plurality of frames; generate one or more attention matrices corresponding to the plurality of feature maps; obtain one or more temporal attentions corresponding to the one or more attention matrices; and enhance quality of the frames corresponding to the one or more temporal attentions.
[0004] According to another aspect of the present disclosure, a method of image enhancement is provided. The method includes extracting a plurality of feature maps from a plurality of frames; generating one or more attention matrices corresponding to the plurality of feature maps; obtaining one or more temporal attentions corresponding to the one or more attention matrices; and enhancing quality of the frames corresponding to the one or more temporal attentions.
[0005] According to still another aspect of the present disclosure, a non-transitory computer-readable medium is provided. Computer-readable instructions are stored on the non-transitory computer-readable medium, and a computer system will perform a method of image
enhancement when the instructions are executed by the computer system. The method includes extracting a plurality of feature maps from a plurality of frames; generating one or more attention matrices corresponding to the plurality of feature maps; obtaining one or more temporal attentions corresponding to the one or more attention matrices; and enhancing quality of the frames corresponding to the one or more temporal attentions.
[0006] These illustrative embodiments are mentioned not to limit or define the present disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate embodiments of the present disclosure and, together with the description, further explain the principles of the present disclosure and to enable a person skilled in the pertinent art to make and use the present disclosure.
[0008] FIG. 1 illustrates a block diagram of an exemplary image enhancement system, according to some embodiments of the present disclosure.
[0009] FIG. 2 illustrates a block diagram of exemplary implementation of the image enhancement system of FIG. 1, according to some embodiments of the present disclosure.
[0010] FIG. 3 illustrates a block diagram of exemplary temporal attention unit of the image enhancement system of FIG. 2, according to some embodiments of the present disclosure.
[0011] FIG. 4 illustrates a block diagram of another exemplary implementation of the image enhancement system of FIG. 1, according to some embodiments of the present disclosure.
[0012] FIG. 5 A illustrates a block diagram of exemplary temporal attention unit of the image enhancement system of FIG. 4, according to some embodiments of the present disclosure.
[0013] FIG. 5B illustrates a block diagram of exemplary temporal attention unit of the image enhancement system of FIG. 4, according to some embodiments of the present disclosure.
[0014] FIG. 5C illustrates a block diagram of exemplary temporal attention unit of the image enhancement system of FIG. 4, according to some embodiments of the present disclosure.
[0015] FIG. 6 illustrates a block diagram of exemplary motion compensation unit, according to some embodiments of the present disclosure.
[0016] FIG. 7 illustrates a block diagram of exemplary image enhancement unit, according to some embodiments of the present disclosure.
[0017] FIGs. 8A and 8B illustrate two exemplary input frames for image enhancement, according to some embodiments of the present disclosure.
[0018] FIG. 8C illustrates an exemplary frame enhanced by image enhancement system, according to the prior art.
[0019] FIG. 8D illustrates an exemplary frame enhanced by image enhancement system, according to some embodiments of the present disclosure.
[0020] FIG. 9 illustrates a flowchart of an exemplary method of generating an enhanced image, according to some embodiments of the present disclosure.
[0021] FIG. 10 is a block diagram illustrating an example of a computer system useful for implementing various embodiments set forth in the disclosure.
[0022] Embodiments of the present disclosure will be described with reference to the accompanying drawings.
DETAILED DESCRIPTION
[0023] Although specific configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. A person skilled in the pertinent art will recognize that other configurations and arrangements can be used without departing from the spirit and scope of the present disclosure. It will be apparent to a person skilled in the pertinent art that the present disclosure can also be employed in a variety of other applications. [0024] It is noted that references in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” “some embodiments,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it would be within the knowledge of a person skilled in the pertinent art to affect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
[0025] In general, terminology may be understood at least in part from usage in context.
For example, the term “one or more” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense. Similarly, terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the term “corresponding to” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.
[0026] Various aspects of method and apparatus will now be described. These apparatus and methods will be described in the following detailed description and illustrated in the accompanying drawings by various blocks, modules, units, components, circuits, steps, operations, processes, algorithms, etc. (collectively referred to as “elements”). These elements may be implemented using electronic hardware, firmware, computer software, or any combination thereof. Whether such elements are implemented as hardware, firmware, or software depends upon the particular application and design constraints imposed on the overall system.
[0027] For ease of nomenclature, the term “camera” is used herein to refer to an image capture device or other data acquisition device. Such a data acquisition device can be any device or system for acquiring, recording, measuring, estimating, determining and/or computing data representative of a scene, including but not limited to two-dimensional image data, three-dimensional image data, and/or light field data. Such a data acquisition device may include optics, sensors, and image processing electronics for acquiring data representative of a scene, using techniques that are well known in the art. One skilled in the art will recognize that many types of data acquisition devices can be used in connection with the present disclosure, and that the present disclosure is not limited to cameras. Thus, the use of the term “camera” herein is intended to be illustrative and exemplary, but should not be considered to limit the scope of the present disclosure. Specifically, any use of such term herein should be considered to refer to any suitable data acquisition device.
[0028] Image enhancement is one of the most important computer vision applications. Image enhancement units are generally deployed in all cameras, such as mobile phones or digital cameras. Image enhancement is a challenging problem since it consists of multiple units that perform various operations. These conventional units may include, e.g., a super resolution unit, a denoising unit, and/or a high dynamic range (HDR) unit. Recently, deep neural networks (DNNs) have been widely deployed in image enhancement as they demonstrate significant
accuracy improvements.
[0029] Conventional image enhancement models may suffer from motion blur since motion consistency is not checked after motion compensation. If the motion is very large, or a global motion has different patterns from the local motion, standard motion compensation might fail, which results in poor performance on the final output enhanced image. Thus, there is an unmet need for an image enhancement model that generates an enhanced image with reduced color distortion and inconsistency.
[0030] To overcome these and other challenges, the present disclosure provides a learning-based image enhancement system with a temporal attention unit that checks the consistency between different input frames. Different from existing methods that utilize an attention mechanism for a single frame only, the present disclosure utilizes a temporal attention unit to evaluate the consistency between different input frames (e.g., whether the first input frame is consistent with the second frame after motion compensation). Two different designs of the temporal attention unit, a sequential architecture and a parallel architecture, are provided in the present disclosure. The sequential architecture calculates the attention matrix within each frame and then checks the consistency between the attention matrices of different frames in a temporal order. The parallel architecture directly calculates a temporal attention matrix among different frames, where the size of the attention matrix is related to the parameters of the input frames, for example, the number of input frames, the number of channels in each input frame, and the width and height of each input frame. As temporal attention is employed on the input frames and the consistency of attention between different input frames is considered, the present disclosure can reduce motion blur efficiently.
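As a rough illustration of the two designs, the NumPy sketch below contrasts a sequential per-frame attention (each C X C matrix blended with its predecessor as a stand-in for the learned consistency check) with a parallel one-shot attention over all frames. The dot-product similarity, the scaling, and the simple averaging are assumptions, not the disclosed implementation.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise SoftMax."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def sequential_temporal_attention(frames_feat):
    """Sequential design: a C x C attention matrix per frame, each blended
    with the previous frame's matrix (simple averaging stands in for the
    learned consistency check)."""
    atts, prev = [], None
    for f in frames_feat:                      # f: (C, H, W)
        flat = f.reshape(f.shape[0], -1)       # (C, H*W)
        att = softmax_rows(flat @ flat.T / np.sqrt(flat.shape[1]))
        if prev is not None:
            att = (att + prev) / 2             # rows still sum to 1
        atts.append(att)
        prev = att
    return atts                                # N matrices, each C x C

def parallel_temporal_attention(frames_feat):
    """Parallel design: one temporal attention matrix over all N frames."""
    flat = frames_feat.reshape(frames_feat.shape[0], -1)
    return softmax_rows(flat @ flat.T / np.sqrt(flat.shape[1]))  # N x N

feats = np.random.default_rng(0).standard_normal((5, 4, 8, 8))  # N=5, C=4
print(len(sequential_temporal_attention(feats)))   # 5
print(parallel_temporal_attention(feats).shape)    # (5, 5)
```

The sketch highlights the structural difference: the sequential path produces N per-frame matrices linked in temporal order, while the parallel path produces a single matrix whose size grows with the chosen granularity.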
[0031] Referring to FIG. 1, image enhancement system 100 is provided in an implementation of the present disclosure. Image enhancement is the process of adjusting digital images so that the results are more suitable for display or further image analysis. Image enhancement improves the interpretability or perception of information in images for human viewers and provides "better" input for other automated image processing techniques. For example, noise can be removed, and an image can be sharpened or brightened, making it easier to identify key features. Image enhancement system 100 is configured to enhance the quality of input images by, for example, denoising, super resolution, color enhancement, demosaicing, de-raining, defogging, etc. Image enhancement system 100, as shown in FIG. 1, includes a motion compensation unit 120, a temporal attention unit 140, and an image enhancement unit 160.
[0032] Motion compensation unit 120 is configured to align original input frames. Most imaging systems are subject to mechanical disturbances. They move because they are held by a person or mounted on a moving vehicle. The structures they are mounted to are subject to mechanical vibration. If the angular motion of the camera over an integration time is comparable to the instantaneous field-of-view of a pixel, then the image will be smeared. In general, image motion compensation refers to active control of something (optical element position, focal plane position, index of refraction, surface curvatures, etc.) to stabilize the object space line-of-sight (LOS) of the focal plane array (FPA). The goal is to compensate for unwanted motions of the camera, which is referred to as "aligning" the input frames.
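A toy example of what "aligning" means in practice: the sketch below estimates a global integer displacement by exhaustive search over a small window and then undoes it. Real motion compensation units use optical flow or block matching; the search window, error metric, and function name here are illustrative assumptions.

```python
import numpy as np

def align_frame(frame, reference):
    """Toy global motion compensation: find the integer (dy, dx) shift of
    `frame` that best matches `reference`, then apply it. Circular np.roll
    is used for simplicity; a real system would handle borders properly."""
    best, best_err = (0, 0), np.inf
    for dy in range(-3, 4):
        for dx in range(-3, 4):
            err = np.abs(np.roll(frame, (dy, dx), axis=(0, 1)) - reference).mean()
            if err < best_err:
                best, best_err = (dy, dx), err
    return np.roll(frame, best, axis=(0, 1)), best

ref = np.zeros((16, 16)); ref[4:8, 4:8] = 1.0     # reference frame
moved = np.roll(ref, (2, -1), axis=(0, 1))        # camera motion shifted the scene
aligned, shift = align_frame(moved, ref)
print(shift)  # (-2, 1) undoes the (2, -1) displacement
```

After alignment, the remaining differences between frames come from scene content rather than camera motion, which is what the temporal attention unit then examines for consistency.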
[0033] Referring to FIG. 2, an implementation of the image enhancement system 200 is provided according to some embodiments of the present disclosure. Image enhancement system 200 is an example of image enhancement system 100 shown in FIG. 1, in which a temporal attention unit 142 employing the sequential architecture is described in detail. A plurality of aligned images processed by motion compensation unit 120 are input into temporal attention unit 142. The plurality of aligned images correspond to a plurality of frames input into image enhancement system 200. The plurality of aligned images will be further processed by image enhancement unit 160 to improve the quality of the image as described above.
[0034] After the frames are input into temporal attention unit 142, a plurality of feature maps are extracted from corresponding frames. The number of the plurality of the input frames is N, and the input frames are numbered from 1 to N to ease description, where N is a positive integer and greater than one, which means at least two frames are processed by image enhancement system 200 at one time. The feature maps are numbered accordingly. Each of the N feature maps is generated by applying filters or feature extractors to the corresponding aligned frame.
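Feature extraction by "applying filters" can be sketched as a valid-mode 2D cross-correlation of an aligned frame with a small filter bank. The Sobel kernels below are a conventional example of edge extractors, not the filters the disclosure actually uses or learns.

```python
import numpy as np

def extract_feature_map(frame, filters):
    """Valid-mode 2D cross-correlation of one aligned frame (H, W) with a
    bank of K filters (K, kh, kw), producing a feature map with K channels."""
    k, kh, kw = filters.shape
    h, w = frame.shape
    out = np.zeros((k, h - kh + 1, w - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = frame[i:i + kh, j:j + kw]
            out[:, i, j] = (filters * patch).sum(axis=(1, 2))
    return out

# Edge-detector filter bank (vertical and horizontal Sobel kernels).
sobel = np.array([[[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],
                  [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]], dtype=float)
frame = np.random.default_rng(0).standard_normal((8, 8))
fmap = extract_feature_map(frame, sobel)
print(fmap.shape)  # (2, 6, 6)
```

In a deployed system this step is a learned convolutional layer; the hand-written kernels merely make concrete what "filters or feature extractors" produce per frame.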
[0035] N attention matrices are then generated corresponding to the N feature maps, and the N attention matrices correspond to the N feature maps one by one. Attention matrices are scalar matrices representing the relative importance of layer activations at different two- dimensional (2D) spatial locations with respect to a target task, i.e., an attention matrix is a grid of numbers that indicates what 2D locations are important for a task. Important locations corresponding to big numbers are usually depicted in red in a heat map. In the present implementation, the N frames are continuous, and a first attention matrix corresponding to a first aligned input frame is used to generate a second attention matrix corresponding to a second aligned input frame.
[0036] Referring to FIG. 2 and FIG. 3, FIG. 3 shows a detailed process for generating a temporal attention. Take frame 1 of the N input frames as an example. By applying filters or feature extractors to frame 1, a feature map 1 is extracted from frame 1. Feature extractors or filters help identify different features present in an image, such as edges, vertical lines, horizontal lines, bends, etc.
[0037] An attention matrix 1 is then generated corresponding to feature map 1. Attention matrix 1 may be a soft attention matrix. Soft attention uses “soft shading” to focus on regions. Soft attention can be learned using ordinary backpropagation/gradient descent (the same methods that are used to learn the weights of a neural network model). Soft attention maps typically contain decimals between 0 and 1. Hard attention uses image cropping to focus on regions. It cannot be trained using gradient descent because there is no derivative for the procedure “crop the image here.” Techniques like reinforcement learning can be used to train hard attention mechanisms. Hard attention maps consist entirely of 0s and 1s, with nothing in between; 1 corresponds to a pixel that is kept, and 0 corresponds to a pixel that is cropped out. Hard attention is not a good choice in a DNN since it is not flexible. In most scenarios, hard attention will reduce the accuracy of image enhancement. Thus, in the present implementation, attention matrix 1 is a soft attention matrix generated from feature map 1 by SoftMax. The number of channels in each frame of the N frames is C, and each attention matrix of the plurality of attention matrices is a C X C matrix.
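A minimal sketch of generating a C X C soft attention matrix from one feature map with SoftMax is given below. The use of channel inner products (scaled by the spatial size) as the affinity function is an assumption of the sketch; the disclosure only requires that the matrix be soft (values between 0 and 1) and of size C X C.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable SoftMax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_soft_attention(feature_map):
    """C x C soft attention matrix from one (C, H, W) feature map.

    Channel affinities are taken as inner products of flattened channels,
    scaled by the spatial size (one plausible choice; the affinity function
    is not fixed by the disclosure), then normalized row-wise with SoftMax
    so each row sums to 1.
    """
    c = feature_map.shape[0]
    flat = feature_map.reshape(c, -1)            # (C, H*W)
    affinity = flat @ flat.T / flat.shape[1]     # (C, C) channel correlations
    return softmax(affinity, axis=1)             # soft values in (0, 1)

rng = np.random.default_rng(0)
fmap = rng.standard_normal((4, 6, 6))            # C=4 channels, 6x6 spatial
att = channel_soft_attention(fmap)
```

Because SoftMax is used, every entry lies strictly between 0 and 1 and each row sums to 1, which is exactly the “soft shading” property contrasted with hard attention above.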
[0038] In the present implementation, a first attention matrix corresponding to a first frame is used to generate a second attention matrix corresponding to a second aligned frame, and the second frame is after and next to the first aligned frame. That is, attention matrix i is used to form attention matrix i + 1, as shown in FIG. 3. In the sequential structure, the input frames to be compensated are sorted in a sequential order. The attention matrix of the previous frame is utilized as an input when calculating the attention matrix of the next frame. Consistency between the attention matrix of the previous frame and the attention matrix of the next frame is checked herein, as shown in FIG. 3. The consistency check can be implemented by, but is not limited to, feature map correlation, feature map concatenation, convolutional layers, and so on. After generating all the N attention matrices for frame 1 to frame N, the N attention matrices are fed together with the N frames (i.e., motion-compensated frames) into the standard image enhancement network for quality improvement.
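The sequential architecture above can be sketched as follows. The blending of the previous attention matrix into the next frame's logits is a hypothetical stand-in for the learned consistency check (feature map correlation, concatenation, or convolutional layers); the `mix` weight and affinity function are assumptions of the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sequential_attention(feature_maps, mix=0.5):
    """Sequential temporal attention: matrix i feeds the computation of matrix i+1.

    feature_maps: list of (C, H, W) arrays for frames 1..N.
    mix: blending weight for the previous attention matrix -- a stand-in
         for the consistency check; the exact operator is learned in
         practice and is not specified here.
    Returns N soft C x C attention matrices.
    """
    matrices = []
    prev = None
    for fmap in feature_maps:
        c = fmap.shape[0]
        flat = fmap.reshape(c, -1)
        logits = flat @ flat.T / flat.shape[1]
        if prev is not None:
            # Consistency step: bias this frame's logits toward the previous
            # frame's attention so adjacent matrices stay coherent.
            logits = (1 - mix) * logits + mix * prev
        att = softmax(logits, axis=1)
        matrices.append(att)
        prev = att
    return matrices

rng = np.random.default_rng(1)
maps = [rng.standard_normal((3, 4, 4)) for _ in range(5)]  # N=5 frames, C=3
atts = sequential_attention(maps)
```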
[0039] Referring to FIG. 4, an implementation of an image enhancement system 400 is provided according to some embodiments of the present disclosure. Image enhancement
system 400 is an example of image enhancement system 100 shown in FIG. 1, in which a temporal attention unit 144 employing the parallel architecture is described in detail. A plurality of aligned images processed by motion compensation unit 120 are input into temporal attention unit 144. The plurality of aligned images corresponds to a plurality of frames input into image enhancement system 400. The plurality of aligned images will be further processed by image enhancement unit 160 to improve the quality of the image as described above.
[0040] After the frames are input into temporal attention unit 144, a plurality of feature maps are extracted from corresponding frames. The number of the plurality of the input frames is N, and the input frames are numbered from 1 to N to ease description, where N is a positive integer and greater than one. That is, at least two frames are processed by image enhancement system 400 at one time. The feature maps are numbered accordingly. Each of the N feature maps is generated by applying filters or feature extractors to the corresponding aligned frame. [0041] A one-shot attention matrix is then generated corresponding to the N feature maps, and the one-shot attention matrix corresponds to the N feature maps. The one-shot attention matrix is a scalar matrix representing the relative importance of layer activations at different two-dimensional (2D) spatial locations with respect to a target task, i.e., the one-shot attention matrix is a grid of numbers that indicates hybrid spatial and temporal information for a task. The one-shot attention matrix can provide information like “which spatial location” plus “which temporal time step” is more important; for example, the top-left of the third frame is more important than the bottom-right of the second frame. In the present implementation, N frames are input into and are compensated by image enhancement system 400 at one time. Referring to FIG. 4, feature extractors or filters are used to identify different features present in an image, like edges, vertical lines, horizontal lines, bends, etc. As shown in FIG. 4, by reshaping and transposing the frames, N feature maps are extracted correspondingly.
[0042] In the present implementation, the number of channels in each frame of the plurality of frames is C, the number of the plurality of frames is N, the width of each frame of the plurality of frames is W, and the height of each frame of the plurality of frames is H. Each of the feature maps has a dimension of C x H x W, where C, N, H, and W are positive integers and greater than 1. Corresponding to the following figures, three types of the one-shot attention matrix can be generated.
[0043] Referring to FIG. 5A, an implementation including a process for generating a one-shot attention matrix having a dimension of N X N is shown. As described above, the one- shot attention matrix may be a soft attention matrix. In the present implementation, the one-
shot attention matrix is a soft attention matrix generated from the N feature maps by SoftMax. In the present implementation, the one-shot attention matrix is an N X N matrix. In practice, considering the quality and efficiency of image enhancement, N is usually a small number, for example, 3, 5, 7, 8, 10, 15, etc., and the N X N matrix is a relatively small matrix for image enhancement system 400. Thus, the efficiency of image enhancement system 400 in the present implementation is relatively high. The present implementation is suitable for a situation in which the quality of the original frames is acceptable, and it is easy to improve the quality of the input frames by image enhancement system 400. The quality of the frames can be enhanced by improving the resolution of the frames. For example, the resolution of the original input frames is 720P, and the resolution of the target output video is 4K or Blu-ray. Since the size of the input frames is big, it is necessary to generate a small attention matrix to balance the efficiency and the speed of image enhancement.
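The N X N variant of FIG. 5A can be sketched as follows. The frame-versus-frame inner-product affinity is an assumption of the sketch; the disclosure only specifies that an N X N soft attention matrix is produced from the N feature maps by SoftMax.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def one_shot_attention_nxn(feature_maps):
    """N x N one-shot temporal attention over N stacked feature maps.

    feature_maps: (N, C, H, W). Each frame's feature map is flattened to a
    single vector, so the matrix only weighs whole frames against each
    other -- the small, cheap variant suited to inputs whose quality is
    already acceptable.
    """
    n = feature_maps.shape[0]
    flat = feature_maps.reshape(n, -1)          # (N, C*H*W)
    logits = flat @ flat.T / flat.shape[1]      # frame-vs-frame affinity
    return softmax(logits, axis=1)              # (N, N), rows sum to 1

rng = np.random.default_rng(2)
stack = rng.standard_normal((5, 3, 8, 8))       # N=5 frames, C=3 channels, 8x8
att = one_shot_attention_nxn(stack)
```

For the small N values mentioned above, this matrix is tiny regardless of frame resolution, which is why the variant is efficient.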
[0044] Referring to FIG. 5B, an implementation including a process for generating a one-shot attention matrix having a dimension of CN X CN is shown. As described above, the one-shot attention matrix may be a soft attention matrix. In the present implementation, the one-shot attention matrix is a soft attention matrix generated from the N feature maps by SoftMax. In the present implementation, the one-shot attention matrix is a CN X CN matrix. In practice, considering the quality and efficiency of image enhancement, N is usually a small number, for example, 3, 5, 7, 8, 10, 15, etc., and a common value of C is 32. The CN X CN matrix is a relatively big matrix for image enhancement system 400 compared with the N X N matrix described above. Employing the CN X CN matrix costs more time for image enhancement system 400 compared with employing the N X N matrix. The present implementation is suitable for a situation in which the quality of the original frames is poor, and it is complex to improve the quality of the input frames by image enhancement system 400. The quality of the frames can be enhanced by denoising the frames through complex algorithms. The priority of high quality of the enhanced frames is higher than the priority of the efficiency of image enhancement system 400. For example, the resolution of the original input frames is 360P, and the resolution of the target output video is 1080P or higher. Since the size of the input frames is small and the quality of the input frames is poor, it is necessary to generate a big attention matrix to balance the quality and the speed of image enhancement.
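The CN X CN variant of FIG. 5B can be sketched in the same style; the channel-versus-channel affinity function below is an assumption of the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def one_shot_attention_cnxcn(feature_maps):
    """(CN) x (CN) one-shot attention: every channel of every frame attends
    to every channel of every other frame -- the heavier variant for
    low-quality inputs.  feature_maps: (N, C, H, W)."""
    n, c, h, w = feature_maps.shape
    flat = feature_maps.reshape(n * c, h * w)   # (CN, H*W): one row per channel
    logits = flat @ flat.T / (h * w)
    return softmax(logits, axis=1)              # (CN, CN)

rng = np.random.default_rng(3)
stack = rng.standard_normal((5, 3, 8, 8))       # N=5, C=3, so CN=15
att = one_shot_attention_cnxcn(stack)
```

The quadratic growth in C X N makes the cost trade-off described above explicit: the matrix has (CN)^2 entries versus N^2 for the FIG. 5A variant.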
[0045] Referring to FIG. 5C, an implementation including a process for generating a one-shot attention matrix having a dimension of WH X WH is shown. As described above, the one-shot attention matrix may be a soft attention matrix. In the present implementation, the
one-shot attention matrix is a soft attention matrix generated from the N feature maps by SoftMax. In the present implementation, the one-shot attention matrix is a WH X WH matrix. The WH X WH matrix is a relatively big matrix for image enhancement system 400 compared with the N X N matrix and the CN X CN matrix described above. The WH X WH matrix is not relevant to the sequence of the frames, which means the WH X WH matrix does not focus on the motion of the elements in the background in each frame. For example, in an extreme situation, like frames with a rainy or foggy background, the noise of the background is too strong, and it is meaningless to focus on the motion of the elements in the background. Employing the WH X WH matrix helps image enhancement focus on the subjects of the frames and avoid investing too many resources in the background, to balance the efficiency and the speed of image enhancement. The quality of the frames can be enhanced by de-raining and/or defogging the frames through complex algorithms.
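The WH X WH variant of FIG. 5C can be sketched as follows. Folding frames and channels into each pixel location's descriptor, and using inner products as the affinity, are assumptions of the sketch; the key property illustrated is that the resulting matrix relates spatial positions irrespective of frame order.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def one_shot_attention_whxwh(feature_maps):
    """(WH) x (WH) one-shot spatial attention.

    feature_maps: (N, C, H, W). Frames and channels are folded into the
    descriptor of each pixel location, so the matrix compares spatial
    positions without regard to frame order -- useful when background
    motion (rain, fog) is pure noise and only the subject matters.
    """
    n, c, h, w = feature_maps.shape
    flat = feature_maps.transpose(2, 3, 0, 1).reshape(h * w, n * c)  # (HW, NC)
    logits = flat @ flat.T / (n * c)
    return softmax(logits, axis=1)              # (HW, HW)

rng = np.random.default_rng(4)
stack = rng.standard_normal((5, 3, 8, 8))       # H=W=8, so HW=64
att = one_shot_attention_whxwh(stack)
```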
[0046] Motion compensation unit 120 is configured to generate the frames for the temporal attention unit by compensating and aligning input frames. A schematic of the design of motion compensation unit 120 is shown in FIG. 6, and the flow estimation modules are detailed in Table 1. Motion compensation unit 120 is implemented by a spatial transformer, as shown in FIG. 6. Assuming two frames with different exposure are provided as Ft and Ft+1, a compensated frame F't+1 can be obtained through the architecture shown in FIG. 6, where two flow estimation modules are used in the two-stage image warping process.
[0047] A x4 coarse estimate of the flow is obtained by early fusing the two input frames and downscaling the spatial dimensions with x2 strided convolutions. The estimated flow is upscaled with sub-pixel convolution, and the resulting coarse flow Δc t+1 is applied to warp the target frame, producing F'c t+1. The warped image is then processed together with the coarse flow and the original images through a fine flow estimation module. This uses a single convolution with stride 2 and a final x2 upscaling stage to obtain a finer flow map Δf t+1. The final motion-compensated frame is obtained by warping the target frame with the total flow: F't+1 = warp(Ft+1, Δc t+1 + Δf t+1). As shown in Table 1, k is kernel size, n is the number of filters, s is stride, and relu and tanh are activations. Output activations use tanh to represent pixel displacement in normalized space, such that a displacement of ±1 means maximum displacement from the center to the border of the image, and ±1 means the next/previous frame. The spatial transformer module is advantageous relative to other motion compensation mechanisms as it is straightforward to combine with an SR network to perform joint motion compensation and video SR.
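The two-stage warping can be illustrated with the following sketch. The learned coarse and fine flow estimation modules are replaced by fixed, hypothetical flow fields; only the backward-warping (spatial transformer) step is shown, assuming NumPy and bilinear sampling.

```python
import numpy as np

def warp(frame, flow):
    """Backward-warp a (H, W) frame with a per-pixel flow field (2, H, W)
    using bilinear sampling -- the spatial-transformer step.  flow[0] is
    the vertical offset, flow[1] the horizontal offset, in pixels."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    sy = np.clip(ys + flow[0], 0, h - 1)            # sampling coordinates,
    sx = np.clip(xs + flow[1], 0, w - 1)            # clipped at the borders
    y0, x0 = np.floor(sy).astype(int), np.floor(sx).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = sy - y0, sx - x0
    return ((1 - wy) * (1 - wx) * frame[y0, x0] + (1 - wy) * wx * frame[y0, x1]
            + wy * (1 - wx) * frame[y1, x0] + wy * wx * frame[y1, x1])

# Two-stage compensation with stand-in flows: a coarse one-pixel horizontal
# shift plus a (here zero) fine refinement.
frame = np.arange(36.0).reshape(6, 6)
coarse = np.zeros((2, 6, 6)); coarse[1] += 1.0      # coarse flow: shift x by 1
fine = np.zeros((2, 6, 6))                          # fine flow: refinement
compensated = warp(frame, coarse + fine)            # F't+1 = warp(Ft+1, coarse + fine)
```

Because sampling is bilinear and differentiable in the flow, the module can be trained jointly with the enhancement network, which is the advantage noted above.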
TABLE 1
[0048] Image enhancement unit 160 is configured to improve the quality of the frames corresponding to the obtained temporal attention. FIG. 7 illustrates a block diagram of exemplary image enhancement unit 160 according to an embodiment of the present disclosure. A residual-in-residual dense block (RRDB) structure is used by image enhancement unit 160 in the present implementation. RRDB blocks without batch normalization (BN) layers are set as the basic network building unit, which combines a multi-level residual network and dense connections as depicted in FIG. 7. Removing BN layers has proven to increase performance and reduce computational complexity in different tasks, including SR and deblurring. The RRDB blocks, i.e., the dense blocks in FIG. 7, are chained together to output the final enhanced frames. The proposed RRDB employs a deeper and more complex structure than the original residual block (RB) in super-resolution generative adversarial networks (SRGAN). Specifically, as shown in FIG. 7, the proposed RRDB has a residual-in-residual structure, where residual learning is used at different levels. A similar network structure may also apply, such as a multilevel residual network.
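The residual-in-residual structure can be sketched as follows. The 1x1 channel-mixing convolutions, channel counts, and residual scaling factor beta are illustrative assumptions standing in for the full 3x3 convolutional dense blocks of FIG. 7; the sketch shows only the dense connections and the two levels of residual learning (and, per the text, no BN layers).

```python
import numpy as np

def conv1x1(x, w):
    # 1x1 convolution: mixes channels at each spatial location (no BN layers).
    # w: (out_channels, in_channels), x: (in_channels, H, W).
    return np.einsum('oc,chw->ohw', w, x)

def dense_block(x, w1, w2, w_fuse, beta=0.2):
    """One dense block: later convs see all earlier features (dense
    connections); the fused result is scaled by beta and added residually."""
    f1 = np.maximum(conv1x1(x, w1), 0)                          # conv + ReLU
    f2 = np.maximum(conv1x1(np.concatenate([x, f1]), w2), 0)
    fused = conv1x1(np.concatenate([x, f1, f2]), w_fuse)
    return x + beta * fused                                     # inner residual

def rrdb(x, params, beta=0.2):
    """Residual-in-residual: a chain of dense blocks wrapped in a second,
    outer residual connection -- residual learning at two levels."""
    y = x
    for w1, w2, w_fuse in params:
        y = dense_block(y, w1, w2, w_fuse, beta)
    return x + beta * (y - x)                                   # outer residual

rng = np.random.default_rng(5)
c = 4
params = [(rng.standard_normal((c, c)) * 0.1,
           rng.standard_normal((c, 2 * c)) * 0.1,
           rng.standard_normal((c, 3 * c)) * 0.1) for _ in range(3)]
x = rng.standard_normal((c, 5, 5))
out = rrdb(x, params)
```

The residual scaling keeps the block's output close to its input at initialization, a common stabilizing choice for deep residual-in-residual stacks.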
[0049] The present disclosure can be used in a plurality of image enhancement systems, including but not limited to a super resolution (SR) system, a denoising system, and/or a high dynamic range (HDR) system. The above implementations are for illustrative purposes only and should not be interpreted as a limitation of the present disclosure. Regardless of the type of the image enhancement system, the temporal attention unit has good compatibility with the system by considering and checking the consistency between the temporal attention of the previous frame and the temporal attention of the next frame. By reducing the blur and color distortion caused by motion, the present disclosure improves the quality of the output frames enhanced by the image enhancement system.
[0050] The advantages of the present disclosure are clearly illustrated in FIGs. 8A to 8D. FIGs. 8A and 8B are two exemplary input frames for image enhancement, according to some embodiments of the present disclosure. FIG. 8C illustrates an exemplary frame enhanced by an image enhancement system according to the prior art. FIG. 8D illustrates an exemplary frame enhanced by the image enhancement system of the present disclosure. The enhanced frame in FIG. 8D has sharper borders and clearer color contrast compared to the enhanced frame in FIG. 8C; the quality of the frame enhanced by the image enhancement system of the present disclosure is thus greatly improved compared to the prior art.
[0051] According to one aspect of the present disclosure, an image enhancement system is provided. The system includes a memory and at least one processor coupled to the memory. The processor is configured to extract a plurality of feature maps from a plurality of frames after motion compensation; generate one or more attention matrices corresponding to the plurality of feature maps; obtain one or more temporal attentions corresponding to the one or more attention matrices; and enhance quality of the frames corresponding to the one or more temporal attentions. Detailed implementations of the image enhancement system are similar to the implementations described above and will not be repeated here.
[0052] FIG. 9 illustrates a flowchart of an exemplary method 900 of generating an enhanced image according to some embodiments of the present disclosure. Exemplary method 900 may be performed by an apparatus of image signal processing, e.g., such as image enhancement system 200 or 400. Method 900 may include operations 902-908 as described below. It is to be appreciated that some of the operations may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 9.
[0053] Referring to FIG. 9, at 902, a plurality of feature maps may be extracted from a plurality of frames after motion compensation. Motion compensation aims to align original input frames. Most imaging systems are subject to mechanical disturbances. They move because they are held by a person or mounted on a moving vehicle. The structures they are mounted to are subject to mechanical vibration. If the angular motion of the camera over an integration time is comparable to the instantaneous field-of-view of a pixel, then the image will be smeared. The goal is to compensate for unwanted motions of the camera, which is referred to herein as “aligning” the input frames. A plurality of aligned images are thereby prepared, from which the compensated frames are generated. A number of the plurality of the input frames is N, and the input frames are numbered from 1 to N to ease description, where N is a positive integer and greater than one. That is, at least two frames are processed at one time. The feature maps are
numbered accordingly. Each of the N feature maps is generated by applying filters or feature extractors to the corresponding aligned frame.
[0054] At 904, one or more attention matrices are generated corresponding to the plurality of feature maps. Attention matrices are scalar matrices representing the relative importance of layer activations at different 2D spatial locations with respect to a target task, i.e., an attention matrix is a grid of numbers that indicates which 2D locations are important for a task. Important locations, corresponding to large values, are usually depicted in red in a heat map.
[0055] In a first situation, N attention matrices are then generated corresponding to the N feature maps, and the N attention matrices correspond to the N feature maps one by one. In the present implementation, the N frames are continuous, and a first attention matrix corresponding to a first aligned input frame is used to generate a second attention matrix corresponding to a second aligned input frame, and the second frame is after and next to the first aligned frame. The number of channels in each frame of the plurality of frames is C, and each attention matrix of the plurality of attention matrices is a C X C matrix, where C is a positive integer and greater than 1.
[0056] In a second situation, a one-shot attention matrix is then generated corresponding to the N feature maps, and the one-shot attention matrix corresponds to the N feature maps. The one-shot attention matrix is a scalar matrix representing the relative importance of layer activations at different 2D spatial locations with respect to a target task, i.e., the one-shot attention matrix is a grid of numbers that indicates which 2D locations are important for a task. In the present implementation, N frames are compensated at one time. In the present implementation, the number of channels in each frame of the plurality of frames is C, the number of the plurality of frames is N, the width of each frame of the plurality of frames is W, and the height of each frame of the plurality of frames is H. Each of the feature maps has a dimension of C x H x W, where C, N, H, and W are positive integers and greater than 1. Corresponding to the following figures, three types of the one-shot attention matrix can be generated.
[0057] In a first situation, referring to FIG. 5 A, an implementation including a process for generating a one-shot attention matrix having a dimension of N X N is shown. As described above, the one-shot attention matrix may be a soft attention matrix. In the present implementation, the one-shot attention matrix is a soft attention matrix generated from the N feature maps by SoftMax. In the present implementation, the one-shot attention matrix is an
N X N matrix. In practice, considering the quality and efficiency of image enhancement, N is usually a small number, for example, 5, 8, 10, 15, 20, 25, etc., and the N X N matrix is a relatively small matrix for image enhancement system 400. Thus, the efficiency of image enhancement system 400 in the present implementation is relatively high. The present implementation is suitable for a situation in which the quality of the original frames is acceptable, and it is easy to improve the quality of the input frames by image enhancement system 400. The quality of the frames can be enhanced by improving the resolution of the frames. For example, the resolution of the original input frames is 1080P, and the resolution of the target output video is 4K or Blu-ray. Since the size of the input frames is big, it is necessary to generate a small attention matrix to balance the efficiency and the speed of image enhancement.
[0058] In a second situation, referring to FIG. 5B, an implementation including a process for generating a one-shot attention matrix having a dimension of CN X CN is shown. As described above, the one-shot attention matrix may be a soft attention matrix. In the present implementation, the one-shot attention matrix is a soft attention matrix generated from the N feature maps by SoftMax. In the present implementation, the one-shot attention matrix is a CN X CN matrix. In practice, considering the quality and efficiency of image enhancement, N is usually a small number, for example, 5, 8, 10, 15, 20, 25, etc., and a common value of C is 32. The CN X CN matrix is a relatively big matrix for image enhancement system 400 compared with the N X N matrix described above. Employing the CN X CN matrix costs more time for image enhancement system 400 compared with employing the N X N matrix. The present implementation is suitable for a situation in which the quality of the original frames is poor, and it is complex to improve the quality of the input frames by image enhancement system 400. The quality of the frames can be enhanced by denoising the frames through complex algorithms. The priority of high quality of the enhanced frames is higher than the priority of the efficiency of image enhancement system 400. For example, the resolution of the original input frames is 360P, and the resolution of the target output video is 1080P or higher. Since the size of the input frames is small and the quality of the input frames is poor, it is necessary to generate a big attention matrix to balance the quality and the speed of image enhancement.
[0059] In a third situation, referring to FIG. 5C, an implementation including a process for generating a one-shot attention matrix having a dimension of WH X WH is shown. As described above, the one-shot attention matrix may be a soft attention matrix. In the present
implementation, the one-shot attention matrix is a soft attention matrix generated from the N feature maps by SoftMax. In the present implementation, the one-shot attention matrix is a WH X WH matrix. The WH X WH matrix is a relatively big matrix for image enhancement system 400 compared with the N X N matrix and the CN X CN matrix described above. The WH X WH matrix is not relevant to the sequence of the frames, which means the WH X WH matrix does not focus on the motion of the elements in the background in each frame. For example, in an extreme situation, like frames with a rainy or foggy background, the noise of the background is too strong, and it is meaningless to focus on the motion of the elements in the background. Employing the WH X WH matrix helps image enhancement focus on the subjects of the frames and avoid investing too many resources in the background, to balance the efficiency and the speed of image enhancement. The quality of the frames can be enhanced by de-raining and/or defogging the frames through complex algorithms.
[0060] Various embodiments can be implemented, for example, using one or more computer systems, such as computer system 1000 shown in FIG. 10. One or more computer systems 1000 can be used, for example, to implement method 900 of FIG. 9. For example, computer system 1000 can perform image enhancement and/or train an artificial neural network model for image enhancement, according to various embodiments. Computer system 1000 can be any computer capable of performing the functions described herein.
[0061] Computer system 1000 includes one or more processors (also called central processing units, or CPUs), such as a processor 1004. Processor 1004 is connected to a communication infrastructure 1006 (e.g., a bus). One or more processors 1004 may each be a graphics processing unit (GPU). In an embodiment, a GPU is a processor that is a specialized electronic circuit designed to process mathematically intensive applications. The GPU may have a parallel structure that is efficient for parallel processing of large blocks of data, such as mathematically intensive data common to computer graphics applications, images, videos, etc.
[0062] Computer system 1000 also includes user input/output device(s) 1003, such as monitors, keyboards, pointing devices, etc., that communicate with communication infrastructure 1006 through user input/output interface(s) 1002.
[0063] Computer system 1000 also includes a main or primary memory 1008, such as random-access memory (RAM). Main memory 1008 may include one or more levels of cache.
Main memory 1008 has stored therein control logic (i.e., computer software) and/or data. Computer system 1000 may also include one or more secondary storage devices or memory 1010. Secondary memory 1010 may include, for example, a hard disk drive 1012 and/or a removable storage device or a removable storage drive 1014. Removable storage drive 1014 may be a floppy disk drive, a magnetic tape drive, a compact disk drive, an optical storage device, a tape backup device, and/or any other storage device/drive. Removable storage drive 1014 may interact with a removable storage unit 1018. Removable storage unit 1018 includes a computer usable or readable storage device having stored thereon computer software (control logic) and/or data. Removable storage unit 1018 may be a floppy disk, magnetic tape, compact disk, DVD, optical storage disk, and/or any other computer data storage device. Removable storage drive 1014 reads from and/or writes to removable storage unit 1018 in a well-known manner.
[0064] According to an exemplary embodiment, secondary memory 1010 may include other means, instrumentalities, or other approaches for allowing computer programs and/or other instructions and/or data to be accessed by computer system 1000. Such means, instrumentalities or other approaches may include, for example, a removable storage unit 1022 and an interface 1020. Examples of the removable storage unit 1022 and the interface 1020 may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM or PROM) and associated socket, a memory stick and USB port, a memory card and associated memory card slot, and/or any other removable storage unit and associated interface.
[0065] Computer system 1000 may further include a communication or network interface 1024. Communication interface 1024 enables computer system 1000 to communicate and interact with any combination of remote devices, remote networks, remote entities, etc. (individually and collectively referenced as 1028). For example, communication interface 1024 may allow computer system 1000 to communicate with remote devices 1028 over communication path 1026, which may be wired and/or wireless, and which may include any combination of LANs, WANs, Internet, etc. Control logic and/or data may be transmitted to and from computer system 1000 via communication path 1026.
[0066] In an embodiment, a tangible apparatus or article of manufacture comprising a tangible computer useable or readable medium having control logic (software) stored thereon is also referred to herein as a computer program product or program storage device. This includes, but is not limited to, computer system 1000, main memory 1008, secondary memory 1010, and removable storage units 1018 and 1022, as well as tangible articles of manufacture
embodying any combination of the foregoing. Such control logic, when executed by one or more data processing devices (such as computer system 1000), causes such data processing devices to operate as described herein.
[0067] Based on the teachings contained in this disclosure, it will be apparent to persons skilled in the relevant art(s) how to make and use embodiments of the present disclosure using data processing devices, computer systems and/or computer architectures other than those shown in FIG. 10. For example, embodiments may operate with software, hardware, and/or operating system implementations other than those described herein.
[0068] In various aspects of the present disclosure, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or encoded as instructions or code on a non- transitory computer-readable medium. Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computing device, such as image enhancement system 100 in FIG. 1. By way of example, and not limitation, such computer-readable media can include random-access memory (RAM), read-only memory (ROM), EEPROM, compact disc read-only memory (CD-ROM) or other optical disk storage, hard disk drive (HDD), such as magnetic disk storage or other magnetic storage devices, Flash drive, solid state drive (SSD), or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a processing system, such as a mobile device or a computer. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), and floppy disk where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer- readable media.
[0069] According to one aspect of the present disclosure, a system for image enhancement is provided. The system includes a memory storing instructions and at least one processor coupled to the memory. The at least one processor is configured to, upon executing the instructions: extract a plurality of feature maps from a plurality of frames after motion compensation; generate one or more attention matrices corresponding to the plurality of feature maps; obtain one or more temporal attentions corresponding to the one or more attention matrices; and enhance quality of the frames corresponding to the one or more temporal attentions.
[0070] In some implementations, the plurality of frames include at least a plurality of aligned frames, and the plurality of aligned frames correspond to at least a portion of the frames that are motion compensated. A plurality of feature maps are extracted corresponding to the plurality of frames, and the one or more attention matrices are generated corresponding to the plurality of feature maps.
[0071] In some implementations, the plurality of frames are continuous, and a first attention matrix of the one or more attention matrices corresponding to a first frame of the plurality of frames is used to generate a second attention matrix of the one or more attention matrices corresponding to a second frame of the plurality of frames. The second frame is after and next to the first frame.
[0072] In some implementations, a number of channels in each frame of the plurality of frames is C, and each attention matrix of the one or more attention matrices is a C X C matrix, where C is a positive integer and greater than 1.
[0073] In some implementations, a plurality of feature maps are extracted corresponding to the plurality of frames, and one attention matrix is generated corresponding to the plurality of feature maps.
[0074] In some implementations, a number of the plurality of frames is N, and the attention matrix is an N X N matrix, where N is a positive integer and greater than 1.
[0075] In some implementations, the quality of the frames is enhanced by improving resolution of the frames.
[0076] In some implementations, a number of channels in each frame of the plurality of frames is C, a number of the plurality of frames is N, and the attention matrix is an NC X NC matrix, where C and N are positive integers and greater than 1.
[0077] In some implementations, the quality of the plurality of frames is enhanced by denoising the plurality of frames.
[0078] In some implementations, a width of each frame of the plurality of frames is W, a height of each frame of the plurality of frames is H, where W and H are positive integers and greater than 1. The attention matrix is a WH X WH matrix.
[0079] In some implementations, the quality of the plurality of frames is enhanced by de-raining and/or defogging the plurality of frames.
[0080] In some implementations, the system further includes a motion compensation unit configured to compensate and align input frames.
[0081] In some implementations, the system further includes an image enhancement unit configured to improve the quality of the frames based on the obtained one or more temporal attentions.
[0082] According to another aspect of the present disclosure, a method of image enhancement is provided. The method includes extracting a plurality of feature maps from a plurality of frames after motion compensation; generating one or more attention matrices corresponding to the plurality of feature maps; obtaining one or more temporal attentions corresponding to the one or more attention matrices; and enhancing quality of the frames corresponding to the one or more temporal attentions.
[0083] In some implementations, the plurality of frames include at least a plurality of aligned frames, and the plurality of aligned frames correspond to at least a portion of the frames that are motion compensated. Extracting a plurality of feature maps and generating one or more attention matrices includes extracting a plurality of feature maps corresponding to the plurality of frames, and generating the one or more attention matrices corresponding to the plurality of feature maps.
[0084] In some implementations, extracting a plurality of feature maps and generating one or more attention matrices further includes generating a second attention matrix corresponding to a second frame based on a first attention matrix corresponding to a first frame. The plurality of frames are continuous, and the second frame is after and next to the first frame.
[0085] In some implementations, a number of channels in each frame of the plurality of frames is C, and each attention matrix of the one or more attention matrices is a C X C matrix, where C is a positive integer and greater than 1.
[0086] In some implementations, extracting a plurality of feature maps and generating one or more attention matrices includes extracting a plurality of feature maps corresponding to the plurality of frames and generating one attention matrix corresponding to the plurality of feature maps.
[0087] In some implementations, a number of the plurality of frames is N, a number of channels in each frame of the plurality of frames is C, a width of each frame of the plurality of frames is W, and a height of each frame of the plurality of frames is H, where C, N, H, and W are positive integers and greater than 1, and the attention matrix is a N X N matrix, or a NC X NC matrix, or a WH X WH matrix.
[0088] According to still another aspect of the present disclosure, a non-transitory computer-readable medium is provided. Computer-readable instructions are stored on the non-transitory computer-readable medium, and a computer system will perform a method of image enhancement when the instructions are executed by the computer system. The method includes extracting a plurality of feature maps from a plurality of frames after motion compensation; generating one or more attention matrices corresponding to the plurality of feature maps; obtaining one or more temporal attentions corresponding to the one or more attention matrices; and enhancing quality of the frames corresponding to the one or more temporal attentions.
[0089] Embodiments of the present disclosure have been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
[0090] The Summary and Abstract sections may set forth one or more but not all exemplary embodiments of the present disclosure as contemplated by the inventor(s), and thus, are not intended to limit the present disclosure and the appended claims in any way.
[0091] Various functional blocks, modules, and steps are disclosed above. The particular arrangements provided are illustrative and without limitation. Accordingly, the functional blocks, modules, and steps may be re-ordered or combined in different ways than in the examples provided above. Likewise, some embodiments include only a subset of the functional blocks, modules, and steps, and any such subset is permitted.
[0092] The breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments but should be defined only in accordance with the following claims and their equivalents.
Claims
1. A system for image enhancement comprising: a memory storing instructions; and at least one processor coupled to the memory; wherein the at least one processor is configured to, upon executing the instructions: extract a plurality of feature maps from a plurality of frames; generate one or more attention matrices corresponding to the plurality of feature maps; obtain one or more temporal attentions corresponding to the one or more attention matrices; and enhance quality of the frames corresponding to the one or more temporal attentions.
2. The system of claim 1, wherein the plurality of frames comprise at least a plurality of aligned frames, and the plurality of aligned frames correspond to at least a portion of the frames that are motion compensated; the plurality of feature maps are extracted corresponding to the plurality of frames; and the one or more attention matrices are generated corresponding to the plurality of feature maps.
3. The system of claim 2, wherein the plurality of frames are continuous; a first attention matrix of the one or more attention matrices corresponding to a first frame of the plurality of frames is used to generate a second attention matrix of the one or more attention matrices corresponding to a second frame of the plurality of frames; and the second frame is after and next to the first aligned frame.
4. The system of claim 3, wherein a number of channels in each frame of the plurality of frames is C; and each attention matrix of the one or more attention matrices is a C X C matrix, where C is a positive integer and greater than 1.
5. The system of claim 1, wherein
the plurality of feature maps are extracted corresponding to the plurality of frames; and one attention matrix is generated corresponding to the plurality of feature maps.
6. The system of claim 5, wherein a number of the plurality of frames is N, and the attention matrix is a N X N matrix, where N is a positive integer and greater than 1.
7. The system of claim 6, wherein the quality of the frames is enhanced by improving resolution of the frames.
8. The system of claim 5, wherein a number of channels in each frame of the plurality of frames is C, a number of the plurality of frames is N, and the attention matrix is a NC X NC matrix, where C and N are positive integers and greater than 1.
9. The system of claim 8, wherein the quality of the plurality of frames is enhanced by denoising the plurality of frames.
10. The system of claim 5, wherein a width of each frame of the plurality of frames is W, a height of each frame of the plurality of frames is H, where W and H are positive integers and greater than 1; and the attention matrix is a WH X WH matrix.
11. The system of claim 10, wherein the quality of the plurality of frames is enhanced by deraining and/or defogging the plurality of frames.
12. The system of claim 1, further comprising a motion compensation unit configured to compensate and align input frames.
13. The system of claim 1, further comprising an image enhancement unit configured to improve the quality of the frames based on the obtained one or more temporal attentions.
14. A method for image enhancement, comprising: extracting a plurality of feature maps from a plurality of frames; generating one or more attention matrices corresponding to the plurality of feature maps;
obtaining one or more temporal attentions corresponding to the one or more attention matrices; and enhancing quality of the frames corresponding to the one or more temporal attentions.
15. The method of claim 14, wherein the plurality of frames comprise at least a plurality of aligned frames, and the plurality of aligned frames correspond to at least a portion of the frames that are motion compensated, extracting the plurality of feature maps and generating one or more attention matrices comprises: extracting the plurality of feature maps corresponding to the plurality of frames; and generating the one or more attention matrices corresponding to the plurality of feature maps.
16. The method of claim 15, wherein extracting the plurality of feature maps and generating one or more attention matrices further comprises: generating a second attention matrix of the one or more attention matrices corresponding to a second frame of the plurality of frames based on a first attention matrix of the one or more attention matrices corresponding to a first frame of the plurality of frames, wherein the plurality of frames are continuous, and the second frame is after and next to the first frame.
17. The method of claim 16, wherein a number of channels in each frame of the plurality of frames is C and each attention matrix of the one or more attention matrices is a C X C matrix, where C is a positive integer and greater than 1.
18. The method of claim 14, wherein extracting the plurality of feature maps and generating one or more attention matrices comprises: extracting the plurality of feature maps corresponding to the plurality of frames; and generating one attention matrix corresponding to the plurality of feature maps.
19. The method of claim 18, wherein a number of the plurality of frames is N, a number of channels in each frame of the plurality of frames is C, a width of each frame of the plurality of frames is W, and a height of each frame of the plurality of frames is H, where C, N, H, and W are positive integers and greater than 1, and the attention matrix is a N X N matrix, or a NC X NC matrix, or a WH X WH matrix.
20. A non-transitory computer-readable medium having stored thereon computer-readable instructions that, when executed by a computer system, cause the computer system to perform a method of image enhancement, wherein the method comprises: extracting a plurality of feature maps from a plurality of frames after motion compensation; generating one or more attention matrices corresponding to the plurality of feature maps; obtaining one or more temporal attentions corresponding to the one or more attention matrices; and enhancing quality of the frames corresponding to the one or more temporal attentions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2023/011361 WO2024158376A1 (en) | 2023-01-23 | 2023-01-23 | System and method for image enhancement |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024158376A1 true WO2024158376A1 (en) | 2024-08-02 |
Family
ID=91970958
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2023/011361 WO2024158376A1 (en) | 2023-01-23 | 2023-01-23 | System and method for image enhancement |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024158376A1 (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220156939A1 (en) * | 2020-11-17 | 2022-05-19 | Uatc, Llc | Systems and Methods for Video Object Segmentation |
US20220164934A1 (en) * | 2020-09-30 | 2022-05-26 | Boe Technology Group Co., Ltd. | Image processing method and apparatus, device, video processing method and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23918814 Country of ref document: EP Kind code of ref document: A1 |