
CN115409697A - Image processing method and related device - Google Patents

Image processing method and related device Download PDF

Info

Publication number
CN115409697A
CN115409697A (application number CN202110594578.4A)
Authority
CN
China
Prior art keywords
feature
image
information
fusion
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110594578.4A
Other languages
Chinese (zh)
Inventor
陈培林
杨文瀚
王诗淇
胡康康
孙龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
City University of Hong Kong CityU
Original Assignee
Huawei Technologies Co Ltd
City University of Hong Kong CityU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd and City University of Hong Kong CityU
Priority to CN202110594578.4A
Publication of CN115409697A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/44 Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image processing method applied in the field of artificial intelligence. The method comprises the following steps: acquiring a first image and first spatial domain information from code stream information, wherein the first image is obtained by decoding the code stream information, the first image corresponds to the first spatial domain information, and the first spatial domain information is spatial coding information of the first image; performing feature extraction processing on the first image to obtain a first feature; performing feature extraction processing on the first spatial domain information to obtain a second feature; performing fusion processing on the first feature and the second feature to obtain a first fusion feature; and reconstructing a target image based on the first fusion feature, wherein the resolution of the target image is higher than that of the first image. By incorporating the features of the spatial domain information into the image super-resolution reconstruction process, the scheme can improve the effect of super-resolution reconstruction and obtain a reconstructed image of higher quality.

Description

Image processing method and related device
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an image processing method and a related apparatus.
Background
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Traditional video super-resolution algorithms reconstruct a group of continuous low-resolution frames through techniques such as linear interpolation, and a high-resolution image is then obtained through fusion. However, traditional methods generally fail to produce sharp textures in the super-resolution results. In recent years, with the rapid development of deep learning, the academic community has generally turned to deep learning methods to reconstruct low-resolution images into high-resolution images in order to improve the effect of super-resolution.
Currently, for video transmitted in the form of a code stream, the related art decodes the code stream with a decoder to obtain an image sequence. The image sequence is then input into a neural network for super-resolution reconstruction, so that the low-resolution image sequence is reconstructed into a high-resolution image sequence. However, the related art does not utilize the effective information in the code stream during the reconstruction of the image sequence, and the image reconstruction effect is mediocre.
Disclosure of Invention
The embodiments of the application provide an image processing method and a related device. Spatial domain information in the code stream information is introduced in the process of image super-resolution reconstruction: the features extracted from the spatial domain information are fused with the image features to obtain fusion features, and a high-resolution target image is finally reconstructed based on the fusion features. Because the spatial domain information in the code stream information can provide the image characteristics of the image in local space as well as the encoding details of the image, incorporating the features of the spatial domain information into the image super-resolution reconstruction process can improve the effect of super-resolution reconstruction and yield a reconstructed image of higher quality.
A first aspect of the application provides an image processing method, which is applied to a terminal. The method comprises the following steps: the terminal obtains a first image and first spatial domain information from the code stream information. The first image is obtained by decoding the code stream information, the first image corresponds to the first spatial domain information, and the first spatial domain information is spatial coding information of the first image. The code stream information may refer to video-related coding information that can be acquired from a decoder, including coding information carried in the code stream received by the decoder and coding information generated while the decoder processes the code stream. The first spatial domain information can provide image characteristics of the image in local space and encoding details of the image.
Then, the terminal performs feature extraction processing on the first image to obtain a first feature, and performs feature extraction processing on the first spatial domain information to obtain a second feature. For example, the terminal performs feature extraction on the first image and the first spatial domain information respectively, based on different convolutional networks, to obtain the first feature and the second feature.
Next, the terminal performs fusion processing on the first feature and the second feature to obtain a first fusion feature. For example, the terminal may perform the fusion processing on the first feature and the second feature based on a convolutional network.
Finally, the terminal reconstructs a target image based on the first fusion feature. The resolution of the target image is higher than that of the first image, and the target image is the image after super-resolution reconstruction.
According to this scheme, spatial domain information in the code stream information is introduced in the process of image super-resolution reconstruction: the features extracted from the spatial domain information are fused with the image features to obtain a fusion feature, and a high-resolution target image is finally reconstructed based on the fusion feature. Because the spatial domain information in the code stream information can provide the image characteristics of the image in local space and the coding details of the image, incorporating the features of the spatial domain information into the super-resolution reconstruction process can effectively enhance the effective information in the image features and suppress the compression-induced noise in the image features, thereby improving the effect of super-resolution reconstruction and obtaining a reconstructed image of higher quality.
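As a non-authoritative illustration of this first aspect, the following sketch shows one possible arrangement of the feature extraction, fusion and reconstruction steps; the network structure, layer widths, channel counts and upscaling factor are assumptions made for illustration and are not specified by the embodiments.

```python
# Hypothetical sketch of the described pipeline: extract features from the
# decoded image and from its spatial-domain (coding) information, fuse them,
# and reconstruct a higher-resolution image. All sizes are illustrative.
import torch
import torch.nn as nn

class SpatialInfoSuperResolution(nn.Module):
    def __init__(self, img_channels=3, side_channels=3, feat=64, scale=2):
        super().__init__()
        # Separate convolutional branches for the image and the spatial-domain info
        self.img_extractor = nn.Sequential(
            nn.Conv2d(img_channels, feat, 3, padding=1), nn.ReLU(inplace=True))
        self.side_extractor = nn.Sequential(
            nn.Conv2d(side_channels, feat, 3, padding=1), nn.ReLU(inplace=True))
        # Simple fusion by concatenation followed by a convolution
        self.fuse = nn.Conv2d(2 * feat, feat, 3, padding=1)
        # Reconstruction by sub-pixel upsampling to the target resolution
        self.reconstruct = nn.Sequential(
            nn.Conv2d(feat, img_channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, first_image, first_spatial_info):
        first_feature = self.img_extractor(first_image)            # first feature
        second_feature = self.side_extractor(first_spatial_info)   # second feature
        first_fusion = self.fuse(
            torch.cat([first_feature, second_feature], dim=1))     # first fusion feature
        return self.reconstruct(first_fusion)                      # target image
```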
In one possible implementation, the first spatial domain information includes one or more of the following: partition information, a residual signal, and a prediction signal.
The partition information indicates how the first image is partitioned into coding blocks. The size of the coding blocks in the first image is related to the local texture complexity of the first image: generally, the higher the local texture complexity of a region, the smaller the coding blocks into which it is partitioned, and the lower the local texture complexity, the larger the coding blocks.
The residual signal represents the difference between a coding block in the first image and its corresponding reference coding block. The residual signal essentially characterizes the high-frequency texture information present in each coding block.
The prediction signal is the signal corresponding to the first image before decoding loop filtering is performed. Briefly, the prediction signal is generated before the deblocking filtering in the decoding process and is a prediction result of the first image prior to decoding loop filtering. In the decoding process, the first image is obtained after decoding loop filtering is applied to the prediction signal.
In a possible implementation manner, the terminal performing feature extraction processing on the first spatial domain information to obtain the second feature includes: when the first spatial domain information comprises multiple kinds of information, stacking the multiple kinds of information in the first spatial domain information to obtain stacked spatial domain information; and performing feature extraction processing on the stacked spatial domain information to obtain the second feature.
In this scheme, the multiple kinds of information in the first spatial domain information are stacked before feature extraction is performed, so that the multiple kinds of information can be effectively introduced and the effective information in the image features can be enhanced, as sketched below.
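For example, if the partition information, the residual signal and the prediction signal are each represented as single-channel maps of the same spatial size, stacking them along the channel dimension could look like the following sketch (the map representation and the extractor are assumptions made for illustration):

```python
# Hypothetical stacking of several kinds of spatial-domain information along
# the channel dimension before feature extraction.
import torch
import torch.nn as nn

def extract_second_feature(partition_map, residual_map, prediction_map,
                           extractor: nn.Module):
    # Each map: (N, 1, H, W); stacked spatial-domain information: (N, 3, H, W)
    stacked = torch.cat([partition_map, residual_map, prediction_map], dim=1)
    return extractor(stacked)  # second feature
```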
In a possible implementation manner, the terminal may first perform stacking processing on the first feature and the second feature to obtain a third feature. Then, the terminal performs a first convolution processing and a second convolution processing on the third feature based on two different convolutional networks, so as to obtain a fourth feature and a fifth feature respectively. The fourth feature and the fifth feature can be used as affine transformation coefficients for performing element-level refinement of the first feature of the first image, so that the features of the spatial domain information are sufficiently combined with the features of the first image. Finally, the terminal performs a multiplication operation on the first feature and the fourth feature to obtain a sixth feature, and performs an addition operation on the sixth feature and the fifth feature to obtain the first fusion feature.
In this scheme, element-level refinement of the first feature of the first image is performed based on the fourth feature and the fifth feature, which serve as affine transformation coefficients, so that the effective information in the first feature can be effectively enhanced and the noise in the first feature suppressed, thereby improving the quality of the fused feature; a sketch of this fusion is given below.
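The following sketch illustrates this element-wise affine fusion; the convolution shapes are assumptions, and the overall form resembles a spatial feature transform, which is mentioned here only as an analogy rather than as the structure prescribed by the embodiments.

```python
# Hypothetical element-level affine fusion: two convolution branches map the
# stacked features to a multiplicative coefficient (fourth feature) and an
# additive coefficient (fifth feature) that refine the image feature.
import torch
import torch.nn as nn

class AffineFusion(nn.Module):
    def __init__(self, feat=64):
        super().__init__()
        self.scale_branch = nn.Conv2d(2 * feat, feat, 3, padding=1)  # first convolution
        self.shift_branch = nn.Conv2d(2 * feat, feat, 3, padding=1)  # second convolution

    def forward(self, first_feature, second_feature):
        third_feature = torch.cat([first_feature, second_feature], dim=1)
        fourth_feature = self.scale_branch(third_feature)
        fifth_feature = self.shift_branch(third_feature)
        sixth_feature = first_feature * fourth_feature   # multiplication operation
        return sixth_feature + fifth_feature             # first fusion feature
```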
In one possible implementation, the method further includes the following. The terminal decodes the code stream information to obtain a second image and obtains, from the code stream information, second spatial domain information corresponding to the second image, where the second image is adjacent to the first image in the time domain. Then, the terminal obtains a second fusion feature, where the second fusion feature is obtained by performing fusion processing on the feature of the second image and the feature of the second spatial domain information, the feature of the second image is obtained by performing feature extraction processing on the second image, and the feature of the second spatial domain information is obtained by performing feature extraction processing on the second spatial domain information. Finally, the terminal reconstructs the target image based on the first fusion feature and the second fusion feature.
In this scheme, the terminal performs super-resolution reconstruction of the first image based on the first image and one or more images adjacent to the first image, which can effectively realize super-resolution reconstruction of the first image.
In a possible implementation manner, the terminal acquiring the second fusion feature includes: the terminal acquires the second fusion feature from a storage space according to the frame number of the second image, where the second fusion feature was stored in the storage space when image processing was performed on the image previous to the first image.
That is, in the process of performing super-resolution reconstruction on each image in the image sequence, the terminal needs to acquire the fusion feature corresponding to that image and the fusion features corresponding to its adjacent images. Therefore, the terminal can store the fusion feature obtained for each image in a cache space. In this way, when the terminal performs super-resolution reconstruction on subsequent images, it can retrieve the corresponding fusion features of those images from the cache space, which avoids repeatedly extracting the fusion features of the same image and improves operating efficiency; a sketch of such a cache follows.
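A minimal sketch of such a cache keyed by frame number follows; the class name and the eviction policy are assumptions, since the embodiments only require that previously computed fusion features can be looked up by frame number.

```python
# Hypothetical cache of per-frame fusion features, so that neighbouring frames
# are not re-processed during sliding-window super-resolution.
from collections import OrderedDict
import torch

class FusionFeatureCache:
    def __init__(self, max_frames=5):
        self.max_frames = max_frames
        self._store = OrderedDict()  # frame_number -> fusion feature tensor

    def put(self, frame_number: int, fusion_feature: torch.Tensor):
        self._store[frame_number] = fusion_feature
        if len(self._store) > self.max_frames:
            self._store.popitem(last=False)  # drop the oldest cached frame

    def get(self, frame_number: int):
        return self._store.get(frame_number)  # None if not cached yet
```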
In a possible implementation manner, the reconstructing, by the terminal, the target image based on the first fusion feature and the second fusion feature includes the following steps.
The terminal may perform an alignment operation on the second fusion feature based on the first fusion feature, so as to obtain an alignment feature. The alignment feature is obtained from the second fusion feature; the value at each coordinate of the alignment feature is not determined solely by a single coordinate in the second fusion feature, but by a local feature block centered on the coordinate pointed to by the motion vector in the second fusion feature, together with the corresponding attention coefficients.
After the alignment feature is obtained, the terminal calculates the similarity between each coordinate in the first fusion feature and the corresponding coordinate in the alignment feature, so as to obtain an attention map corresponding to the alignment feature. For example, the terminal may calculate the cosine similarity between each coordinate in the first fusion feature and its corresponding coordinate in the alignment feature to obtain the attention map. The attention map represents how well the first fusion feature and the alignment feature match at each coordinate: the better the match at a coordinate, the larger the value of that coordinate on the attention map.
Then, the terminal updates the alignment feature according to the similarity at each coordinate of the attention map to obtain an updated alignment feature. For example, the terminal may multiply each coordinate of the alignment feature point by point with the corresponding coordinate of the attention map to obtain the updated alignment feature. That is, the values of the coordinates on the attention map are used as coefficients to update the corresponding coordinates of the alignment feature. Updating the alignment feature based on the attention map causes local regions of the alignment feature that are more similar to the first fusion feature to be adaptively assigned higher attention weights, while local regions with lower similarity to the first fusion feature are assigned lower attention weights.
Finally, the terminal reconstructs the target image based on the first fusion feature and the updated alignment feature. Specifically, the terminal may stack the first fusion feature and the updated alignment feature and fuse them through a convolution operation to obtain a final fusion feature, and then reconstruct the target image based on the final fusion feature. A sketch of this attention step is given below.
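The following sketch illustrates the per-coordinate cosine-similarity attention and the final stacking step; the layer shapes are assumptions, and the alignment feature is assumed to have been computed already.

```python
# Hypothetical sketch of the attention step: a per-coordinate cosine similarity
# between the current frame's fusion feature and the aligned neighbour feature
# reweights the aligned feature before the final fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_update(first_fusion, aligned):
    # first_fusion, aligned: (N, C, H, W)
    attention_map = F.cosine_similarity(first_fusion, aligned, dim=1, eps=1e-8)
    attention_map = attention_map.unsqueeze(1)   # (N, 1, H, W)
    return aligned * attention_map               # updated alignment feature

class FinalFusion(nn.Module):
    def __init__(self, feat=64):
        super().__init__()
        self.merge = nn.Conv2d(2 * feat, feat, 3, padding=1)

    def forward(self, first_fusion, updated_aligned):
        # Stack the two features and fuse them with a convolution
        return self.merge(torch.cat([first_fusion, updated_aligned], dim=1))
```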
In a possible implementation manner, performing the alignment operation on the second fusion feature based on the first fusion feature to obtain the alignment feature includes: the terminal acquires a first local feature block corresponding to each coordinate in the first fusion feature and a second local feature block corresponding to each coordinate in the second fusion feature; the terminal determines attention coefficients between the first local feature block and the second local feature block through a convolutional network; and the terminal performs a weighted average operation on the second local feature blocks in the second fusion feature based on the attention coefficients to obtain the alignment feature, as sketched below.
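A minimal sketch of this local-block alignment under simplifying assumptions: motion compensation is omitted, and the attention coefficients are predicted by a convolution applied to the concatenated features, which is one possible reading of determining the coefficients between the local feature blocks through a convolutional network.

```python
# Hypothetical patch-level alignment: a convolution predicts, for every
# coordinate, attention coefficients over a k x k neighbourhood of the
# neighbouring frame's fusion feature; the alignment feature is the
# attention-weighted average of that neighbourhood.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalAttentionAlign(nn.Module):
    def __init__(self, feat=64, k=3):
        super().__init__()
        self.k = k
        self.coeff = nn.Conv2d(2 * feat, k * k, 3, padding=1)

    def forward(self, first_fusion, second_fusion):
        n, c, h, w = second_fusion.shape
        # Attention coefficients conditioned on both fusion features
        attn = self.coeff(torch.cat([first_fusion, second_fusion], dim=1))
        attn = F.softmax(attn, dim=1)                        # (N, k*k, H, W)
        # k x k local feature blocks of the neighbouring frame's fusion feature
        blocks = F.unfold(second_fusion, self.k, padding=self.k // 2)
        blocks = blocks.view(n, c, self.k * self.k, h, w)    # (N, C, k*k, H, W)
        # Weighted average over each local block -> alignment feature
        return (blocks * attn.unsqueeze(1)).sum(dim=2)
```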
A second aspect of the present application provides an image processing apparatus, including an acquisition unit and a processing unit. The acquisition unit is configured to acquire a first image and first spatial domain information from code stream information, where the first image is obtained by decoding the code stream information, the first image corresponds to the first spatial domain information, and the first spatial domain information is spatial coding information of the first image. The processing unit is configured to perform feature extraction processing on the first image to obtain a first feature; the processing unit is further configured to perform feature extraction processing on the first spatial domain information to obtain a second feature; the processing unit is further configured to perform fusion processing on the first feature and the second feature to obtain a first fusion feature; and the processing unit is further configured to reconstruct a target image based on the first fusion feature, where the resolution of the target image is higher than that of the first image.
In one possible implementation, the first spatial domain information includes one or more of the following: partition information, a residual signal, and a prediction signal;
wherein the partition information represents the partitioning of the coding blocks in the first image, the residual signal represents the difference between a coding block in the first image and its corresponding reference coding block, and the prediction signal is the signal corresponding to the first image before decoding loop filtering is performed.
In a possible implementation manner, the processing unit is further configured to: when the first spatial domain information comprises multiple kinds of information, stack the multiple kinds of information in the first spatial domain information to obtain stacked spatial domain information; and perform feature extraction processing on the stacked spatial domain information to obtain the second feature.
In a possible implementation manner, the processing unit is further configured to: perform stacking processing on the first feature and the second feature to obtain a third feature; perform a first convolution processing and a second convolution processing on the third feature respectively to obtain a fourth feature and a fifth feature; perform a multiplication operation on the first feature and the fourth feature to obtain a sixth feature; and perform an addition operation on the sixth feature and the fifth feature to obtain the first fusion feature.
In a possible implementation manner, the obtaining unit is further configured to obtain a second fusion feature, where the second fusion feature is obtained by performing fusion processing on a feature of the second image and a feature of the second spatial domain information, the feature of the second image is obtained by performing feature extraction processing on the second image, the feature of the second spatial domain information is obtained by performing feature extraction processing on the second spatial domain information, the second image is obtained by decoding based on the code stream information, the second image is adjacent to the first image in a time domain, and the second image corresponds to the second spatial domain information; the processing unit is further configured to reconstruct and obtain the target image based on the first fusion feature and the second fusion feature.
In a possible implementation manner, the obtaining unit is further configured to obtain the second fusion feature from a storage space according to a frame number of the second image, where the second fusion feature is stored in the storage space when performing image processing on an image previous to the first image.
In a possible implementation manner, the processing unit is further configured to: performing an alignment operation on the second fusion feature based on the first fusion feature to obtain an alignment feature; calculating a similarity between each coordinate in the first fused feature and a corresponding coordinate in the alignment feature; updating the alignment feature according to the similarity to obtain an updated alignment feature; and reconstructing to obtain the target image based on the first fusion characteristic and the updated alignment characteristic.
In a possible implementation manner, the obtaining unit is further configured to obtain a first local feature block corresponding to each coordinate in the first fused feature and a second local feature block corresponding to each coordinate in the second fused feature; the processing unit is further configured to determine an attention coefficient between the first local feature block and the second local feature block through a convolutional network; the processing unit is further configured to perform a weighted average operation on each second local feature block in the second fusion feature based on the attention coefficient to obtain the alignment feature.
A third aspect of the present application provides an image processing apparatus, which may comprise a processor and a memory coupled to the processor, the memory storing program instructions which, when executed by the processor, implement the method of the first aspect. For the processor executing the steps in each possible implementation manner of the first aspect, reference may be made to the first aspect, and details are not described here again.
A fourth aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform the method of the first aspect described above.
A fifth aspect of the present application provides circuitry comprising processing circuitry configured to perform the method of the first aspect described above.
A sixth aspect of the present application provides a computer program product which, when run on a computer, causes the computer to perform the method of the first aspect described above.
A seventh aspect of the present application provides a chip system, which includes a processor, and is configured to enable a server or a threshold value obtaining apparatus to implement the functions referred to in the first aspect, for example, sending or processing data and/or information referred to in the method. In one possible design, the system-on-chip further includes a memory for storing program instructions and data necessary for the server or the communication device. The chip system may be formed by a chip, or may include a chip and other discrete devices.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence framework;
FIG. 2 is a schematic diagram of a convolutional neural network provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a convolutional neural network provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of decoding performed by a decoder according to an embodiment of the present application;
FIG. 6 is a schematic flowchart of an image processing method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of first spatial information according to an embodiment of the present application;
FIG. 8 is a schematic diagram of extracting a first fusion feature according to an embodiment of the present application;
FIG. 9 is a schematic diagram of performing super-resolution reconstruction based on the fusion features of a plurality of images according to an embodiment of the present application;
FIG. 10 is a schematic diagram of an implementation of image processing based on a super-resolution network model according to an embodiment of the present application;
FIG. 11 is a schematic diagram of an implementation of an image processing method on a terminal according to an embodiment of the present application;
FIG. 12 is a schematic diagram of an implementation of an image processing method according to an embodiment of the present application;
FIG. 13 is a schematic diagram of steps performed by a decoder according to an embodiment of the present application;
FIG. 14 is a schematic diagram of steps performed by a feature extraction module according to an embodiment of the present application;
FIG. 15 is a schematic diagram of steps performed by an alignment module according to an embodiment of the present application;
FIG. 16 is a schematic diagram of an alignment operation performed according to an embodiment of the present application;
FIG. 17 is a schematic diagram of steps performed by a feature fusion module according to an embodiment of the present application;
FIG. 18 is a schematic diagram of steps performed by a feature reconstruction module according to an embodiment of the present application;
FIG. 19 is a schematic diagram comparing the effects of different image processing methods according to an embodiment of the present application;
FIG. 20 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
FIG. 21 is a schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 22 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
The embodiments of the present application will be described below with reference to the drawings. The terminology used in the description of the embodiments section of the present application is for the purpose of describing particular embodiments of the present application only and is not intended to be limiting of the present application.
Embodiments of the present application are described below with reference to the accompanying drawings. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The general workflow of an artificial intelligence system will be described first. Referring to FIG. 1, which shows a schematic structural diagram of an artificial intelligence framework, the framework is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the general process from data acquisition onward, for example the general flow of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a refinement process of "data - information - knowledge - wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology) up to the industrial ecology of the system.
(1) An infrastructure.
The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and is supported by a base platform. The infrastructure communicates with the outside through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs and FPGAs); the base platform includes related platform guarantees and support such as a distributed computing framework and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to acquire data, and the data is provided to the intelligent chips in the distributed computing system provided by the base platform for computation.
(2) Data.
Data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing.
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference refers to the process of simulating human intelligent inference in a computer or an intelligent system, in which the machine uses formalized information to reason about and solve problems according to an inference control strategy; typical functions are searching and matching.
Decision-making refers to the process of making decisions after reasoning over intelligent information, and generally provides functions such as classification, ranking and prediction.
(4) Universal capability.
After the above-mentioned data processing, further general capabilities may be formed based on the results of the data processing, such as algorithms or a general system, for example, translation, analysis of text, computer vision processing, speech recognition, recognition of images, and so on.
(5) Intelligent products and industrial applications.
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution, turning intelligent information decision-making into products and realizing practical applications. The application fields mainly include intelligent terminals, intelligent transportation, intelligent medical treatment, automatic driving, smart cities, and the like.
The method provided by the application is described from the model training side and the model application side as follows:
the model training method provided by the embodiment of the application can be particularly applied to data processing methods such as data training, machine learning and deep learning, symbolic and formal intelligent information modeling, extraction, preprocessing, training and the like are carried out on training data, and a trained neural network model (such as a target neural network model in the embodiment of the application) is finally obtained; and the target neural network model can be used for model reasoning, and specifically, input data can be input into the target neural network model to obtain output data.
Since the embodiments of the present application relate to the application of a large number of neural networks, for the convenience of understanding, the related terms and related concepts such as neural networks related to the embodiments of the present application will be described below.
(1) A neural network.
The neural network may be composed of neural units. A neural unit may be an operation unit that takes xs (i.e. input data) and an intercept of 1 as inputs, and the output of the operation unit may be:
h(x) = f( ∑_{s=1}^{n} W_s · x_s + b )
where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by connecting many such single neural units together, i.e. the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be a region composed of several neural units.
(2) Convolutional Neural Network (CNN). A convolutional neural network is a deep neural network with a convolutional structure. The convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter, and the convolution process may be viewed as convolving an input image or a convolutional feature plane (feature map) with a trainable filter. A convolutional layer is a layer of neurons (for example, the first convolutional layer and the second convolutional layer in this embodiment) that performs convolution processing on the input signal in the convolutional neural network. In a convolutional layer of a convolutional neural network, one neuron may be connected to only some of the neurons in the adjacent layer. A convolutional layer usually contains several feature planes, and each feature plane may be composed of a number of neural units arranged in a rectangle. Neural units of the same feature plane share weights, and the shared weights are the convolution kernel. Sharing weights can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of other parts, which means that image information learned in one part can also be used in another part, so the same learned image information can be used for all positions on the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information; generally, the larger the number of convolution kernels, the richer the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
Specifically, as shown in FIG. 2, a convolutional neural network (CNN) 100 may include an input layer 110, a convolutional/pooling layer 120 (where the pooling layer is optional), and a neural network layer 130.
The structure formed by the convolutional layer/pooling layer 120 and the neural network layer 130 may be the first convolutional layer and the second convolutional layer described in this application. The input layer 110 is connected to the convolutional/pooling layer 120, the convolutional/pooling layer 120 is connected to the neural network layer 130, and the output of the neural network layer 130 may be input to an activation layer, which may perform nonlinear processing on the output of the neural network layer 130.
Convolutional/pooling layer 120. Convolutional layer: as shown in FIG. 2, the convolutional/pooling layer 120 may include, for example, layers 121-126. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer; in another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
Taking convolutional layer 121 as an example, convolutional layer 121 may include many convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is usually moved over the input image pixel by pixel (or two pixels by two pixels, depending on the value of the stride), so as to complete the task of extracting a specific feature from the image. The size of the weight matrix should be related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends to the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolution output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same dimensions are applied, and the outputs of the weight matrices are stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features from the image: for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a specific color of the image, and yet another weight matrix is used to blur unwanted noise in the image. The dimensions of these weight matrices are the same, so the dimensions of the feature maps extracted by the weight matrices of the same dimensions are also the same, and the extracted feature maps of the same dimensions are combined to form the output of the convolution operation.
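As a concrete illustration of several kernels producing a stacked output (the framework and tensor sizes here are chosen only for illustration):

```python
# Hypothetical example: one convolutional layer with 16 kernels turns a
# 3-channel image into 16 stacked feature maps.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
image = torch.randn(1, 3, 64, 64)   # one RGB image, batch size 1
features = conv(image)              # 16 stacked feature maps, shape (1, 16, 64, 64)
print(features.shape)
```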
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can extract information from the input image, thereby helping the convolutional neural network 100 to make correct prediction.
When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layer (e.g., 121) tends to extract general features, which may also be called low-level features. As the depth of the convolutional neural network 100 increases, the later convolutional layers (e.g., 126) extract increasingly complex features, such as features with high-level semantics; features with higher semantics are more suitable for the problem to be solved.
Pooling layer: since it is often desirable to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after a convolutional layer. In the layers 121-126 illustrated by 120 in FIG. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
The neural network layer 130: after processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet able to output the required output information, because, as described above, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other relevant information), the convolutional neural network 100 needs to use the neural network layer 130 to generate one output or a set of outputs whose number equals the number of required classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132, to 13n as shown in FIG. 2) and an output layer 140, and the parameters contained in the multiple hidden layers may be pre-trained based on relevant training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
After the hidden layers in the neural network layer 130, i.e. at the end of the whole convolutional neural network 100, is the output layer 140. The output layer 140 has a loss function similar to categorical cross-entropy, which is specifically used to compute the prediction error. Once the forward propagation (i.e. the propagation from 110 to 140 in FIG. 2) of the whole convolutional neural network 100 is completed, the back propagation (i.e. the propagation from 140 to 110 in FIG. 2) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
It should be noted that the convolutional neural network 100 shown in fig. 2 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, as shown in fig. 3, a plurality of convolutional layers/pooling layers are parallel, and the features extracted respectively are all input to the overall neural network layer 130 for processing.
(3) A deep neural network.
Deep Neural Networks (DNNs), also known as multi-layer neural networks, can be understood as neural networks having many hidden layers; here "many" has no particular threshold. Dividing a DNN by the position of its layers, the neural network inside a DNN can be divided into three categories: the input layer, the hidden layers and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers. The layers are fully connected, that is, any neuron of the i-th layer is necessarily connected with any neuron of the (i+1)-th layer. Although a DNN appears complex, the work of each layer is not complex; it is simply the following linear relational expression:
y = α( W · x + b )
where x is the input vector, y is the output vector, b is the offset (bias) vector, W is the weight matrix (also called the coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Because a DNN has many layers, there are many coefficient matrices W and offset vectors b. These parameters are defined in the DNN as follows, taking the coefficient W as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_{24}, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary, the coefficient from the k-th neuron of layer L-1 to the j-th neuron of layer L is defined as W^L_{jk}. Note that the input layer has no W parameters. In deep neural networks, more hidden layers make the network better able to describe complex situations in the real world. Theoretically, a model with more parameters has higher complexity and larger "capacity", which means that it can complete more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
(4) A loss function.
In the process of training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is really desired, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the really desired target value (of course, an initialization process is usually carried out before the first update, i.e. parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the really desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function (loss function) or objective function (objective function), which are important equations used to measure the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
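Purely as an illustration (the embodiments do not prescribe a particular loss function), a commonly used loss is the mean squared error between the predicted value and the target value:

L_MSE = (1/N) · ∑_{i=1}^{N} ( ŷ_i - y_i )^2

where ŷ_i is the predicted value for the i-th sample, y_i is the corresponding target value, and N is the number of samples; the smaller L_MSE is, the closer the predictions are to the target values.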
(5) A back propagation algorithm.
The convolutional neural network can use the back propagation (BP) algorithm to correct the values of the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, the input signal is propagated forward until an error loss is produced at the output, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by the error loss, aiming at obtaining the optimal parameters of the super-resolution model, such as the weight matrices.
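A minimal sketch of one training step showing forward propagation, loss computation and back propagation of the error loss; the model, optimizer and tensor shapes are placeholders, not the structure used by the embodiments.

```python
# Hypothetical single training step: forward pass, loss, back propagation,
# and parameter update.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)            # stand-in for a super-resolution model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

low_res = torch.randn(1, 3, 64, 64)              # illustrative input
target = torch.randn(1, 3, 64, 64)               # illustrative ground truth

prediction = model(low_res)                      # forward propagation
loss = criterion(prediction, target)             # reconstruction error loss
optimizer.zero_grad()
loss.backward()                                  # back propagation of the error loss
optimizer.step()                                 # parameter update
```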
(6) Linear operations.
Linearity refers to a proportional, straight-line relationship between quantities; mathematically, it can be understood as a function whose first derivative is a constant. Linear operations may be, but are not limited to, addition operations, null operations, identity operations, convolution operations, batch normalization (BN) operations, and pooling operations. Linear operations may also be called linear mappings; a linear mapping needs to satisfy two conditions, homogeneity and additivity, and an operation is non-linear if either condition is not satisfied.
Here, homogeneity means f(ax) = af(x), and additivity means f(x + y) = f(x) + f(y); for example, f(x) = ax is linear. Note that x, a and f(x) here are not necessarily scalars; they may be vectors or matrices, forming a linear space of any dimension. If x and f(x) are n-dimensional vectors, homogeneity is satisfied when a is a constant, and additivity is satisfied when a is a matrix. In contrast, a function whose graph is a straight line does not necessarily correspond to a linear mapping; for example, f(x) = ax + b satisfies neither homogeneity nor additivity and therefore belongs to the non-linear mappings.
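The two conditions can be checked numerically, as in the following small sketch (the function names are illustrative):

```python
# Hypothetical numeric check of homogeneity and additivity for f(x) = a*x
# versus f(x) = a*x + b.
def f_linear(x, a=2.0):
    return a * x

def f_affine(x, a=2.0, b=1.0):
    return a * x + b

x, y, a = 3.0, 5.0, 4.0
assert f_linear(a * x) == a * f_linear(x)              # homogeneity holds
assert f_linear(x + y) == f_linear(x) + f_linear(y)    # additivity holds
assert f_affine(a * x) != a * f_affine(x)              # homogeneity fails
assert f_affine(x + y) != f_affine(x) + f_affine(y)    # additivity fails
```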
In the embodiment of the present application, a composite of a plurality of linear operations may be referred to as a linear operation, and each linear operation included in the linear operation may also be referred to as a sub-linear operation.
Fig. 4 is a schematic diagram of a system architecture provided in an embodiment of the present application, in fig. 4, an execution device 110 configures an input/output (I/O) interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through a client device 140.
While the execution device 110 preprocesses the input data, or while the calculation module 111 of the execution device 110 performs relevant processing such as calculation (for example, implementing the functions of the neural network in the present application), the execution device 110 may call data, code and the like in the data storage system 150 for the corresponding processing, and may also store the data, instructions and the like obtained by the corresponding processing into the data storage system 150.
Finally, the I/O interface 112 returns the processing results to the client device 140 for presentation to the user.
Alternatively, the client device 140 may be, for example, a control unit in an automatic driving system or a functional algorithm module in a mobile phone terminal, and the functional algorithm module may be used to implement related tasks.
It should be noted that the training device 120 may generate corresponding target models/rules (e.g., target neural network models in this embodiment) based on different training data for different targets or different tasks, and the corresponding target models/rules may be used to achieve the targets or complete the tasks, so as to provide the user with the required results.
In the case shown in fig. 4, the user may manually give the input data, which may be operated through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the client device 140. The user may view the results output by the execution device 110 at the client device 140, and the specific form may be a display, a sound, an action, and the like. The client device 140 may also serve as a data collection terminal, collecting input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data, and storing the new sample data in the database 130. Of course, the input data inputted to the I/O interface 112 and the output result outputted from the I/O interface 112 as shown in the figure may be directly stored in the database 130 as new sample data by the I/O interface 112 without being collected by the client device 140.
It should be noted that fig. 4 is only a schematic diagram of a system architecture provided in an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 4, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may also be disposed in the execution device 110.
Deep learning methods, particularly methods based on convolutional neural networks (CNNs), have been a key driving force behind the development of artificial intelligence in recent years and have achieved remarkable results on various computer vision tasks. In the field of image processing, methods based on deep learning have surpassed conventional methods.
The traditional video super-resolution algorithm can reconstruct a group of continuous low-resolution frames through technologies such as linear interpolation and the like, so that a high-resolution image is obtained through fusion. But conventional methods generally fail to produce sharp textures in the super-resolution results. In recent years, with the rapid development of deep learning technology, in order to improve the effect of super-resolution technology, the academic world generally starts to reconstruct an image with low resolution into an image with high resolution by using a deep learning method.
In the related art, for video transmitted in the form of a code stream, a decoder first decodes the code stream to obtain an image sequence. The image sequence is then input into a neural network for super-resolution reconstruction, so that the low-resolution image sequence is reconstructed into a high-resolution image sequence. However, the related art does not utilize the effective information in the code stream during the reconstruction of the image sequence, and the image reconstruction effect is therefore limited.
In addition, in the related art, a high-resolution video and its downsampled result are usually used as a training data pair to train the neural network model for super-resolution reconstruction. However, because the decoded video has been compression-encoded, it contains some compression noise. In this case, performing super-resolution reconstruction on the decoded video with a neural network model trained on clean data inevitably amplifies the compression noise, and the image reconstruction effect is again limited.
In view of this, an embodiment of the present application provides an image processing method, which includes adding spatial information in code stream information in an image super-resolution reconstruction process, fusing features obtained by extracting the spatial information with image features to obtain fusion features, and reconstructing based on the fusion features to obtain a high-resolution target image. Because the spatial information in the code stream information can provide the image characteristics of the image in the local space and the encoding details of the image, the characteristics of the spatial information are blended in the process of image super-resolution reconstruction, the effect of image super-resolution reconstruction can be improved, and a reconstructed image with higher quality can be obtained.
The image processing method provided by the embodiment of the application can be applied to a terminal, in particular to a terminal that receives code stream information and obtains a video based on the code stream information. Illustratively, the terminal may be, for example, a mobile phone (mobile phone), a Personal Computer (PC), a notebook computer, a server, a tablet computer, a smart tv, a Mobile Internet Device (MID), a wearable device, a Virtual Reality (VR) device, an Augmented Reality (AR) device, a wireless terminal in industrial control (industrial control), a wireless terminal in self driving (self driving), a wireless terminal in remote medical treatment (remote medical), a wireless terminal in smart grid (smart grid), a wireless terminal in transportation safety (transportation safety), a wireless terminal in smart city (smart city), a wireless terminal in smart home (smart home), and the like. For convenience of description, the image processing method provided in the embodiment of the present application will be described below by taking as an example its application to a terminal.
It should be understood that the image in the embodiment of the present application may be a static image (or referred to as a static picture) or a dynamic image (or referred to as a dynamic picture), such as an RGB image, a black-and-white image, a grayscale image, or the like. For convenience of description, the present application collectively refers to a still image or a moving image as an image in the following embodiments.
For the sake of understanding, technical terms related to the embodiments of the present application will be described below.
Video: video is a sequence of images, consisting of successive image frames, one image frame being an image. Due to the persistence of vision effect of human eyes, when an image sequence is played at a certain speed, the human eyes see a video with continuous motion.
Video compression coding: because the content between adjacent image frames in the video generally has higher similarity, in order to facilitate storage and transmission, the code stream is generated by encoding and compressing the original video, so as to remove redundancy in spatial and temporal dimensions.
Code stream: and the video is subjected to coding compression to obtain the data flow in unit time.
Decoding the code stream: and restoring the video from the code stream which is compactly represented into a continuous image sequence according to the existing information in the code stream and the pre-defined coding and decoding standard.
A characteristic pyramid: a feature pyramid of an image is a series of feature sets arranged in a pyramid shape. The feature pyramid is typically obtained by down-sampling an original feature in sequence, so that the size of the features in the feature pyramid decreases layer by layer.
In the process of encoding a video, an encoder divides each image in the video into a plurality of coding blocks, and searches an adjacent image for a reference coding block corresponding to each coding block in the current image, where the reference coding block is an image block in the adjacent image. The distance between a coding block in an image and its corresponding reference coding block can be represented by a motion vector. The sending end can then send the division of the coding blocks of the image and the motion vectors corresponding to the coding blocks to the receiving end through the code stream, and a decoder in the receiving end decodes the image sequence based on the information in the code stream.
Referring to fig. 5, fig. 5 is a schematic flowchart illustrating decoding performed by a decoder according to an embodiment of the present disclosure. In the process of decoding based on the information in the code stream, the decoder sequentially performs operations such as entropy decoding, inverse quantization, and inverse discrete cosine transform (DCT) based on the division information of the coding blocks in the code stream to obtain a residual signal. The decoder also performs intra-frame prediction and motion compensation based on the motion vectors in the code stream to obtain an intra-frame prediction result and a motion compensation result, respectively. The decoder processes the residual signal, the intra-frame prediction result, and the motion compensation result to obtain a prediction signal. Finally, the decoder performs deblocking filtering and sample adaptive offset (SAO) filtering on the prediction signal to obtain a decoded image.
Referring to fig. 6, fig. 6 is a schematic flowchart of an image processing method according to an embodiment of the present disclosure. As shown in fig. 6, the image processing method includes the following steps 601-605.
Step 601: a terminal acquires a first image and first spatial information in code stream information, where the first image is obtained by decoding based on the code stream information, the first image corresponds to the first spatial information, and the first spatial information is coding information of the first image in space.
In this embodiment, the code stream information may refer to coding information that can be acquired from a decoder and is related to a video, and includes coding information carried in the code stream received by the decoder and coding information generated in a process of processing the code stream by the decoder. After the decoder receives the code stream, the decoder performs decoding operation on the code stream to obtain a series of intermediate information, and finally generates an image sequence. The first image is an image generated after a decoder performs a decoding operation on the code stream.
The code stream received by the decoder includes the coding information related to each image in the image sequence, and the decoder performs decoding operation based on the coding information related to each image in the code stream, so as to generate other coding information. In this way, for each picture in the image sequence, the code stream information corresponding to the picture can be obtained from the decoder. Therefore, after the decoder decodes the first image, the terminal may acquire code stream information corresponding to the first image from the decoder, and determine first spatial information corresponding to the first image from the code stream information.
Generally, the coding information corresponding to an image includes temporal domain information and spatial domain information. The temporal information can provide an intuitive reference relationship between images; for example, the temporal information may include a motion vector between an image and its reference image. The spatial information can provide the image characteristics of the image in local space and the coding details of the image.
Optionally, the first spatial information corresponding to the first image may include one or more of the following information: partition information, a residual signal, and a prediction signal. For example, the first spatial information may include any one of partition information, a residual signal, and a prediction signal; the first spatial information may include two kinds of information, i.e., partition information and a residual signal; the first spatial information may further include three kinds of information including partition information, a residual signal, and a prediction signal.
Wherein the division information indicates a division condition of the coding block in the first image. The size of the encoded block in the first image is related to the local texture complexity in the first image. Generally, the higher the complexity of the local texture of an image is, the smaller the coding blocks divided by the image are; the lower the local texture complexity of the image, the larger the coding blocks into which the image is divided.
The residual signal represents a difference between a coding block in the first picture and a corresponding reference coding block. The residual signal is actually able to characterize the high frequency texture information present in each coded block.
The prediction signal is the signal corresponding to the first image before decoding loop filtering is performed. Briefly, the prediction signal is generated before the loop filtering (such as deblocking filtering) in the decoding process, and it is the prediction result of the first image before decoding loop filtering. In the decoding process, the first image is obtained after decoding loop filtering is performed on the prediction signal.
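For ease of understanding, the following illustrative Python sketch shows one way the optional kinds of first spatial information could be grouped together; the class name, field names, and array types are assumptions for illustration only and are not specified by the embodiment.

```python
# Illustrative container for the optional kinds of first spatial information.
# Field names and NumPy array types are assumptions, not part of the embodiment.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class SpatialInfo:
    partition: Optional[np.ndarray] = None    # division information of the coding blocks
    residual: Optional[np.ndarray] = None     # residual signal of the coding blocks
    prediction: Optional[np.ndarray] = None   # prediction signal before decoding loop filtering

    def present(self) -> list:
        """Return whichever of the three kinds of information is available."""
        return [x for x in (self.partition, self.residual, self.prediction) if x is not None]
```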
For example, referring to fig. 7, fig. 7 is a schematic diagram of first spatial domain information according to an embodiment of the present application. As shown in fig. 7, the first spatial domain information corresponding to the first image includes partition information, a residual signal, and a prediction signal. Wherein the division information indicates a case of performing coded block division on the first image. As can be seen from fig. 7, for a region with low texture complexity, such as sky, road surface, etc., in the first image, the divided coding blocks are larger; for regions with high texture complexity, such as people and trees in the first image, the divided coding blocks are small.
Step 602, the terminal performs feature extraction processing on the first image to obtain a first feature.
In this embodiment, the terminal may perform the feature extraction processing on the first image by using an existing image feature extraction method, and the embodiment does not limit the manner of performing the feature extraction processing on the first image. Illustratively, the terminal may perform a feature extraction process on the first image through a convolutional network to obtain a first feature of the first image.
Step 603: the terminal performs feature extraction processing on the first spatial information to obtain a second feature.
When the first spatial information includes only one type of information, the terminal may directly perform feature extraction processing on the first spatial information to obtain the second feature.
When the first spatial information includes multiple types of information, the terminal performs stacking processing on the multiple types of information in the first spatial information to obtain stacked spatial information, and then performs feature extraction processing on the stacked spatial information to obtain the second feature. Optionally, the terminal may perform the feature extraction processing on the stacked spatial information through a convolutional network.
For example, when the first spatial information includes partition information, a residual signal, and a prediction signal, and each of the three is a single channel, the terminal performs stacking processing on the partition information, the residual signal, and the prediction signal to obtain stacked spatial information with three channels. The terminal then performs feature extraction processing on the three-channel stacked spatial information to obtain the second feature.
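For ease of understanding, the following illustrative Python (PyTorch) sketch shows the stacking of three single-channel spatial maps and the extraction of the second feature with a small convolutional network; the frame size, channel count, and network depth are assumptions for illustration only.

```python
# Minimal sketch: stack partition information, residual signal and prediction signal
# along the channel dimension, then extract the second feature with a small CNN.
import torch
import torch.nn as nn

H, W = 270, 480                       # assumed low-resolution frame size
partition = torch.rand(1, 1, H, W)    # division information, one channel
residual = torch.rand(1, 1, H, W)     # residual signal, one channel
prediction = torch.rand(1, 1, H, W)   # prediction signal, one channel

# Stack the three kinds of spatial information into a three-channel tensor.
stacked = torch.cat([partition, residual, prediction], dim=1)   # (1, 3, H, W)

# Feature extraction on the stacked spatial information.
spatial_encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
)
second_feature = spatial_encoder(stacked)                       # (1, 64, H, W)
```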
Step 604: the terminal performs fusion processing on the first feature and the second feature to obtain a first fusion feature.
In this embodiment, the terminal may perform fusion processing on the first feature and the second feature based on a convolutional network to obtain a first fusion feature. By fusing the first characteristic and the second characteristic, spatial information can be effectively introduced, the characteristic quality is improved, and the performance of super-resolution of the image is enhanced.
Optionally, the terminal may first perform stacking processing on the first feature and the second feature to obtain a third feature. Then, the terminal respectively executes first convolution processing and second convolution processing on the third feature based on two different convolution networks to obtain a fourth feature and a fifth feature. And the fourth feature and the fifth feature can be used as affine transformation coefficients for performing element-level refinement adjustment on the first feature of the first image, so that the features in the spatial information can be fully combined with the features of the first image. Finally, the terminal executes multiplication operation on the first characteristic and the fourth characteristic to obtain a sixth characteristic; and the terminal performs addition operation on the sixth feature and the fifth feature to obtain the first fusion feature.
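For ease of understanding, the following illustrative Python (PyTorch) sketch shows a single fusion step of the kind described above (stacking, two convolution branches producing the fourth and fifth features, element-wise multiplication, then addition), corresponding to what one GSFT Layer does in the example of fig. 8 below; the channel count, kernel size, and class name are assumptions.

```python
# Minimal sketch of the affine fusion step, assuming 64-channel features.
import torch
import torch.nn as nn

class SpatialFeatureFusion(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.to_scale = nn.Conv2d(2 * channels, channels, 3, padding=1)  # fourth feature (gamma)
        self.to_shift = nn.Conv2d(2 * channels, channels, 3, padding=1)  # fifth feature (beta)

    def forward(self, image_feat, spatial_feat):
        third = torch.cat([image_feat, spatial_feat], dim=1)  # stacking -> third feature
        gamma = self.to_scale(third)                          # first convolution branch
        beta = self.to_shift(third)                           # second convolution branch
        sixth = image_feat * gamma                            # multiplication -> sixth feature
        return sixth + beta                                   # addition -> first fusion feature

fusion = SpatialFeatureFusion(64)
first_feature = torch.rand(1, 64, 270, 480)
second_feature = torch.rand(1, 64, 270, 480)
first_fused = fusion(first_feature, second_feature)
```

In this sketch the spatial-information feature only modulates the image feature (scale and shift) rather than being concatenated into it, which matches the element-level refinement described above.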
For example, referring to fig. 8, fig. 8 is a schematic diagram illustrating first fusion feature extraction according to an embodiment of the present application. In fig. 8, the two inputs are the stacked spatial information and the first image. GCPI ResBlock denotes a guided coding prior injection residual block (GCPI ResBlock), which is used to fuse the features of an image with the features of the spatial information multiple times; GSFT Layer denotes a code-stream-guided spatial feature transform layer (GSFT Layer), which is used to fuse the features of an image with the features of the spatial information. Each GCPI ResBlock includes a plurality of GSFT Layers and convolutional networks.
The stacked spatial information is processed by a convolutional network to obtain the second feature Ψt, and the first image is processed by a convolutional network to obtain the first feature. The first feature and the second feature Ψt are input into one of the GSFT Layers of a GCPI ResBlock. In the GSFT Layer, the first feature and the second feature Ψt are stacked to obtain the third feature. The third feature then passes through two different convolutional networks to obtain the fourth feature and the fifth feature, where the fourth feature may be the affine transformation coefficient γ and the fifth feature may be the affine transformation coefficient β. The first feature is multiplied by the affine transformation coefficient γ, and the result is added to the affine transformation coefficient β to obtain an initial fusion feature. After being processed by a convolutional network, the initial fusion feature is input into the next GSFT Layer of the GCPI ResBlock, which performs a fusion step similar to the one above. After the fusion feature output by this GCPI ResBlock is obtained, it continues to pass sequentially through several subsequent GCPI ResBlocks, and the first fusion feature is finally obtained.
Optionally, the terminal may use 3 to 7 GCPI ResBlocks to perform feature fusion, and each GCPI ResBlock may include 2 to 4 GSFT Layers. For example, as shown in fig. 8, the terminal performs feature fusion using 3 GCPI ResBlocks, each of which includes 2 GSFT Layers. In this embodiment, the numbers of GCPI ResBlocks and GSFT Layers are not specifically limited.
According to the method and the device, element-level refinement adjustment is performed on the first feature of the first image based on the fourth feature and the fifth feature which are affine transformation coefficients, so that effective information in the first feature can be effectively enhanced, noise in the first feature is suppressed, and the quality of the fused feature is improved.
In addition to the above feature fusion method, the terminal may also perform fusion processing on the first feature and the second feature by using an existing feature fusion method, and the embodiment does not limit the feature fusion method.
Step 605: the terminal reconstructs a target image based on the first fusion feature, where the resolution of the target image is higher than that of the first image.
Finally, after the first fusion feature is obtained, the terminal can reconstruct and obtain a target image based on the first fusion feature, wherein the target image is an image after super-resolution reconstruction.
In the embodiment, in the process of reconstructing the super-resolution image, the spatial domain information in the code stream information is added, the features obtained by extracting the spatial domain information are fused with the image features to obtain the fusion features, and finally, the high-resolution target image is reconstructed based on the fusion features. Because the spatial information in the code stream information can provide the image characteristics of the image in the local space and the coding details of the image, the characteristics of the spatial information are fused in the super-resolution reconstruction process of the image, effective information in the image characteristics can be effectively enhanced, and noise caused by compression in the image characteristics is inhibited, so that the super-resolution reconstruction effect of the image is improved, and a reconstructed image with higher quality is obtained.
For the convenience of understanding, the process of reconstructing the target image based on the first fusion feature by the terminal will be described in detail below.
In this embodiment, the terminal may implement super-resolution reconstruction of the first image based on the first image and one or more images adjacent to the first image.
Specifically, the terminal may acquire one or more images temporally adjacent to the first image, and find the fusion feature corresponding to the one or more images based on the spatial information corresponding to the one or more images. And finally, reconstructing to obtain a target image based on the first fusion characteristic of the first image and the fusion characteristics corresponding to the one or more images. The process of obtaining the fusion features corresponding to one or more images is similar to the process of obtaining the first fusion feature.
Illustratively, the terminal may decode to obtain a second image based on the code stream information and obtain second spatial information corresponding to the second image in the code stream information, where the second image is adjacent to the first image in the time domain. For example, the second image may be a previous image or a subsequent image of the first image in the sequence of images.
Then, the terminal acquires a second fusion feature corresponding to the second image. Specifically, the terminal performs feature extraction processing on the second image to obtain the features of the second image; the terminal performs feature extraction processing on the second spatial information corresponding to the second image to obtain the features of the second spatial information; and the terminal performs fusion processing on the features of the second image and the features of the second spatial information to obtain the second fusion feature. The manner in which the terminal performs the fusion processing on the features of the second image and the features of the second spatial information may be the same as the manner in which the fusion processing is performed on the first feature and the features of the first spatial information.
And finally, the terminal reconstructs the target image based on the first fusion characteristic and the second fusion characteristic.
Optionally, in order to improve the super-resolution reconstruction quality of the image as much as possible, the terminal may perform an alignment operation and an update operation on the second fusion feature, so that the updated second fusion feature can provide more effective information for the super-resolution reconstruction of the image. The process of reconstructing the target image by the terminal based on the first fusion feature and the second fusion feature may be as follows.
Specifically, the terminal may perform an alignment operation on the second fused feature based on the first fused feature, so as to obtain an alignment feature. The alignment feature is obtained based on the second fusion feature, and the result of each coordinate in the alignment feature is not uniquely determined by a certain coordinate in the second fusion feature, but is determined by a local feature block obtained by taking the coordinate pointed by the motion vector in the second fusion feature as the center and a corresponding attention coefficient.
For example, the terminal may obtain a first local feature block corresponding to each coordinate in the first fusion feature and a second local feature block corresponding to each coordinate in the second fusion feature. The first local feature block and the second local feature block have the same size; for example, both may have a size of 3×3. The first local feature block is a feature block determined by taking a coordinate in the first fusion feature as its center point, and the second local feature block is a feature block determined by taking a coordinate in the second fusion feature as its center point. After obtaining the first local feature block and the second local feature block, the terminal determines an attention coefficient between the first local feature block and the second local feature block through a convolutional network (for example, a network including two fully connected layers), thereby obtaining the attention coefficient of the second local feature block corresponding to each coordinate in the second fusion feature. Based on these attention coefficients, the terminal performs a weighted average operation on each second local feature block in the second fusion feature to obtain the alignment feature. The weighted average operation multiplies each element in the second local feature block by the corresponding coefficient among the attention coefficients to obtain a plurality of updated elements, then adds the updated elements and computes their average.
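For ease of understanding, the following simplified Python (PyTorch) sketch illustrates this patch-wise alignment, assuming that the motion-vector shift has already been applied so that corresponding coordinates line up, that the local feature blocks are 3×3, and that a small two-layer fully connected network predicts the attention coefficients (normalized here with a softmax, which is an added assumption). Border handling and other details are simplified.

```python
# Simplified sketch of patch-wise alignment with predicted attention coefficients.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchAlign(nn.Module):
    def __init__(self, channels: int = 64, patch: int = 3):
        super().__init__()
        self.patch = patch
        k2 = patch * patch
        # Two fully connected layers predict one coefficient per patch position.
        self.coeff_net = nn.Sequential(
            nn.Linear(2 * channels * k2, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, k2),
        )

    def forward(self, ref_feat, nbr_feat):
        b, c, h, w = ref_feat.shape
        k2 = self.patch * self.patch
        pad = self.patch // 2
        # Extract the 3x3 local feature block around every coordinate.
        ref_patches = F.unfold(ref_feat, self.patch, padding=pad)    # (b, c*k2, h*w)
        nbr_patches = F.unfold(nbr_feat, self.patch, padding=pad)    # (b, c*k2, h*w)
        both = torch.cat([ref_patches, nbr_patches], dim=1)          # (b, 2*c*k2, h*w)
        coeff = self.coeff_net(both.transpose(1, 2))                 # (b, h*w, k2)
        coeff = coeff.softmax(dim=-1)                                # attention coefficients
        # Weighted average over each 3x3 block of the neighbouring feature.
        nbr = nbr_patches.transpose(1, 2).reshape(b, h * w, c, k2)   # (b, h*w, c, k2)
        aligned = (nbr * coeff.unsqueeze(2)).sum(dim=-1)             # (b, h*w, c)
        return aligned.transpose(1, 2).reshape(b, c, h, w)

align = PatchAlign(64)
aligned_feature = align(torch.rand(1, 64, 64, 64), torch.rand(1, 64, 64, 64))
```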
After the alignment feature is obtained, the terminal calculates the similarity between each coordinate in the first fusion feature and the corresponding coordinate in the alignment feature, so as to obtain an attention map corresponding to the alignment feature. For example, the terminal may calculate a cosine similarity between each coordinate in the first fusion feature and a corresponding coordinate of the coordinate in the alignment feature to obtain the above-mentioned attention map. Wherein, the attention map corresponding to the alignment feature is used for representing the matching condition of the first fusion feature and the alignment feature on each coordinate. The higher the degree of matching of the first fused feature with the alignment feature in coordinates, the greater the value of the corresponding coordinates on the attention map.
And then, the terminal updates the alignment feature according to the similarity of each coordinate on the attention map to obtain the updated alignment feature. For example, the terminal may multiply each coordinate on the alignment feature point by point with a corresponding coordinate on the attention map to obtain an updated alignment feature. That is, the respective coordinates on the alignment feature are updated using the values of the respective coordinates on the attention map as coefficients. Updating the alignment feature based on the attention map may cause a local region on the alignment feature that is more similar to the first fused feature to be adaptively assigned a higher attention weight; and local regions on the alignment feature that are less similar to the first fused feature are assigned a lower attention weight.
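For ease of understanding, the following illustrative Python (PyTorch) sketch computes the per-coordinate cosine similarity between the first fusion feature and the alignment feature and uses it to reweight the alignment feature; tensor shapes are assumptions.

```python
# Minimal sketch: attention map from channel-wise cosine similarity,
# then point-wise reweighting of the alignment feature.
import torch
import torch.nn.functional as F

first_fused = torch.rand(1, 64, 64, 64)
aligned = torch.rand(1, 64, 64, 64)

# Cosine similarity along the channel dimension gives one value per coordinate.
attention_map = F.cosine_similarity(first_fused, aligned, dim=1, eps=1e-8)   # (1, 64, 64)

# Multiply each coordinate of the alignment feature by its attention value.
updated_aligned = aligned * attention_map.unsqueeze(1)                        # (1, 64, 64, 64)
```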
And finally, reconstructing to obtain the target image by the terminal based on the first fusion characteristic and the updated alignment characteristic. Specifically, the terminal may perform stacking processing on the first fused feature and the updated alignment feature through a convolution operation to obtain a final fused feature. And the terminal reconstructs the target image based on the final fusion characteristic.
In the above example, the terminal performs super-resolution reconstruction of the first image based on the first image and one image adjacent to the first image. In practical applications, the terminal may perform super-resolution reconstruction of the first image based on the first image and a plurality of images adjacent to the first image. For example, the terminal may perform super-resolution reconstruction of the first image based on N images located before the first image and N images located after the first image, where N is an integer greater than or equal to 1. For example, N may be 1 or 2.
Optionally, in the process of performing super-resolution reconstruction on each image in the image sequence by the terminal, the terminal needs to acquire the fusion feature corresponding to each image and the fusion feature corresponding to the adjacent image of each image to perform super-resolution reconstruction on the images. Therefore, the terminal can store the obtained fusion features of the image into the cache space. Therefore, in the process that the terminal executes super-resolution reconstruction on the subsequent images, the terminal can acquire the corresponding fusion features of the images from the buffer space, the terminal is prevented from repeatedly extracting the fusion features of the images, and the operation efficiency is improved.
Illustratively, it is assumed that the terminal performs super-resolution reconstruction of an image based on a current image frame and a previous image frame. Then, the second image is a previous image of the first image, when the terminal performs super-resolution reconstruction on the second image, the terminal obtains a second fusion feature corresponding to the second image and a fusion feature corresponding to the previous image of the second image, and stores the obtained fusion features in the cache space. In this way, during the super-resolution reconstruction of the first image, the terminal may obtain the second fusion feature corresponding to the second image from the storage space according to the frame number of the second image, so as to avoid re-obtaining the fusion feature corresponding to the second image.
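For ease of understanding, the following illustrative Python sketch shows a feature cache keyed by frame number, so that the fusion feature of an adjacent frame computed during an earlier reconstruction can be reused instead of being re-extracted; the function names are placeholders standing in for steps 602-604.

```python
# Minimal sketch of caching fusion features by frame number.
import torch

feature_cache: dict = {}   # frame number -> fusion feature

def extract_fused_feature(frame_idx: int) -> torch.Tensor:
    # Placeholder for feature extraction and spatial-information fusion (steps 602-604).
    return torch.rand(1, 64, 270, 480)

def get_fused_feature(frame_idx: int) -> torch.Tensor:
    # Reuse the cached feature if it was already computed for an earlier frame.
    if frame_idx not in feature_cache:
        feature_cache[frame_idx] = extract_fused_feature(frame_idx)
    return feature_cache[frame_idx]

# Reconstructing frame 9 reuses the cached features of frames 8 and 10 if present.
features = [get_fused_feature(i) for i in (8, 9, 10)]
```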
Referring to fig. 9, fig. 9 is a schematic diagram illustrating super-resolution reconstruction performed based on the fusion features of a plurality of images according to an embodiment of the present application. As shown in the figure, when performing super-resolution reconstruction on image i-1, the terminal takes image i-2 and image i as the adjacent images of image i-1, and needs to obtain the fusion features corresponding to image i-2, image i-1, and image i respectively. Specifically, the terminal obtains the fusion feature F_{i-2} corresponding to image i-2 based on spatial information i-2 and image i-2, obtains the fusion feature F_{i-1} corresponding to image i-1 based on spatial information i-1 and image i-1, and obtains the fusion feature F_i corresponding to image i based on spatial information i and image i. Finally, the terminal reconstructs super-resolution image i-1 based on the fusion features F_{i-2}, F_{i-1}, and F_i. Similarly, when performing super-resolution reconstruction on image i, the terminal needs to obtain the fusion features corresponding to image i-1, image i, and image i+1 respectively. Since the terminal has already obtained the fusion feature corresponding to image i-1 when performing super-resolution reconstruction on image i-1, the terminal can reuse the extracted fusion feature F_{i-1}, thereby avoiding recomputing F_{i-1}.
For ease of understanding, the image processing method provided in the embodiments of the present application will be described in detail below with reference to specific examples.
Referring to fig. 10, fig. 10 is a schematic diagram illustrating an implementation of an image processing method based on a super-resolution network model according to an embodiment of the present application. As shown in fig. 10, the above-described image processing method can be performed by a super-resolution network model in this embodiment. By inputting the code stream information, such as the division information, residual signals, motion vectors, and prediction signals, together with the decoded low-resolution video into the super-resolution network model, the super-resolution video output by the super-resolution network model can be obtained.
Referring to fig. 11, fig. 11 is a schematic view illustrating an implementation of an image processing method on a terminal according to an embodiment of the present application. As shown in fig. 11, the terminal 1100 includes a super-resolution network model 1110 at a software level. The super-resolution network model 1110 includes a feature extraction module 1111, an alignment module 1112, a feature fusion module 1113, and a feature reconstruction module 1114. The terminal 1100 comprises, at a hardware level, a decoder 1121, a feature buffer space 1122 and an output buffer space 1123.
Referring to fig. 12, fig. 12 is a schematic diagram illustrating an implementation of an image processing method according to an embodiment of the present disclosure. As shown in fig. 11 and 12, the decoder 1121 decodes based on the received code stream, and inputs the code stream information obtained by decoding and the decoded image to the feature extraction module 1111 of the super-resolution network model 1110. The feature extraction module 1111 extracts spatial information features and image features from the spatial information and the decoded image in the code stream information, and outputs the spatial information features and the image features to the feature cache space 1122 and the alignment module 1112. The alignment module 1112 receives time domain information (motion vector) in the code stream information output by the decoder, and performs an alignment operation on the features of the adjacent image frames extracted by the feature extraction module 1111. When the features of all the adjacent image frames are aligned, the features of the current image frame and the features of the aligned adjacent image frames are output to the feature fusion module 1113. The feature fusion module 1113 calculates the inter-frame attention coefficient of the current image frame and each adjacent image frame, and then dynamically fuses the multi-frame features to the unique optimized features by taking the inter-frame attention coefficient as the weight. The optimized features are subjected to high resolution frame reconstruction operations in the feature reconstruction module 1114 through cascaded convolution and sub-pixel convolution to obtain a target image. Finally, the target image is output to the output buffer space 1123, awaiting display.
For the sake of understanding, the steps executed by the respective modules on the terminal will be described in detail below with reference to the accompanying drawings.
Referring to fig. 13, fig. 13 is a schematic diagram illustrating steps performed by a decoder according to an embodiment of the present disclosure. As shown in fig. 13, the steps performed by the decoder include the following steps 1301-1303.
Step 1301, the decoder receives a video stream.
Illustratively, the terminal receives a video stream from a transmitting end and inputs the video stream to the decoder. The video stream may be, for example, a video stream encoded with the H.264 or H.265 coding standard.
In step 1302, the decoder decodes the video code stream to obtain an image sequence and corresponding code stream information.
And the decoder decodes the acquired video code stream to obtain an image sequence and code stream information corresponding to each image in the image sequence. The code stream information comprises a motion vector, dividing information, a residual signal and a prediction signal.
In step 1303, the decoder outputs the image sequence and the corresponding code stream information.
When outputting the image sequence and the corresponding code stream information, the decoder may perform vectorization and normalization processing on them through the NumPy library, so that feature extraction can subsequently be performed in the super-resolution network model.
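For ease of understanding, the following illustrative NumPy sketch shows one possible normalization of the decoder outputs before feature extraction; the value ranges and array shapes are assumptions, not values fixed by the embodiment.

```python
# Minimal sketch of vectorizing and normalizing decoder outputs with NumPy.
import numpy as np

decoded_frame = np.random.randint(0, 256, size=(270, 480, 3), dtype=np.uint8)   # decoded image
residual_signal = np.random.randint(-128, 128, size=(270, 480), dtype=np.int16) # residual map

frame_norm = decoded_frame.astype(np.float32) / 255.0        # map pixel values to [0, 1]
residual_norm = residual_signal.astype(np.float32) / 128.0   # map residual values to [-1, 1]
```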
Referring to fig. 14, fig. 14 is a schematic diagram illustrating steps performed by a feature extraction module according to an embodiment of the present disclosure. As shown in fig. 14, the steps performed by the feature extraction module include the following steps 1401-1407.
In step 1401, the feature extraction module receives a plurality of images and corresponding spatial information.
After the decoder completes decoding, the feature extraction module receives spatial information corresponding to each of the plurality of images from the decoder.
In step 1402, the feature extraction module determines whether there is an image with no extracted features.
When performing super-resolution reconstruction on an image, the feature extraction module needs to extract features of the current image and features of one or more images adjacent to the current image (hereinafter referred to as features of adjacent frames). Therefore, the feature extraction module determines whether or not the images whose features are currently required to be extracted have all been subjected to the feature extraction processing.
In step 1403, if there is an image with no extracted features, the feature extraction module determines whether the features of the image exist in the feature cache space.
After the feature extraction module performs feature extraction processing on the image, the features of the image are stored in the feature cache space. Therefore, the feature extraction module can determine whether the feature of the image exists in the feature cache space according to the frame number of the image, so that the feature extraction processing on the image is prevented from being repeated.
In step 1404, if the features of the image exist in the feature cache space, the feature extraction module reads the features of the image from the feature cache space.
Step 1405, if the feature of the image does not exist in the feature cache space, the feature extraction module performs extraction of the image feature based on the image and the corresponding spatial information, and updates the feature cache space.
The process of the feature extraction module performing the extraction of the image features based on the image and the corresponding spatial information may refer to the embodiment corresponding to fig. 8, which is not described herein again.
In step 1406, the feature extraction module generates a feature pyramid corresponding to the features of the image.
In this embodiment, in order to better capture scale-invariant features and handle the complex displacements that may occur in the subsequent alignment step, the feature extraction module performs bilinear downsampling on the features of each image to construct a feature pyramid corresponding to each image. The feature pyramid comprises multiple layers of features; the feature at the lowest layer has the largest size and the feature at the highest layer has the smallest size, so the size decreases from bottom to top. The lowest-layer feature is bilinearly downsampled step by step to obtain features with successively reduced sizes.
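For ease of understanding, the following illustrative Python (PyTorch) sketch builds a feature pyramid by repeated bilinear downsampling; the number of levels and the scale factor are assumptions.

```python
# Minimal sketch of building a feature pyramid by repeated bilinear downsampling.
import torch
import torch.nn.functional as F

def build_pyramid(feature: torch.Tensor, levels: int = 3):
    pyramid = [feature]                       # bottom level: largest size
    for _ in range(levels - 1):
        feature = F.interpolate(feature, scale_factor=0.5,
                                mode="bilinear", align_corners=False)
        pyramid.append(feature)               # each level is half the previous size
    return pyramid

pyramid = build_pyramid(torch.rand(1, 64, 270, 480))
```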
Referring to fig. 15, fig. 15 is a schematic diagram illustrating steps performed by an alignment module according to an embodiment of the present disclosure. As shown in fig. 15, the steps performed by the alignment module include the following steps 1501-1502.
Step 1501, obtaining the motion vector in the code stream information and the feature pyramid transmitted by the feature extraction module.
The alignment module acquires a motion vector in the code stream information, and a feature pyramid corresponding to an image needing to perform super-resolution reconstruction and a feature pyramid corresponding to an adjacent frame of the image.
Step 1502, an alignment operation is performed on the feature pyramid of the adjacent frame based on the feature pyramid of the current image.
In step 1502, the alignment module performs the same alignment operation for each layer of features in the feature pyramid of the adjacent frame.
Referring to fig. 16, fig. 16 is a schematic diagram illustrating an alignment operation performed according to an embodiment of the present disclosure. Taking one level of features in the feature pyramid as an example, fig. 16 shows the specific steps of the alignment module performing the alignment operation on that level of features. Specifically, based on the motion vector (MV), the alignment module determines, for each coordinate p in the feature of the current image (that is, the image on which super-resolution reconstruction is to be performed), the corresponding coordinate p' in the adjacent frame feature, where p = (x, y) and p' is the coordinate pointed to by the motion vector. Then, the alignment module determines the local feature block corresponding to the coordinate p as a feature block of size 3×3 centered on the coordinate p, and determines the local feature block corresponding to the coordinate p' as a feature block of size 3×3 centered on the coordinate p'. After the local feature blocks corresponding to the coordinates p and p' are obtained, they are processed by a convolutional network g to generate the attention coefficient k_p of the local feature block corresponding to the coordinate p'. Finally, a weighted average of the local feature block corresponding to the coordinate p' is computed using the attention coefficient k_p, which gives the alignment result at the coordinate p'. The above alignment operation is performed for each coordinate in the adjacent frame feature to obtain the aligned adjacent frame feature.
For the feature pyramids of 2N adjacent frames corresponding to the current image, after the alignment module performs the alignment operation, feature pyramids of 2N aligned adjacent frames can be obtained. Wherein N is an integer greater than or equal to 1. At this time, the alignment module may output the obtained feature pyramids of the 2N aligned adjacent frames and the feature pyramid of the current image to the feature fusion module.
Referring to fig. 17, fig. 17 is a schematic diagram illustrating steps performed by a feature fusion module according to an embodiment of the present disclosure. As shown in fig. 17, the steps performed by the feature fusion module include the following steps 1701-1705.
Step 1701, receives the feature pyramid of the aligned adjacent frame and the feature pyramid of the current image.
At step 1702, it is determined whether there are neighboring frame features for which an attention map is not computed.
Step 1703: if there are adjacent frame features for which the attention map has not been calculated, calculate the attention map of the feature pyramid of the aligned adjacent frame.
In this embodiment, for the adjacent frame features of each layer in the feature pyramid of the adjacent frame, the feature fusion module calculates cosine similarity between each coordinate in the adjacent frame features and each coordinate in the feature corresponding to the current image, so as to obtain an attention map corresponding to the feature pyramid of the adjacent frame.
Step 1704, feature pyramids of aligned adjacent frames are enhanced based on the attention map.
In this embodiment, the feature fusion module multiplies, point by point, each coordinate on the adjacent frame feature by a corresponding coordinate on the attention map according to the similarity of each coordinate on the attention map, so as to obtain an updated adjacent frame feature. That is, the respective coordinates on the adjacent frame feature are updated using the values of the respective coordinates on the attention map as coefficients. Updating the adjacent frame features based on the attention map, so that local areas on the adjacent frame features which are more similar to the features corresponding to the current image can be adaptively assigned with higher attention weights; and local areas with lower feature similarity corresponding to the current image on the features of the adjacent frames are assigned with lower attention weights.
Step 1705: if there are no adjacent frame features for which the attention map has not been calculated, stack all features of the same scale to obtain the fused feature.
After all the adjacent frame features are strengthened based on the attention force diagram, the feature fusion module stacks all the features of the same layer in the feature pyramid, and finally the fused features are obtained.
Referring to fig. 18, fig. 18 is a schematic diagram illustrating steps performed by a feature reconstruction module according to an embodiment of the present disclosure. As shown in fig. 18, the steps performed by the feature reconstruction module include the following steps 1801-1805.
Step 1801: perform reconstruction optimization on the fused features by using k cascaded residual blocks with scale fusion.
In this step, the feature reconstruction module generates a reconstructed feature for the fused feature pyramid by using k cascaded residual blocks with scale fusion, and outputs a new feature pyramid. Wherein the new feature pyramid processes the multi-layer features by upsampling, stacking, and convolution to produce a feature that is the same size as the feature map at the bottom layer of the input pyramid.
Step 1802, performing size expansion operation on the features output by the cascade residual block by adopting sub-pixel convolution.
Step 1803, a high resolution reconstructed residual signal is generated by convolution operation.
The feature reconstruction module performs a nonlinear mapping on the size-expanded features through a convolution operation to obtain the final high-resolution reconstructed residual signal.
Step 1804: add the reconstructed residual signal to the bilinear interpolation upscaling result of the current image to obtain the target image.
Finally, the feature reconstruction module adds the reconstructed residual signal to the result of bilinearly upscaling the current image to obtain the target image.
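For ease of understanding, the following illustrative Python (PyTorch) sketch shows the reconstruction flow of steps 1801-1804 under the assumption of a 4× upscaling factor and plain residual blocks (the scale fusion across pyramid levels is omitted here for brevity); layer widths, block counts, and class names are assumptions.

```python
# Minimal sketch: residual refinement, sub-pixel convolution (PixelShuffle),
# residual prediction, and addition to the bilinearly upsampled current image.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class Reconstructor(nn.Module):
    def __init__(self, ch: int = 64, scale: int = 4, k: int = 5):
        super().__init__()
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(k)])   # k cascaded residual blocks
        self.upsample = nn.Sequential(
            nn.Conv2d(ch, ch * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale))                                      # sub-pixel convolution
        self.to_residual = nn.Conv2d(ch, 3, 3, padding=1)                # high-resolution residual
        self.scale = scale

    def forward(self, fused_feature, current_image):
        x = self.blocks(fused_feature)
        residual = self.to_residual(self.upsample(x))
        base = F.interpolate(current_image, scale_factor=self.scale,
                             mode="bilinear", align_corners=False)       # bilinear upscaling
        return base + residual                                           # target image

net = Reconstructor()
target = net(torch.rand(1, 64, 64, 64), torch.rand(1, 3, 64, 64))
```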
Referring to fig. 19, fig. 19 is a schematic diagram illustrating a comparison of the effects of different image processing methods according to an embodiment of the present application. In fig. 19, the ordinate is the peak signal-to-noise ratio (PSNR) in decibels (dB), and the abscissa is the frame rate, i.e., the number of frames processed per second (FPS). Solid dots represent an existing image processing method, and open dots represent the image processing method provided in the embodiment of the present application. It can be seen that the image processing method provided in the embodiment of the present application outperforms the existing image processing method in terms of both PSNR and FPS.
Referring to fig. 20, fig. 20 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present disclosure. As shown in fig. 20, an image processing apparatus according to an embodiment of the present application includes: an acquisition unit 2001 and a processing unit 2002. The acquiring unit 2001 is configured to acquire a first image and first spatial information in code stream information, where the first image is obtained by decoding based on the code stream information, the first image corresponds to the first spatial information, and the first spatial information is coding information of the first image in space. The processing unit 2002 is configured to perform feature extraction processing on the first image to obtain a first feature; the processing unit 2002 is further configured to perform feature extraction processing on the first spatial information to obtain a second feature; the processing unit 2002 is further configured to perform fusion processing on the first feature and the second feature to obtain a first fusion feature; and the processing unit 2002 is further configured to reconstruct a target image based on the first fusion feature, where the resolution of the target image is higher than that of the first image.
In one possible implementation, the first spatial information includes one or more of the following information: dividing information, a residual signal, and a prediction signal;
wherein the partition information indicates a partition of a coding block in the first picture, the residual signal indicates a difference between the coding block in the first picture and a corresponding reference coding block, and the prediction signal is a signal corresponding to the first picture before performing decoding loop filtering.
In a possible implementation manner, the processing unit 2002 is further configured to: when the first spatial information includes multiple types of information, stack the multiple types of information in the first spatial information to obtain stacked spatial information; and perform feature extraction processing on the stacked spatial information to obtain the second feature.
In a possible implementation manner, the processing unit 2002 is further configured to: performing stacking processing on the first feature and the second feature to obtain a third feature; respectively executing first convolution processing and second convolution processing on the third feature to obtain a fourth feature and a fifth feature; performing multiplication operation on the first characteristic and the fourth characteristic to obtain a sixth characteristic; and performing an addition operation on the sixth feature and the fifth feature to obtain the first fused feature.
In a possible implementation manner, the obtaining unit 2001 is further configured to obtain a second fusion feature, where the second fusion feature is obtained by performing fusion processing on a feature of the second image and a feature of the second spatial information, the feature of the second image is obtained by performing feature extraction processing on the second image, the feature of the second spatial information is obtained by performing feature extraction processing on the second spatial information, the second image is obtained by decoding based on the code stream information, the second image is adjacent to the first image in a time domain, and the second image corresponds to the second spatial information; the processing unit 2002 is further configured to reconstruct the target image based on the first fusion feature and the second fusion feature.
In a possible implementation manner, the obtaining unit 2001 is further configured to obtain the second fusion feature from a storage space according to the frame number of the second image, where the second fusion feature is stored in the storage space when performing image processing on an image previous to the first image.
In a possible implementation manner, the processing unit 2002 is further configured to: performing an alignment operation on the second fusion feature based on the first fusion feature to obtain an alignment feature; calculating a similarity between each coordinate in the first fused feature and a corresponding coordinate in the alignment feature; updating the alignment feature according to the similarity to obtain an updated alignment feature; and reconstructing to obtain the target image based on the first fusion characteristic and the updated alignment characteristic.
In a possible implementation manner, the obtaining unit 2001 is further configured to obtain a first local feature block corresponding to each coordinate in the first fused feature and a second local feature block corresponding to each coordinate in the second fused feature; the processing unit 2002, further configured to determine an attention coefficient between the first local feature block and the second local feature block through a convolutional network; the processing unit 2002 is further configured to perform a weighted average operation on each second local feature block in the second fusion feature based on the attention coefficient to obtain the alignment feature.
Referring to fig. 21, fig. 21 is a schematic structural diagram of an execution device provided in the embodiment of the present application. The execution device 2100 may be embodied as a mobile phone, a tablet, a notebook computer, an intelligent wearable device, a server, and the like, which is not limited herein. The image processing apparatus described in the embodiment corresponding to fig. 20 may be deployed on the execution device 2100 to implement the image processing functions of the embodiment corresponding to fig. 20. Specifically, the execution device 2100 includes: a receiver 2101, a transmitter 2102, a processor 2103, and a memory 2104 (the number of processors 2103 in the execution device 2100 may be one or more; one processor is taken as an example in fig. 21), where the processor 2103 may include an application processor 21031 and a communication processor 21032. In some embodiments of the present application, the receiver 2101, the transmitter 2102, the processor 2103, and the memory 2104 may be connected by a bus or in another manner.
Memory 2104 may include read-only memory and random access memory, and provides instructions and data to processor 2103. A portion of memory 2104 may also include non-volatile random access memory (NVRAM). The memory 2104 stores a processor and operating instructions, executable modules or data structures, or a subset or an expanded set thereof, wherein the operating instructions may include various operating instructions for implementing various operations.
The processor 2103 controls the operation of the execution apparatus. In a particular application, the various components of the execution device are coupled together by a bus system that may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. For clarity of illustration, the various buses are referred to in the figures as a bus system.
The method disclosed in the embodiments of the present application may be applied to the processor 2103 or implemented by the processor 2103. The processor 2103 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 2103. The processor 2103 may be a general-purpose processor, a Digital Signal Processor (DSP), a microprocessor or a microcontroller, and may further include an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components. The processor 2103 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 2104, and the processor 2103 reads the information in the memory 2104 and performs the steps of the above method in combination with the hardware thereof.
The receiver 2101 may be used to receive input numeric or character information and generate signal inputs related to performing device related settings and function control. Transmitter 2102 may be configured to output numeric or character information via a first interface; the transmitter 2102 may also be configured to send instructions to the disk groups via the first interface to modify data in the disk groups; the transmitter 2102 may also include a display device such as a display screen.
In one embodiment of the present application, the processor 2103 is configured to execute the image processing method executed by the execution device in the embodiment corresponding to fig. 6.
Embodiments of the present application also provide a computer program product, which when executed on a computer causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
Also provided in an embodiment of the present application is a computer-readable storage medium, in which a program for signal processing is stored, and when the program is run on a computer, the program causes the computer to execute the steps executed by the aforementioned execution device, or causes the computer to execute the steps executed by the aforementioned training device.
The execution device, the training device, or the terminal device provided in the embodiments of the present application may specifically be a chip. The chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute the computer-executable instructions stored in the storage unit, so that the chip in the execution device executes the image processing method described in the above embodiments, or the chip in the training device executes the image processing method described in the above embodiments. Optionally, the storage unit is a storage unit in the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
Specifically, referring to fig. 22, fig. 22 is a schematic structural diagram of a chip provided in an embodiment of the present application. The chip may be implemented as a neural network processing unit (NPU) 2200. The NPU 2200 is mounted as a coprocessor on a host CPU (Host CPU), and the Host CPU allocates tasks. The core portion of the NPU is an arithmetic circuit 2203, and a controller 2204 controls the arithmetic circuit 2203 to extract matrix data from memory and perform multiplication.
In some implementations, the arithmetic circuit 2203 internally includes a plurality of processing elements (PEs). In some implementations, the arithmetic circuit 2203 is a two-dimensional systolic array. The arithmetic circuit 2203 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2203 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 2202 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit then takes the data of the matrix A from the input memory 2201, performs a matrix operation with the matrix B, and stores the partial or final result of the matrix in an accumulator (accumulator) 2208.
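For illustration only, the multiply-accumulate flow described above can be sketched in software as follows; the row-by-row streaming of the matrix A and the use of Python/NumPy are assumptions made for the sketch, not a description of the actual circuit.

import numpy as np

# Sketch of the multiply-accumulate flow: the weight matrix B is held fixed
# (as if buffered on the PEs), rows of the input matrix A are streamed in,
# and partial products are summed into an array that plays the role of the
# accumulator 2208.
def matmul_with_accumulator(A, B):
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n))                 # the accumulator / output matrix C
    for i in range(m):                   # stream one row of A at a time
        for p in range(k):               # each step adds a partial product
            C[i, :] += A[i, p] * B[p, :]
    return C

A = np.random.rand(4, 8)
B = np.random.rand(8, 3)
assert np.allclose(matmul_with_accumulator(A, B), A @ B)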
The unified memory 2206 is used for storing input data and output data. The weight data is transferred into the weight memory 2202 through a direct memory access controller (DMAC) 2205. The input data is also transferred into the unified memory 2206 through the DMAC.
The bus interface unit (BIU) 2222 is used for interaction between the AXI bus and the DMAC 2205 as well as the instruction fetch buffer (IFB) 2209.

The BIU 2222 is used by the instruction fetch buffer 2209 to obtain instructions from the external memory, and is further used by the DMAC 2205 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 2206, to transfer weight data to the weight memory 2202, or to transfer input data to the input memory 2201.
The vector calculation unit 2207 includes a plurality of operation processing units and, when necessary, performs further processing on the output of the arithmetic circuit 2203, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. It is mainly used for computation of non-convolution/non-fully-connected layers in the neural network, such as batch normalization, pixel-level summation, and up-sampling of feature planes.

In some implementations, the vector calculation unit 2207 can store the processed output vector to the unified memory 2206. For example, the vector calculation unit 2207 may apply a linear function or a nonlinear function to the output of the arithmetic circuit 2203, for example performing linear interpolation on the feature planes extracted by the convolutional layers, or applying a nonlinear function to a vector of accumulated values to generate activation values. In some implementations, the vector calculation unit 2207 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 2203, for example for use in subsequent layers of the neural network.
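For illustration only, a minimal software sketch of the kind of post-processing attributed to the vector calculation unit 2207 (normalization, pixel-level summation, and activation) is given below; the function name, the ReLU activation, and the NumPy implementation are assumptions made for the sketch.

import numpy as np

# Sketch of element-wise post-processing on the accumulator output:
# normalization, pixel-level summation with another feature plane,
# and a nonlinear activation producing activation values.
def vector_postprocess(acc_output, other_plane):
    mean = acc_output.mean()
    std = acc_output.std() + 1e-5
    normalized = (acc_output - mean) / std        # normalization-style rescaling
    summed = normalized + other_plane             # pixel-level summation
    return np.maximum(summed, 0.0)                # nonlinear activation (ReLU)

feature_plane = np.random.randn(16, 32, 32)       # hypothetical feature plane
skip_plane = np.random.randn(16, 32, 32)
activation_values = vector_postprocess(feature_plane, skip_plane)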
The instruction fetch buffer 2209 connected to the controller 2204 is used for storing instructions used by the controller 2204.

The unified memory 2206, the input memory 2201, the weight memory 2202, and the instruction fetch buffer 2209 are all on-chip memories, and the external memory is private to the NPU hardware architecture.
The processor mentioned in any of the above may be a general purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above programs.
It should be noted that the above-described apparatus embodiments are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided in the present application, the connection relationship between modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus necessary general-purpose hardware, and certainly can also be implemented by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structures used to implement the same function may be various, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, implementation by a software program is usually the better choice. Based on such an understanding, the technical solutions of the present application may be essentially embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a training device, or a network device) to execute the methods described in the embodiments of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (e.g., infrared, radio, or microwave) manner. The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a training device or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.

Claims (11)

1. An image processing method, characterized by comprising:
acquiring a first image and first spatial information in code stream information, wherein the first image is obtained by decoding based on the code stream information, the first image corresponds to the first spatial information, and the first spatial information is coding information of the first image in space;
performing feature extraction processing on the first image to obtain a first feature;
performing feature extraction processing on the first spatial information to obtain a second feature;
performing fusion processing on the first feature and the second feature to obtain a first fusion feature;
and reconstructing to obtain a target image based on the first fusion characteristic, wherein the resolution of the target image is higher than that of the first image.
2. The method of claim 1, wherein the first spatial information comprises one or more of: division information, a residual signal, and a prediction signal;

wherein the division information represents the division of coding blocks in the first image, the residual signal represents a difference between a coding block in the first image and a corresponding reference coding block, and the prediction signal is a signal corresponding to the first image before decoding loop filtering is performed.
3. The method according to claim 2, wherein the performing feature extraction processing on the first spatial information to obtain a second feature comprises:

when the first spatial information comprises a plurality of types of information, performing stacking processing on the plurality of types of information in the first spatial information to obtain stacked spatial information;

and performing feature extraction processing on the stacked spatial information to obtain the second feature.
4. The method according to any one of claims 1 to 3, wherein the performing a fusion process on the first feature and the second feature to obtain a first fused feature comprises:
performing stacking processing on the first feature and the second feature to obtain a third feature;
respectively executing first convolution processing and second convolution processing on the third feature to obtain a fourth feature and a fifth feature;
performing a multiplication operation on the first feature and the fourth feature to obtain a sixth feature;
and performing an addition operation on the sixth feature and the fifth feature to obtain the first fused feature.
5. The method according to any one of claims 1-4, further comprising:
acquiring a second fusion feature, wherein the second fusion feature is obtained by performing fusion processing on a feature of a second image and a feature of second spatial information, the feature of the second image is obtained by performing feature extraction processing on the second image, the feature of the second spatial information is obtained by performing feature extraction processing on the second spatial information, the second image is obtained by decoding based on the code stream information, the second image is adjacent to the first image in the time domain, and the second image corresponds to the second spatial information;

wherein the reconstructing to obtain a target image based on the first fusion feature comprises:

reconstructing to obtain the target image based on the first fusion feature and the second fusion feature.
6. The method of claim 5, wherein the obtaining the second fused feature comprises:
and acquiring the second fusion feature from a storage space according to the frame number of the second image, wherein the second fusion feature is stored in the storage space when image processing is performed on the previous image of the first image.
7. The method of claim 5 or 6, wherein the reconstructing to obtain the target image based on the first fusion feature and the second fusion feature comprises:
performing an alignment operation on the second fusion feature based on the first fusion feature to obtain an alignment feature;
calculating a similarity between each coordinate in the first fusion feature and a corresponding coordinate in the alignment feature;
updating the alignment feature according to the similarity to obtain an updated alignment feature;
and reconstructing to obtain the target image based on the first fusion characteristic and the updated alignment characteristic.
8. The method of claim 7, wherein the performing an alignment operation on the second fusion feature based on the first fusion feature to obtain an alignment feature comprises:
acquiring a first local feature block corresponding to each coordinate in the first fusion feature and a second local feature block corresponding to each coordinate in the second fusion feature;
determining, by a convolution network, an attention coefficient between the first local feature block and the second local feature block;
and performing weighted average operation on each second local feature block in the second fusion feature based on the attention coefficient to obtain the alignment feature.
9. An image processing apparatus, comprising a memory and a processor; the memory stores code, the processor is configured to execute the code, and when executed, the image processing apparatus performs the method of any of claims 1 to 8.
10. A computer storage medium storing instructions that, when executed by a computer, cause the computer to perform the method of any one of claims 1 to 8.
11. A computer program product having stored thereon instructions which, when executed by a computer, cause the computer to carry out the method of any one of claims 1 to 8.
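For illustration only, the fusion processing recited in claims 1 and 4 (stacking the first feature and the second feature, applying two separate convolutions to obtain a fourth feature and a fifth feature, multiplying the fourth feature with the first feature, and adding the fifth feature) can be sketched as follows; the use of PyTorch, the channel count, and the kernel size are assumptions made for the sketch and are not part of the claims.

import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    # Sketch of the fusion in claim 4: stack -> two convolutions -> multiply -> add.
    def __init__(self, channels=64):
        super().__init__()
        # first convolution processing -> fourth feature
        self.conv_fourth = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        # second convolution processing -> fifth feature
        self.conv_fifth = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, first_feature, second_feature):
        third_feature = torch.cat([first_feature, second_feature], dim=1)  # stacking processing
        fourth_feature = self.conv_fourth(third_feature)
        fifth_feature = self.conv_fifth(third_feature)
        sixth_feature = first_feature * fourth_feature   # multiplication operation
        return sixth_feature + fifth_feature             # addition -> first fusion feature

# Hypothetical usage: 64-channel features extracted from the decoded image and
# from its spatial information, for a 160x90 low-resolution frame.
image_feature = torch.randn(1, 64, 90, 160)
spatial_feature = torch.randn(1, 64, 90, 160)
first_fusion_feature = FusionBlock(64)(image_feature, spatial_feature)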
CN202110594578.4A 2021-05-28 2021-05-28 Image processing method and related device Pending CN115409697A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110594578.4A CN115409697A (en) 2021-05-28 2021-05-28 Image processing method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110594578.4A CN115409697A (en) 2021-05-28 2021-05-28 Image processing method and related device

Publications (1)

Publication Number Publication Date
CN115409697A true CN115409697A (en) 2022-11-29

Family

ID=84156014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110594578.4A Pending CN115409697A (en) 2021-05-28 2021-05-28 Image processing method and related device

Country Status (1)

Country Link
CN (1) CN115409697A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116256586A (en) * 2023-05-10 2023-06-13 广东电网有限责任公司湛江供电局 Overheat detection method and device for power equipment, electronic equipment and storage medium
CN116385267A (en) * 2023-03-29 2023-07-04 腾讯科技(深圳)有限公司 Image processing method, apparatus, program product, computer device, and storage medium


Similar Documents

Publication Publication Date Title
CN112529150B (en) Model structure, model training method, image enhancement method and device
CN112308200B (en) Searching method and device for neural network
CN113259665B (en) Image processing method and related equipment
CN110222717B (en) Image processing method and device
CN113066017B (en) Image enhancement method, model training method and equipment
CN113011562B (en) Model training method and device
CN112598597A (en) Training method of noise reduction model and related device
WO2022021938A1 (en) Image processing method and device, and neutral network training method and device
CN113326930B (en) Data processing method, neural network training method, related device and equipment
CN112070664A (en) Image processing method and device
CN112561028B (en) Method for training neural network model, method and device for data processing
CN113536970A (en) Training method of video classification model and related device
WO2022179588A1 (en) Data coding method and related device
CN113627163A (en) Attention model, feature extraction method and related device
CN113066018A (en) Image enhancement method and related device
CN115239581A (en) Image processing method and related device
TWI826160B (en) Image encoding and decoding method and apparatus
CN115409697A (en) Image processing method and related device
WO2021057091A1 (en) Viewpoint image processing method and related device
CN114066914A (en) Image processing method and related equipment
CN116258651A (en) Image processing method and related device
CN115022637A (en) Image coding method, image decompression method and device
CN114501031B (en) Compression coding and decompression method and device
CN116095183A (en) Data compression method and related equipment
CN111861877A (en) Method and apparatus for video hyper-resolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination