
CN112241936B - Image processing method, device and equipment and storage medium - Google Patents

Image processing method, device and equipment and storage medium

Info

Publication number
CN112241936B
CN112241936B
Authority
CN
China
Prior art keywords
image
candidate
processing
target
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910651715.6A
Other languages
Chinese (zh)
Other versions
CN112241936A (en)
Inventor
黄芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201910651715.6A priority Critical patent/CN112241936B/en
Publication of CN112241936A publication Critical patent/CN112241936A/en
Application granted granted Critical
Publication of CN112241936B publication Critical patent/CN112241936B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/73Deblurring; Sharpening
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

The invention provides an image processing method, an image processing apparatus, an image processing device, and a storage medium. The image processing method includes: acquiring position information of the same target object in at least two video frames of an acquired first video sequence in a first data format; intercepting the regions corresponding to the position information from the respective video frames to obtain at least two first images; and performing image processing on all the first images in the first data format to obtain a target image in a second data format, where the second data format is suitable for displaying and/or transmitting the target image. This reduces the information loss in the image output by the imaging device and effectively improves image quality.

Description

Image processing method, device and equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image processing method, apparatus, device, and storage medium.
Background
In the field of video surveillance, people's requirements on the quality of monitoring images are becoming higher and higher. With the development of computer technology, image processing technology has made great progress, and under normal conditions the quality of the image output by a camera is good. However, in some harsh environments, such as fog, rain, or night, or when the monitored object moves violently or is far from the imaging device, the image collected by the imaging device may be affected by factors such as smoke, rainwater, noise, blur, and low resolution, resulting in poor image quality. The image therefore needs to be processed to improve its quality.
For a monitoring image, people pay most attention to the target of interest, so performing image processing only on the target region can save camera resources while still meeting customer requirements.
After an image is acquired by the imaging device, it is generally input to a processing chip for processing (which may include bit-width compression, image processing, codec processing, etc.), then output to a receiving end, and image quality enhancement is finally performed at the receiving end. An image that has undergone this series of bit-width compression and encoding/decoding processes has already lost part of its original information, so enhancing the image on this basis makes it difficult to improve the image quality substantially.
Disclosure of Invention
In view of the above, the present invention provides an image processing method, apparatus, device, and storage medium, which can reduce information loss of an output image of an imaging device and effectively improve image quality.
A first aspect of the present invention provides an image processing method applied to an imaging apparatus, including:
acquiring position information of the same target object in at least two video frames of a video sequence from a first video sequence in an acquired first data format;
intercepting areas corresponding to the position information from each video frame respectively to obtain at least two first images;
And carrying out image processing on all the first images in the first data format to obtain target images in a second data format, wherein the second data format is suitable for displaying and/or transmitting the target images.
According to one embodiment of the invention, acquiring position information of the same target object in at least two video frames of a first video sequence in an acquired first data format from the video sequence comprises:
converting each video frame in the first video sequence into a first candidate image, and performing target detection processing on each first candidate image to obtain the position information of a target object in each video frame of the first video sequence;
and selecting the position information of the target object in at least two video frames from the detected position information.
According to one embodiment of the present invention, each video frame in the first video sequence is converted into a first candidate image, and the target detection processing is performed on each first candidate image to obtain the position information of the target object in each video frame of the first video sequence, including:
inputting the first video sequence into a trained first neural network, performing color processing on each video frame in the first video sequence by a color processing layer of the first neural network to obtain first candidate graphs, and performing target detection processing on each first candidate graph by at least one convolution layer of the first neural network to obtain position information of a target object in each video frame of the first video sequence;
The color processing layer is used for performing at least one of the following color processing: the method comprises the steps of graying processing, color channel separation processing and color information recombination processing, wherein the color processing layer at least comprises a specified convolution layer, and the step length of convolution kernel movement of the specified convolution layer is an integer multiple of the minimum unit of the color arrangement mode of the video frame.
According to one embodiment of the present invention, converting each video frame in the first video sequence into a first candidate map includes:
converting each video frame in the video sequence into a first candidate graph by preprocessing; the preprocessing at least includes color interpolation.
According to one embodiment of the present invention, performing a target detection process on a first candidate image to obtain location information of a target object in each video frame in a first video sequence, including:
and inputting each first candidate graph into a trained second neural network, and performing target detection processing on each first candidate graph by at least one convolution layer of the second neural network to obtain the position information of a target object in each video frame of the first video sequence.
In accordance with one embodiment of the present invention,
the location information includes: coordinates of a designated point on the target object, and a first dimension characterizing the size of the target object;
Intercepting areas corresponding to the position information from each video frame respectively to obtain at least two first images, wherein the first images comprise:
for each video frame, determining a reference position required by interception according to the coordinates of the target object in the position information of the video frame and a first size, intercepting a region with a preset size in the video frame by taking the reference position as a reference, and determining the intercepted region as a first image;
or,
for each video frame, taking the coordinates of the target object in the position information of the video frame as a reference, and cutting out a region with a first size in the video frame; and adjusting the cut-out area from the first size to the target size in a scaling or edge expansion mode, and determining the adjusted area as a first image.
According to one embodiment of the present invention, image processing is performed on all first images in a first data format to obtain a target image in a second data format, including:
converting each first image into a second candidate image;
aligning the positions of the target objects in the second candidate graphs to obtain aligned third candidate graphs;
and carrying out image processing on each third candidate graph to obtain the target image, wherein the image processing at least comprises fusion processing.
According to an embodiment of the present invention, the converting each first image into a second candidate image, aligning a position of a target object in each second candidate image to obtain aligned third candidate images, and performing image processing on each third candidate image to obtain the target image includes:
inputting each first image into a trained third neural network, carrying out color processing on each first image input by a color processing layer of the third neural network to obtain second candidate images, aligning the positions of a target object in each second candidate image by a sequence alignment sub-network of the third neural network to obtain aligned third candidate images, and carrying out image processing on each third candidate image by a sequence processing sub-network of the third neural network to obtain the target image;
wherein the color processing layer is configured to perform at least one of the following color processes: the method comprises the steps of graying processing, color channel separation processing and color information recombination processing, wherein the color processing layer at least comprises a specified convolution layer, and the step length of convolution kernel movement of the specified convolution layer is an integer multiple of the minimum unit of the color arrangement mode of the video frame.
According to one embodiment of the invention, converting each first image into a second candidate image includes:
converting each first image into a second candidate image by preprocessing; the preprocessing at least includes color interpolation.
According to one embodiment of the present invention, aligning positions of a target object in each second candidate image to obtain aligned third candidate images, and performing image processing on each third candidate image to obtain the target image includes:
and inputting each second candidate graph into a trained fourth neural network, aligning the positions of the target objects in each second candidate graph by a sequence alignment sub-network of the fourth neural network to obtain aligned third candidate graphs, and carrying out image processing on each third candidate graph by a sequence processing sub-network of the fourth neural network to obtain the target image.
According to an embodiment of the present invention, the aligning the positions of the target objects in the second candidate graphs by the sequence alignment sub-network to obtain an aligned third candidate graph includes:
the sequence alignment sub-network estimates a motion vector of the target object from the position in each non-reference candidate graph to the position of the reference candidate graph at least through a convolution layer, and performs motion compensation on the target object of each non-reference candidate graph according to the motion vector so as to align the target object in each non-reference candidate graph with the position in the reference candidate graph; the reference candidate graph is one candidate graph in all second candidate graphs, and the non-reference candidate graph is a candidate graph except the reference candidate graph in all second candidate graphs;
The sequence alignment sub-network determines the compensated non-reference candidate map and the reference candidate map as an aligned third candidate map.
According to one embodiment of the present invention, the sequence processing sub-network performs image processing on each third candidate image to obtain the target image, including:
and the sequence processing sub-network performs channel combination on the third candidate image at least through a combination layer to obtain a multi-channel image, and performs image processing on the multi-channel image through at least one convolution layer to obtain the target image.
According to one embodiment of the present invention, performing image processing on the aligned image to obtain the target image includes:
mapping each pixel value in each third candidate image into a designated image according to a preset mapping relation, and taking the designated image obtained after mapping as the target image, wherein the resolution of the designated image is larger than that of each aligned image;
or,
and calculating the statistic value of the pixel value at the same position in each third candidate graph, and generating the target image according to the statistic value at each position.
According to one embodiment of the present invention, the image processing of each third candidate image to obtain the target image includes:
Respectively carrying out enhancement processing on each third candidate image, and carrying out fusion processing on each third candidate image after the enhancement processing to obtain the target image;
or,
and carrying out fusion processing on each third candidate image to obtain a reference image, and carrying out enhancement processing on the reference image to obtain the target image.
A second aspect of the present invention provides an image processing apparatus applied to an image forming device, comprising:
the first processing module is used for acquiring the position information of the same target object in at least two video frames of the video sequence from the first video sequence in the acquired first data format;
the second processing module is used for respectively intercepting the areas corresponding to the position information from each video frame to obtain at least two first images;
and the third processing module is used for carrying out image processing on all the first images in the first data format to obtain target images in a second data format, and the second data format is suitable for displaying and/or transmitting the target images.
According to one embodiment of the present invention, the first processing module is specifically configured to, when acquiring, from a first video sequence in an acquired first data format, position information of a same target object in at least two video frames of the video sequence:
Converting each video frame in the first video sequence into a first candidate image, and performing target detection processing on each first candidate image to obtain the position information of a target object in each video frame of the first video sequence;
and selecting the position information of the target object in at least two video frames from the detected position information.
According to one embodiment of the present invention, each video frame in the first video sequence is converted into a first candidate image, and the target detection processing is performed on each first candidate image to obtain the position information of the target object in each video frame of the first video sequence, including:
inputting the first video sequence into a trained first neural network, performing color processing on each video frame in the first video sequence by a color processing layer of the first neural network to obtain first candidate graphs, and performing target detection processing on each first candidate graph by at least one convolution layer of the first neural network to obtain position information of a target object in each video frame of the first video sequence;
the color processing layer is used for performing at least one of the following color processing: the method comprises the steps of graying processing, color channel separation processing and color information recombination processing, wherein the color processing layer at least comprises a specified convolution layer, and the step length of convolution kernel movement of the specified convolution layer is an integer multiple of the minimum unit of the color arrangement mode of the video frame.
According to one embodiment of the present invention, when the first processing module converts each video frame in the first video sequence into the first candidate image, the first processing module is specifically configured to:
converting each video frame in the video sequence into a first candidate graph by preprocessing; the preprocessing at least includes color interpolation.
According to an embodiment of the present invention, when the first processing module performs the target detection processing on the first candidate map to obtain the position information of the target object in each video frame in the first video sequence, the first processing module is specifically configured to:
and inputting each first candidate graph into a trained second neural network, and performing target detection processing on each first candidate graph by at least one convolution layer of the second neural network to obtain the position information of a target object in each video frame of the first video sequence.
In accordance with one embodiment of the present invention,
the location information includes: coordinates of a designated point on the target object, and a first dimension characterizing the size of the target object;
the second processing module is used for intercepting areas corresponding to the position information from each video frame respectively to obtain at least two first images, and is specifically used for:
for each video frame, determining a reference position required by interception according to the coordinates of the target object in the position information of the video frame and a first size, intercepting a region with a preset size in the video frame by taking the reference position as a reference, and determining the intercepted region as a first image;
Or,
for each video frame, taking the coordinates of the target object in the position information of the video frame as a reference, and cutting out a region with a first size in the video frame; and adjusting the cut-out area from the first size to the target size in a scaling or edge expansion mode, and determining the adjusted area as a first image.
According to an embodiment of the present invention, when the third processing module performs image processing on all the first images in the first data format to obtain the target image in the second data format, the third processing module is specifically configured to:
converting each first image into a second candidate image;
aligning the positions of the target objects in the second candidate graphs to obtain aligned third candidate graphs;
and carrying out image processing on each third candidate graph to obtain the target image, wherein the image processing at least comprises fusion processing.
According to an embodiment of the present invention, the third processing module converts each first image into a second candidate image, aligns a position of a target object in each second candidate image to obtain aligned third candidate images, and when performing image processing on each third candidate image to obtain the target image, the third processing module is specifically configured to:
Inputting each first image into a trained third neural network, carrying out color processing on each first image input by a color processing layer of the third neural network to obtain second candidate images, aligning the positions of a target object in each second candidate image by a sequence alignment sub-network of the third neural network to obtain aligned third candidate images, and carrying out image processing on each third candidate image by a sequence processing sub-network of the third neural network to obtain the target image;
wherein the color processing layer is configured to perform at least one of the following color processes: the method comprises the steps of graying processing, color channel separation processing and color information recombination processing, wherein the color processing layer at least comprises a specified convolution layer, and the step length of convolution kernel movement of the specified convolution layer is an integer multiple of the minimum unit of the color arrangement mode of the video frame.
According to one embodiment of the present invention, when the third processing module converts each first image into the second candidate image, the third processing module is specifically configured to:
converting each first image into a second candidate image by preprocessing; the preprocessing at least includes color interpolation.
According to an embodiment of the present invention, the third processing module aligns a position of a target object in each second candidate image to obtain aligned third candidate images, and when performing image processing on each third candidate image to obtain the target image, the third processing module is specifically configured to:
and inputting each second candidate graph into a trained fourth neural network, aligning the positions of the target objects in each second candidate graph by a sequence alignment sub-network of the fourth neural network to obtain aligned third candidate graphs, and carrying out image processing on each third candidate graph by a sequence processing sub-network of the fourth neural network to obtain the target image.
According to an embodiment of the present invention, the aligning the positions of the target objects in the second candidate graphs by the sequence alignment sub-network to obtain an aligned third candidate graph includes:
the sequence alignment sub-network estimates a motion vector of the target object from the position in each non-reference candidate graph to the position of the reference candidate graph at least through a convolution layer, and performs motion compensation on the target object of each non-reference candidate graph according to the motion vector so as to align the target object in each non-reference candidate graph with the position in the reference candidate graph; the reference candidate graph is one candidate graph in all second candidate graphs, and the non-reference candidate graph is a candidate graph except the reference candidate graph in all second candidate graphs;
the sequence alignment sub-network determines the compensated non-reference candidate map and the reference candidate map as an aligned third candidate map.
According to one embodiment of the present invention, the sequence processing sub-network performs image processing on each third candidate image to obtain the target image, including:
and the sequence processing sub-network performs channel combination on the third candidate image at least through a combination layer to obtain a multi-channel image, and performs image processing on the multi-channel image through at least one convolution layer to obtain the target image.
According to an embodiment of the present invention, when the third processing module performs image processing on the aligned image to obtain the target image, the third processing module is specifically configured to:
mapping each pixel value in each third candidate image into a designated image according to a preset mapping relation, and taking the designated image obtained after mapping as the target image, wherein the resolution of the designated image is larger than that of each aligned image;
or,
and calculating the statistic value of the pixel value at the same position in each third candidate graph, and generating the target image according to the statistic value at each position.
According to an embodiment of the present invention, when the third processing module performs image processing on each third candidate image to obtain the target image, the third processing module is specifically configured to:
respectively carrying out enhancement processing on each third candidate image, and carrying out fusion processing on each third candidate image after the enhancement processing to obtain the target image;
Or,
and carrying out fusion processing on each third candidate image to obtain a reference image, and carrying out enhancement processing on the reference image to obtain the target image.
A third aspect of the invention provides an electronic device comprising a processor and a memory; the memory stores a program that can be called by the processor; wherein the processor, when executing the program, implements the image processing method as described in the foregoing embodiment.
A fourth aspect of the present invention provides a machine-readable storage medium, having stored thereon a program which, when executed by a processor, implements an image processing method as described in the previous embodiments.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
in the embodiment of the invention, the position information of the target object in at least two video frames is acquired from the first video sequence in the acquired first data format, and the region corresponding to each piece of position information is intercepted from the corresponding video frame as a first image. The data format of each first image is the first data format; compared with an image in the second data format that has undergone bit-width compression, image processing, and encoding/decoding, the first images keep the high precision of the original data and contain more time-space-domain original image information. Performing image processing on all the first images in the first data format yields one frame of target image in the second data format; the target image integrates the original information of each first image, realizes the complementation of inter-frame information, and therefore carries richer image information.
On the other hand, in the imaging process the acquired image is inevitably affected by degradation to some extent. However, an image in the first data format has not undergone any nonlinear processing, so the distribution of its degradation is not destroyed. Restoring an image in the first data format therefore makes it easier to invert the degradation, remove the effects caused by noise, blur, and the like, and improve the image quality.
Drawings
FIG. 1 is a flow chart of an image processing method according to an embodiment of the invention;
FIG. 2 is a block diagram of an image processing apparatus according to an embodiment of the present invention;
fig. 3 is a block diagram showing a first processing module in an image processing apparatus according to an embodiment of the present invention;
FIG. 4 is a block diagram of a first neural network invoked by the first processing module of FIG. 3;
FIG. 5 is a block diagram illustrating a first neural network invoked by a first processing module according to another embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating a color processing process according to an embodiment of the invention;
FIG. 7 is a block diagram illustrating a first processing module according to another embodiment of the present invention;
FIG. 8 is a block diagram illustrating a configuration of a preprocessing unit in the first processing module of FIG. 7;
FIG. 9 is a schematic diagram of a color interpolation process according to an embodiment of the invention;
Fig. 10 is a block diagram showing a third processing module in the image processing apparatus according to an embodiment of the present invention;
fig. 11 is a block diagram showing a configuration of a third processing module in an image processing apparatus according to another embodiment of the present invention;
fig. 12 is a block diagram showing the structure of a third processing module in the image processing apparatus according to still another embodiment of the present invention;
fig. 13 is a block diagram showing a third processing module in an image processing apparatus according to still another embodiment of the present invention;
FIG. 14 is a block diagram illustrating a fourth neural network invoked by a network invocation unit according to an embodiment of the present invention;
FIG. 15 is a block diagram of a sequence alignment sub-network in a fourth neural network according to an embodiment of the present invention;
FIG. 16 is a block diagram of a sequence processing sub-network in a fourth neural network according to an embodiment of the present invention;
fig. 17 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various kinds of information, this information should not be limited by these terms. These terms are only used to distinguish one kind of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the invention. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
In order to make the description of the present invention clearer and more concise, some technical terms of the present invention are explained below:
Neural Network: a network technique that abstractly imitates the way the brain processes information and that mainly consists of neurons. Its artificial neurons respond to the surrounding units within part of their coverage area, it performs very well on large-scale image processing, and it may include convolution layers (Convolutional Layer), pooling layers (Pooling Layer), and the like.
Dead pixel correction: remedies pixels of the sensor on the imaging device that receive abnormal information. A simple approach is to filter the data output by the sensor with a 3×3 mean filter, namely
y(i,j) = (1/9) * Σ_{m=-1..1} Σ_{n=-1..1} x(i+m, j+n)
where x(i,j) is the pixel value to be filtered and y(i,j) is the filtered pixel value.
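As a rough illustration only, the sketch below applies such a 3×3 mean filter with NumPy; the function name, the replicate padding at the border, and the float output type are assumptions rather than part of the patent.

```python
import numpy as np

def mean_filter_3x3(raw: np.ndarray) -> np.ndarray:
    """Hypothetical 3x3 mean filter for dead-pixel correction of single-channel sensor data."""
    # Replicate the border so every pixel has a full 3x3 neighbourhood (edge policy is an assumption).
    padded = np.pad(raw.astype(np.float32), 1, mode="edge")
    out = np.zeros(raw.shape, dtype=np.float32)
    h, w = raw.shape
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            # Accumulate shifted neighbours: y(i,j) = (1/9) * sum over x(i+di, j+dj).
            out += padded[1 + di : 1 + di + h, 1 + dj : 1 + dj + w]
    return out / 9.0
```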
Black level correction: removes the effect of the current bias in the camera, specifically realized by the following formula:
IMG_out = IMG_in - V_blc
where IMG_in is the input image, IMG_out is the output image, and V_blc is the black level value.
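A minimal sketch of the same subtraction, assuming a scalar black-level value; clipping at zero is an added assumption:

```python
import numpy as np

def black_level_correction(img_in: np.ndarray, v_blc: float) -> np.ndarray:
    """IMG_out = IMG_in - V_blc, clipped to non-negative values (clipping is an assumption)."""
    return np.clip(img_in.astype(np.float32) - v_blc, 0.0, None)
```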
Brightness adjustment: adaptively adjusts the brightness of the image to an appropriate value and improves the contrast of the image. It can be realized by adaptive gain, curve mapping, or a combination of gain and curve mapping; for example, adaptive gain can be implemented as
IMG_out = (M / mean(IMG_in)) * IMG_in
where IMG_in is the input image, IMG_out is the output image, mean(IMG_in) is the average value of the input image, and M is a preset appropriate value.
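The adaptive-gain principle can be sketched as a single global gain that pulls the image mean toward the preset value M; the exact formula and value ranges used in practice are assumptions here.

```python
import numpy as np

def adaptive_gain(img_in: np.ndarray, m: float = 128.0) -> np.ndarray:
    """Scale the image so that its mean moves toward the preset value M (illustrative only)."""
    gain = m / max(float(img_in.mean()), 1e-6)  # guard against a nearly black input
    return np.clip(img_in.astype(np.float32) * gain, 0.0, 255.0)  # 8-bit range is an assumption
```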
Color correction: corrects the color cast of the camera sensor, usually by applying a color correction matrix. The calculation formula is:
y = x·Mat + Offset
where x is the input, y is the output, and Mat and Offset are the color correction matrix and offset, respectively, both of which can be obtained through calibration.
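A sketch of applying y = x·Mat + Offset to every pixel of an RGB image; the example matrix and offset values are placeholders, since real values would come from calibration as noted above.

```python
import numpy as np

def color_correction(rgb: np.ndarray, mat: np.ndarray, offset: np.ndarray) -> np.ndarray:
    """Apply y = x . Mat + Offset to every pixel of an (H, W, 3) RGB image."""
    h, w, _ = rgb.shape
    flat = rgb.reshape(-1, 3).astype(np.float32)
    corrected = flat @ mat + offset  # each pixel is treated as a 1x3 row vector
    return corrected.reshape(h, w, 3)

# Placeholder calibration values (assumptions, for illustration only).
example_mat = np.array([[1.2, -0.1, -0.1],
                        [-0.1, 1.2, -0.1],
                        [-0.1, -0.1, 1.2]], dtype=np.float32)
example_offset = np.zeros(3, dtype=np.float32)
```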
Format conversion: adapts the output to different back-end devices by converting the current image format into the format required by the back-end device. Taking the conversion from RGB format to YUV format as an example, it can be realized with the following formulas:
Y = 0.299*R + 0.587*G + 0.114*B
U = -0.169*R - 0.331*G + 0.5*B
V = 0.5*R - 0.419*G - 0.081*B.
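The RGB-to-YUV conversion above translates directly into code; a sketch using the coefficients given in the text:

```python
import numpy as np

def rgb_to_yuv(rgb: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) RGB image to YUV with the coefficients listed above."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.169 * r - 0.331 * g + 0.5 * b
    v = 0.5 * r - 0.419 * g - 0.081 * b
    return np.stack([y, u, v], axis=-1)
```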
the image processing method according to the embodiment of the present invention is described in more detail below, but is not limited thereto.
In one embodiment, referring to FIG. 1, there is shown an image processing method of an embodiment of the present invention, which may include the steps of:
s100: acquiring position information of the same target object in at least two video frames of a video sequence from a first video sequence in an acquired first data format;
s200: intercepting areas corresponding to the position information from each video frame respectively to obtain at least two first images;
S300: and carrying out image processing on all the first images in the first data format to obtain target images in a second data format, wherein the second data format is suitable for displaying and/or transmitting the target images.
In an embodiment of the present invention, the image processing method may be applied to an imaging apparatus, which may be a camera, and the steps S100 to S300 described above are performed by the camera. The video sequence in the first data format may be a video sequence obtained by the imaging device continuously acquiring images at a frequency.
The first data format refers to the original data format in which the image sensor converts the captured light source signals into digital signals, the original data being sensed data containing signals from one or more spectral bands. The raw data may include sensed data obtained by sampling optical signals in the wavelength band from 380 nm to 780 nm and/or from 780 nm to 2500 nm, for example the RAW image signal obtained by an RGB sensor.
An image in the first data format is the data obtained when the imaging device converts the collected light source signal into a digital signal. Specifically, the principle of image acquisition by an imaging device is generally as follows: the light source signal is collected, converted into an analog signal, and then converted into a digital signal; the digital signal is input to a processing chip for processing (which may include bit-width clipping, image processing, encoding and decoding, etc.) to obtain data in the second data format, and the data in the second data format is transmitted to a display device for display or to other devices for processing. An image in the first data format is therefore the image obtained when the collected light source information is converted into a digital signal; it has not been processed by the processing chip, and compared with an image in the second data format that has undergone bit-width clipping, image processing, and encoding/decoding, it contains richer image information.
In step S100, position information of the same target object in at least two video frames of the video sequence is acquired from a first video sequence in an acquired first data format.
The first video sequence may include a plurality of video frames, and target detection may be performed on each video frame. When at least two video frames include the same target object, those video frames may be used as the video frames of interest, and the position information of the target object in each of them can be determined, so that the position information of the same target object in at least two video frames of the video sequence is obtained. The target object is an object of interest whose imaging quality is desired to be improved.
The location information of the target object in the video frame may include: coordinates of feature points of the target object in the video frame, and a size of the target object in the video frame; or coordinates of a start point and an end point of the target object detection frame, and the like. The positional information is not particularly limited as long as it is capable of locating a target object in a video frame.
The type of target object is not limited, such as text, vehicles, license plates, buildings, etc., and the shape and size are also not limited. The video frames in the first data format can be converted into common data capable of performing target detection by preprocessing, and then the target detection is performed; the target detection can also be directly performed on the video frame in the first data format to obtain the position information, and the specific implementation mode is not limited.
Next, step S200 is performed to extract at least two first images from the respective video frames by capturing the regions corresponding to the respective position information.
Each video frame is an image in the first data format acquired by the imaging device, that is, an original image that has not undergone the processing of losing the original image information, and is not an image that has undergone the processing in order to acquire the position information of the target object in step S100.
For each video frame, according to the position information of the target object in the video frame, a region corresponding to the position information can be intercepted from the video frame, and a first image is obtained. In this way, at least two first images can be obtained. Since the position information of the target object is obtained from the video frame, the region corresponding to the position information in the video frame is the region where the target object is located, and thus the target object is included in the first image.
Since the first image is an area taken from the acquired video frame in the first data format, the data format of the first image is also the first data format, i.e. the original data format in which the image sensor converts the captured light source signal into a digital signal, comprising the original image information.
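To make step S200 concrete, the sketch below cuts a fixed-size window centred on the detected coordinates out of a single-channel RAW frame; the centre-plus-size convention for the position information and the clamping at the frame border are assumptions.

```python
import numpy as np

def crop_first_image(raw_frame: np.ndarray, cx: int, cy: int, size: int) -> np.ndarray:
    """Cut a size x size region around (cx, cy) from a RAW frame, clamped to the frame border."""
    h, w = raw_frame.shape[:2]
    top = min(max(cy - size // 2, 0), max(h - size, 0))
    left = min(max(cx - size // 2, 0), max(w - size, 0))
    # The crop stays in the first (RAW) data format: no interpolation or bit-width compression.
    return raw_frame[top:top + size, left:left + size]
```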
Then, step S300 is executed, where all the first images in the first data format are subjected to image processing to obtain a target image in a second data format, where the second data format is suitable for displaying and/or transmitting the target image.
The image processing of step S300 may include image quality enhancement processing including: enhancement in terms of brightness, color, sharpness, resolution, dynamic range, etc. Of course, the image processing includes at least a fusion process for integrating the multi-frame image information. In the fusion process, the method can be combined with the technologies of super resolution, denoising, deblurring, dynamic range adjustment and the like to realize image restoration and improve the image quality in the fusion process, and is not limited herein.
Of course, the image processing may also include other image preprocessing and post-processing operations, for example, the preprocessing may include dead pixel correction, black level correction, white balance correction, etc., the post-processing may include brightness adjustment, color correction, etc., and may also include color interpolation, etc., and the specific processing manner is not limited thereto.
After image processing is performed on at least two first images, the imaging device may output a frame of high-quality target image in a second data format, where the second data format is a data format suitable for displaying and/or transmitting the target image, for example, may be an RGB format or a YUV format.
Through the image processing, the complementary information of at least two first images is integrated, the information quantity of the images is improved by means of the inter-frame complementary information, the enhancement of the images is realized, the target images with better image quality are obtained, and favorable conditions are provided for follow-up intelligent recognition.
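As a toy stand-in for the fusion step, per-pixel averaging of already-aligned first images illustrates how inter-frame information is combined; the patent's actual fusion is performed by neural networks, so this placeholder is purely an assumption.

```python
import numpy as np

def fuse_by_averaging(aligned_images: list) -> np.ndarray:
    """Fuse aligned, same-size images into one frame by per-pixel averaging (illustrative only)."""
    stack = np.stack([np.asarray(img, dtype=np.float32) for img in aligned_images], axis=0)
    return stack.mean(axis=0)  # one statistic computed over the same pixel position in each frame
```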
In the embodiment of the invention, the position information of the target object in at least two video frames is acquired from the first video sequence in the acquired first data format, and the region corresponding to each piece of position information is intercepted from the corresponding video frame as a first image. The data format of each first image is the first data format; compared with an image in the second data format that has undergone bit-width compression, image processing, and encoding/decoding, the first images keep the high precision of the original data and contain more time-space-domain original image information. Performing image processing on all the first images in the first data format yields one frame of target image in the second data format; the target image integrates the original information of each first image, realizes the complementation of inter-frame information, and therefore carries richer image information.
On the other hand, in the imaging process the acquired image is inevitably affected by degradation to some extent. However, an image in the first data format has not undergone any nonlinear processing, so the distribution of its degradation is not destroyed. Restoring an image in the first data format therefore makes it easier to invert the degradation, remove the effects caused by noise, blur, and the like, and improve the image quality.
In one embodiment, the above-described method flow may be performed by an image processing device, which may be a device in a video camera. As shown in fig. 2, the image processing apparatus 100 mainly includes 3 modules: a first processing module 101, a second processing module 102 and a third processing module 103. The first processing module 101 is configured to perform the step S100, the second processing module 102 is configured to perform the step S200, and the third processing module 103 is configured to perform the step S300.
As shown in fig. 2, a first video sequence in a first data format acquired by the imaging device is simultaneously input to the first processing module 101 and the second processing module 102, and may be input once every time a video frame is acquired, for example, an acquired T-moment video frame is simultaneously input to the first processing module 101 and the second processing module 102; the first processing module 101 performs target detection on the received video frame to obtain position information of a target object on the input video frame, and outputs the position information to the second processing module 102; then, the second processing module 102 intercepts an area corresponding to the position information on the inputted video frame in the first data format according to the position information to obtain a first image, and stores the first image into a cache; finally, the first image from the time T-N to the time T in the buffer is used as the input of the third processing module 103 (T-N is the time of a certain video frame acquired before the time T), and the image processing is performed on the first image in the first data format with high bit width, so as to output the target image in the second data format with high quality.
The first processing module 101 at least includes a target detection process, and may also include a series of processes such as target tracking, target scoring, target capturing, and the like, and finally outputs the position information of the detected target object on the input video frame. The first processing module 101 may be implemented by a conventional method, a deep learning technology, or the like, and the detected target object may include a license plate, a vehicle, an animal, or the like, and the manner adopted by the first processing module 101 and the target object are not limited herein.
The first processing module 101 inputs a first video sequence in a first data format, and may convert the first data format into an input data format required for object detection before the object detection, so as to ensure the performance of the object detection.
In one embodiment, in step S100, acquiring, from a first video sequence in an acquired first data format, position information of a same target object in at least two video frames of the video sequence, includes the steps of:
s101: converting each video frame in the first video sequence into a first candidate image, and performing target detection processing on each first candidate image to obtain the position information of a target object in each video frame of the first video sequence;
S102: and selecting the position information of the target object in at least two video frames from the detected position information.
In step S101, each video frame in the first video sequence is converted into a first candidate image, so that the first candidate image can be suitable for target detection, and then target detection processing is performed on each first candidate image, so as to ensure the performance of target detection. The first candidate image may be a feature image of the video frame, or may be an image of the video frame after a certain process, and the specific form is not limited.
And carrying out target detection processing on each first candidate graph, and determining the position information of the target object in the video frames from the first candidate graph, so that the position information of the target object in each video frame of the first video sequence can be obtained. Of course, if the target object does not exist in a video frame, the target object is not detected, and the corresponding position information is not obtained.
In step S102, position information of the target object in at least two video frames is selected from the detected position information. The position information of the target object detected on the video frame is taken as the position information of the target object.
In this embodiment, in step S200, after the target detection is completed by using the first candidate map, the region is not cut out from the first candidate map, but is cut out from the video frame in the first data format, so as to avoid the loss of the original image information.
In one embodiment, referring to fig. 3, the first processing module 101 may include a target detection unit 1011, and the step S101 may be performed by the target detection unit 1011.
In one embodiment, converting each video frame in the first video sequence into a first candidate image, performing object detection processing on each first candidate image to obtain position information of a target object in each video frame of the first video sequence, including:
inputting the first video sequence into a trained first neural network, performing color processing on each video frame in the first video sequence by a color processing layer of the first neural network to obtain first candidate graphs, and performing target detection processing on each first candidate graph by at least one convolution layer of the first neural network to obtain position information of a target object in each video frame of the first video sequence;
the color processing layer is used for performing at least one of the following color processing: the method comprises the steps of graying processing, color channel separation processing and color information recombination processing, wherein the color processing layer at least comprises a specified convolution layer, and the step length of convolution kernel movement of the specified convolution layer is an integer multiple of the minimum unit of the color arrangement mode of the video frame.
In the present embodiment, the above step S101 is performed by the target detection unit 1011, and the target detection unit 1011 is implemented by a deep learning technique. The first neural network may be preset in the target detection unit 1011, and the target detection unit 1011 is invoked locally when needed; alternatively, the first neural network may be set in advance in another unit or another device, and the object detection unit 1011 is invoked from the outside when necessary.
As one embodiment of the first neural network, referring to fig. 4, the first neural network 401 may include a color processing layer and at least one convolution layer Conv. And performing color processing on each video frame in the first video sequence through a color processing layer to obtain a first candidate graph capable of performing target detection. And performing target detection processing on each first candidate graph through at least one convolution layer Conv to obtain the position information of the target object in each video frame of the first video sequence.
As another embodiment of the first neural network, referring to fig. 5, the first neural network may include a color processing layer, a convolution layer Conv, a pooling layer Pool … convolution layer Conv, a pooling layer Pool, a full connection layer FC, a frame regression layer BBR. And performing color processing on each video frame in the first video sequence through a color processing layer to obtain a first candidate graph capable of performing target detection. And performing target detection processing on each first candidate graph through a convolution layer Conv, a pooling layer Pool … convolution layer Conv, a pooling layer Pool, a full connection layer FC and a frame regression layer BBR to obtain the position information of a target object in each video frame of the first video sequence.
The color processing layer is used for performing color processing on the image in the first data format, including graying processing, color channel separation processing, color information recombination processing, and the like, so that the network can extract information from the video frame in the first data format more effectively and the distinguishability of the data features is improved. The color processing layer comprises at least one specified convolution layer, and the step length of the convolution kernel movement of the specified convolution layer is an integer multiple of the minimum unit of the color arrangement mode of the video frame. Taking color channel separation as an example, the processing procedure is shown in fig. 6: the differently colored channels in the input video frame are arranged in an interleaved manner, and filters with kernels [1,0;0,0], [0,1;0,0], [0,0;1,0], and [0,0;0,1] filter the input video frame in turn with a step length of 2, thereby separating the color channels. Of course, the color processing layer may also include deconvolution layers, merging layers, etc.; the specific layer structure is not limited.
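A sketch of the channel-separation case: filtering a Bayer-pattern RAW frame with those four 2×2 kernels at stride 2 is equivalent to taking every other pixel of each phase, which the code below does directly; the RGGB layout named in the comments is an assumption, since the text does not fix the sensor pattern.

```python
import numpy as np

def separate_bayer_channels(raw: np.ndarray) -> np.ndarray:
    """Split a Bayer-pattern RAW frame (H, W) into four half-resolution phase channels (H/2, W/2, 4).

    Equivalent to filtering with the 2x2 kernels [1,0;0,0], [0,1;0,0], [0,0;1,0], [0,0;0,1]
    at a step length of 2. Which phase is R, G, or B depends on the sensor layout (RGGB assumed).
    """
    c00 = raw[0::2, 0::2]  # kernel [1,0;0,0]
    c01 = raw[0::2, 1::2]  # kernel [0,1;0,0]
    c10 = raw[1::2, 0::2]  # kernel [0,0;1,0]
    c11 = raw[1::2, 1::2]  # kernel [0,0;0,1]
    return np.stack([c00, c01, c10, c11], axis=-1)
```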
The function of the convolution layer Conv is in fact a filtering process, and the implementation of a convolution layer can be expressed by the following formula:
F_i(I1) = g(W_i * F_{i-1}(I1) + B_i)
where F_i(I1) is the output of the current convolution layer, F_{i-1}(I1) is the input of the current convolution layer, * denotes the convolution operation, W_i and B_i are respectively the weight coefficients and offset coefficients of the convolution filter of the current convolution layer, and g() represents the activation function; when the activation function is ReLU, g(x) = max(0, x). The convolution layer outputs a feature map.
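The per-layer formula F_i(I1) = g(W_i * F_{i-1}(I1) + B_i) can be sketched with a naive single-channel filter plus ReLU; a real network would use a deep-learning framework, and the "valid" border handling here is an assumption.

```python
import numpy as np

def conv_layer(feature: np.ndarray, kernel: np.ndarray, bias: float) -> np.ndarray:
    """Naive single-channel sliding-window filter followed by ReLU: g(W * F + B) with g = ReLU."""
    kh, kw = kernel.shape
    h, w = feature.shape
    out = np.zeros((h - kh + 1, w - kw + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(feature[i:i + kh, j:j + kw] * kernel) + bias
    return np.maximum(out, 0.0)  # ReLU activation g(x) = max(0, x)
```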
The pooling layer Pool is a special downsampling layer, namely, the input feature images are compressed and aggregated, the pooling window is assumed to be N1×N1, when the maximum pooling is used, namely, the pooling window is adopted to compress the input feature images with the step length of N1, the maximum value in the pooling window is taken as the value of the corresponding position of the output feature images, and the specific formula is as follows:
F_i(I) = maxpool(F_{i-1}(I))
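A compact sketch of this max-pooling step (window and stride both N1; names are illustrative):

```python
import numpy as np

def max_pool(feat, n1):
    """Compress the input feature map with an N1 x N1 window at stride N1,
    keeping the maximum value of each window."""
    h = feat.shape[0] // n1 * n1
    w = feat.shape[1] // n1 * n1
    blocks = feat[:h, :w].reshape(h // n1, n1, w // n1, n1)
    return blocks.max(axis=(1, 3))

pooled = max_pool(np.random.rand(8, 8), 2)   # -> 4 x 4 feature map
```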
the fully connected layer FC can be regarded as a convolution layer with a filter window of 1×1, and the specific implementation is similar to convolution filtering, and the expression is as follows:
F_i(I2) = g( Σ_{m=1..R} Σ_{n=1..C} W_i(m,n) · F_{i-1}(I2(m,n)) + B_i )
where F_{i-1}(I2(m,n)) is the input of the fully connected layer, F_i(I2) is the output of the fully connected layer, R and C are the width and height of the input feature, W_i(m,n) and B_i are respectively the connection weights and the bias of the current fully connected layer, and g() represents the activation function.
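Read numerically, one output neuron of the fully connected layer weights every input position and adds a bias before the activation; the sketch below shows a single output neuron under that reading (names are illustrative):

```python
import numpy as np

def fully_connected(prev_feat, weight, bias):
    """F_i(I2) = g(sum_{m,n} W_i(m,n) * F_{i-1}(I2(m,n)) + B_i) with g = ReLU;
    prev_feat and weight share the same R x C shape; one output neuron shown."""
    z = np.sum(weight * prev_feat) + bias
    return max(z, 0.0)

out = fully_connected(np.random.rand(4, 4), np.random.rand(4, 4), 0.05)
```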
The frame regression layer BBR is used to find a mapping so that the window P output by the fully connected layer is mapped to a window G′ that is closer to the real window G. The regression is typically implemented by translating or scaling the window P. Let the coordinates of the window P output by the fully connected layer be (x1, x2, y1, y2), and let the transformed window coordinates be (x3, x4, y3, y4). If the transformation is a translation with translation scale (Δx, Δy), the coordinate relationship before and after the translation is:
x3 = x1 + Δx
x4 = x2 + Δx
y3 = y1 + Δy
y4 = y2 + Δy
if the transformation is scaling transformation, the scaling scale in the X, Y direction is dx and dy respectively, and the coordinate relation before and after transformation is as follows:
x4 - x3 = (x2 - x1) * dx
y4 - y3 = (y2 - y1) * dy
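The two window transformations can be sketched directly on the coordinates (hypothetical helpers; in the scaling case, keeping the starting corner fixed is one possible choice, since the formulas above only constrain the width and height):

```python
def translate_window(x1, x2, y1, y2, dx_shift, dy_shift):
    """Translation regression: shift the window P by (Δx, Δy)."""
    return x1 + dx_shift, x2 + dx_shift, y1 + dy_shift, y2 + dy_shift

def scale_window(x1, x2, y1, y2, dx, dy):
    """Scaling regression: stretch the window extents by dx, dy while
    keeping the starting corner fixed (one possible choice)."""
    x3 = x1
    x4 = x1 + (x2 - x1) * dx
    y3 = y1
    y4 = y1 + (y2 - y1) * dy
    return x3, x4, y3, y4

print(translate_window(10, 50, 20, 60, 5, -3))
print(scale_window(10, 50, 20, 60, 1.2, 0.8))
```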
it will be appreciated that the first neural network described above is merely an example and is not particularly limited thereto, as convolutional layers, and/or pooling layers, and/or other layers may be reduced or added, for example.
For training of the first neural network, video frame samples and their corresponding position information can be acquired as a training sample set, and the model can be trained with the video frame samples as input and the corresponding position information as output, so as to obtain the trained first neural network.
In one embodiment, referring to fig. 7, the first processing module 101 may include a preprocessing unit 1012, a target detection unit 1013, a target tracking unit 1014, and a target capturing unit 1015, where the step S101 may be performed by the preprocessing unit 1012 and the target detection unit 1013, and the step S102 may be performed by the target tracking unit 1014 and the target capturing unit 1015.
In one embodiment, in step S101, converting each video frame in the first video sequence into the first candidate image may be performed by the preprocessing unit 1012, including:
Converting each video frame in the video sequence into a first candidate graph by adopting a preprocessing mode; the preprocessing mode at least comprises: color interpolation.
Of course, the preprocessing method may also include other methods, such as white balance correction, curve mapping, and the like.
As an embodiment of the preprocessing unit 1012, referring to fig. 8, the preprocessing unit 1012 may include a white balance correction subunit, a color interpolation subunit, and a curve mapping subunit, and the video frame is processed by the white balance correction subunit, the color interpolation subunit, and the curve mapping subunit in sequence to obtain a first candidate map.
The white balance correction subunit is configured to perform white balance correction. White balance correction removes the color cast of the image caused by ambient light so as to restore the original color information of the image, and is generally performed by applying gain factors R_gain, G_gain, B_gain to adjust the corresponding R, G, B components:
R′ = R * R_gain
G′ = G * G_gain
B′ = B * B_gain
where R, G, B are the color components of the input image IMG_in of the white balance correction subunit, and R′, G′, B′ are the color components of the output image IMG_awb of the white balance correction subunit.
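A minimal sketch of applying such gains to an H x W x 3 image (the gain values in the usage line are arbitrary examples):

```python
import numpy as np

def white_balance(img_in, r_gain, g_gain, b_gain):
    """Multiply the R, G, B planes of an H x W x 3 image by per-channel gains."""
    gains = np.array([r_gain, g_gain, b_gain], dtype=np.float64)
    return np.clip(img_in.astype(np.float64) * gains, 0, 255)

img_awb = white_balance(np.random.randint(0, 256, (4, 4, 3)), 1.8, 1.0, 1.4)
```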
The color interpolation subunit is configured to perform color interpolation. Color interpolation converts the single-channel image into an RGB three-channel image. Taking the nearest-neighbor interpolation method as an example, for the single-channel image the pixels whose corresponding colors are missing are directly filled with the nearest-neighbor color pixels, so that each pixel contains the three RGB color components; the specific interpolation can be seen in fig. 9 and is not repeated here.
The curve mapping subunit is configured to perform curve mapping. Curve mapping enhances the brightness and contrast of the image; Gamma curve mapping is commonly used, that is, the image is mapped according to a Gamma table, and the formula is as follows:
IMG_gamma(i, j) = Gamma(IMG_cfa(i, j))
where IMG_cfa(i, j) is the image before curve mapping and IMG_gamma(i, j) is the image after curve mapping.
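A sketch of such table-based mapping, assuming 8-bit pixel data and a Gamma of 2.2 (both assumptions for illustration):

```python
import numpy as np

def gamma_map(img_cfa, gamma=2.2):
    """Map each 8-bit pixel through a precomputed Gamma lookup table."""
    table = np.array([((v / 255.0) ** (1.0 / gamma)) * 255.0
                      for v in range(256)], dtype=np.uint8)
    return table[img_cfa]

img_gamma = gamma_map(np.random.randint(0, 256, (4, 4), dtype=np.uint8))
```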
In one embodiment, in step S101, performing object detection processing on the first candidate image to obtain location information of the object in each video frame in the first video sequence, which may be completed by the object detection unit 1013, including:
and inputting each first candidate graph into a trained second neural network, and performing target detection processing on each first candidate graph by at least one convolution layer of the second neural network to obtain the position information of a target object in each video frame of the first video sequence.
The second neural network may be preset in the target detection unit 1013, and the target detection unit 1013 is called locally when necessary; alternatively, the second neural network may be preset in another unit or another device, and the object detection unit 1013 is invoked from the outside when necessary.
In this embodiment, each video frame in the first video sequence is converted into a first candidate image on which object detection can be performed, and this conversion is performed outside the second neural network; the second neural network is mainly responsible for performing the object detection processing on each first candidate image, so as to obtain the position information of the object in each video frame of the first video sequence.
The second neural network comprises at least one convolution layer, the number of convolution layers being unlimited. Of course, the structure of the second neural network is not limited thereto, and may include a pooling layer, and/or other layers.
In one embodiment, selecting the position information of the target object in at least two video frames from the detected position information may be performed by the target tracking unit 1014 and the target capturing unit 1015, including:
tracking the detected position information to obtain the position information of the same target object in each video frame, and calculating the score of the target object in each video frame, wherein the score is used for evaluating the quality of the target object in the video frame and takes into account the pose, shape, size, sharpness and the like of the target object;
and selecting the position information of the target object in at least two video frames from the detected position information according to the scores of the target object.
The score may be determined according to the pose, size, sharpness, etc. of the target object in the video frame, and is not particularly limited as long as the quality of the target object in the video frame can be evaluated.
When the position information of the target object in at least two video frames is selected from the detected position information according to the scores of the target object, the video frames whose scores are higher than a designated score, or the M highest-scoring video frames in the first video sequence (with M greater than 1), may be taken as the selected video frames, and the position information of the target object in the selected video frames is the selected position information.
Taking the first processing module 101 in fig. 7 as an example for further explanation, the video frame in the first data format acquired by the imaging device is first converted by the preprocessing unit 1012 into a first candidate image suitable for the target detection processing, where the first candidate image is in a data format suitable for input to the target detection unit 1013, such as the second data format; the target detection unit 1013 performs target detection processing on the first candidate image, so that each target (which may be the area where the target object is located) on the video frame and its position information can be output. Then, the target tracking unit 1014 tracks and evaluates each target object and records the position information and score of each target object; when a certain target object no longer appears in a video frame, tracking of that target object ends, yielding the position information and score of the same target object in each video frame. The target snapshot unit 1015 may select a target and its position information according to a preset selection policy, and the video frames containing the selected target are the selected video frames.
The selection policy may be set to select a target with better quality and output the target and its position information, and of course, the selection policy may also be set to output the target and its position information every frame, and the selection policy is not limited.
Optionally, the target detection unit 1013, the target tracking unit 1014, and the target snap-shot unit 1015 may each implement a corresponding function using a neural network, for example, at least one convolution layer.
In one embodiment, the location information may include: coordinates of a designated point on the target object, and a first dimension characterizing the size of the target object.
The input of the second processing module 102 is a video frame in the first data format and the position information, on that video frame, of each target object output by the first processing module 101. The second processing module 102 intercepts the area corresponding to the position information in the input video frame according to the position information of the target object, so as to obtain the corresponding first images, and each first image can be stored in a cache.
There are two interception principles. The first principle is that the sizes of the intercepted areas of the same target object are kept consistent during interception; the second principle is to intercept the region according to the detection frame and then unify the region sizes of the same target object by methods such as edge expansion and scaling.
In the case of the first principle, in step S200, the capturing, from each video frame, the area corresponding to each position information to obtain at least two first images may include:
S201: and for each video frame, determining a reference position required by interception according to the coordinates of the target object in the position information of the video frame and a first size, intercepting an area with a preset size in the video frame by taking the reference position as a reference, and determining the intercepted area as a first image.
Assume that after the video frame at time t passes through the first processing module 101, the position information of the output target object is [x_t, y_t, h_t, w_t], where x_t, y_t are the coordinates of a specified point on the target object, such as the starting point of the detection frame, and h_t, w_t are the first size of the target object, such as the height and width of the detection frame. Let the preset size be height H and width W, and let the height and width of the video frame at time t be M_t, N_t respectively. The intercepted area is the region of height H and width W in the video frame whose starting point is (x_t - a_t, y_t - b_t).
If x_t - a_t < 0 or y_t - b_t < 0, then x_t - a_t is set to 0 and y_t - b_t is set to 0; if x_t - a_t > M_t - H or y_t - b_t > N_t - W, then x_t - a_t is set to M_t - H and y_t - b_t is set to N_t - W.
Here (x_t - a_t, y_t - b_t) can be used as the reference position, and the values of a_t, b_t can be determined according to the requirements of the interception mode. If the center of the detection frame is taken as the center of the intercepted area, then a_t = (H - h_t)/2 and b_t = (W - w_t)/2; if the starting point of the detection frame is taken as the starting point of the intercepted area, then a_t = 0, b_t = 0, and when x_t + h_t > H or y_t + w_t > W, the starting point of the intercepted area is reset.
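A sketch of this fixed-size interception with border clamping is given below (variable names mirror the text; the centered choice of a_t, b_t is used; x/h index rows and y/w index columns):

```python
import numpy as np

def crop_fixed(frame, x_t, y_t, h_t, w_t, H, W):
    """Cut an H x W region so that the detection frame (x_t, y_t, h_t, w_t)
    is centered in it, clamping the start point to stay inside the frame."""
    M_t, N_t = frame.shape[:2]
    a_t = (H - h_t) // 2
    b_t = (W - w_t) // 2
    x0 = min(max(x_t - a_t, 0), M_t - H)   # clamp: 0 <= x0 <= M_t - H
    y0 = min(max(y_t - b_t, 0), N_t - W)   # clamp: 0 <= y0 <= N_t - W
    return frame[x0:x0 + H, y0:y0 + W]

first_image = crop_fixed(np.zeros((1080, 1920)), 400, 600, 120, 80, 256, 256)
```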
In the case of the second principle, in step S200, the capturing, from each video frame, the area corresponding to each position information to obtain at least two first images may include:
S201: for each video frame, taking the coordinates of the target object in the position information of the video frame as a reference, and cutting out a region with a first size in the video frame; and adjusting the cut-out area from the first size to the target size in a scaling or edge expansion mode, and determining the adjusted area as a first image.
Assume that after the video frame at time t passes through the first processing module 101, the position information of the output target object is [x_t, y_t, h_t, w_t], where x_t, y_t are the coordinates of a specified point on the target object, such as the starting point of the detection frame, h_t, w_t are the first size of the target object, such as the height and width of the detection frame, and the height and width of the video frame at time t are M_t, N_t respectively. The intercepted area is the region of height h_t and width w_t in the video frame whose starting point is (x_t, y_t).
If x_t < 0 or y_t < 0, then x_t is set to 0 and y_t is set to 0; if x_t > M_t - h_t or y_t > N_t - w_t, then x_t is set to M_t - h_t and y_t is set to N_t - w_t.
After the areas are intercepted, their sizes are unified. This embodiment is described taking scaling as an example: the maximum height and width among the areas can be counted, or a height and width can be set, and taken as the target size; alternatively, a target size can be preset. Each region is then scaled to the target size, and the scaling method can be bilinear interpolation, nearest-neighbor interpolation, etc., which is not limited.
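A sketch of the second principle, cutting the detection-frame-sized region and then scaling it to a unified target size with nearest-neighbor interpolation (one of the options mentioned above; names are illustrative):

```python
import numpy as np

def crop_and_resize(frame, x_t, y_t, h_t, w_t, target_h, target_w):
    """Cut the h_t x w_t region at (x_t, y_t), clamped to the frame, then
    rescale it to target_h x target_w by nearest-neighbor interpolation."""
    M_t, N_t = frame.shape[:2]
    x0 = min(max(x_t, 0), M_t - h_t)
    y0 = min(max(y_t, 0), N_t - w_t)
    region = frame[x0:x0 + h_t, y0:y0 + w_t]
    rows = (np.arange(target_h) * h_t // target_h).astype(int)
    cols = (np.arange(target_w) * w_t // target_w).astype(int)
    return region[np.ix_(rows, cols)]

first_image = crop_and_resize(np.zeros((1080, 1920)), 400, 600, 120, 80, 128, 128)
```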
The third processing module 103 mainly processes the N first images in the first data format and outputs a high-quality target image in the second data format for subsequent display or intelligent analysis, where the processing includes, but is not limited to, adjusting the brightness, color, sharpness, resolution, dynamic range, etc. of the target image.
The third processing module 103 may be implemented by a conventional method, and improves the image quality of the target area under limited resources. Alternatively, all or part of the units of the third processing module 103 may be implemented by using a deep learning technique, so as to reduce error propagation and accumulation between units in the conventional method.
In one embodiment, in step S300, performing image processing on all the first images in the first data format to obtain a target image in the second data format, including:
s301: converting each first image into a second candidate image;
s302: aligning the positions of the target objects in the second candidate graphs to obtain aligned third candidate graphs;
s303: and carrying out image processing on each third candidate graph to obtain the target image, wherein the image processing at least comprises fusion processing.
The positions of the target object in the respective first images may be inconsistent, in which case an alignment operation on the first images is required. In step S301, each first image is converted into a second candidate image that can be used for the alignment operation, so that the second candidate images are suitable for alignment; the alignment operation is then performed on the second candidate images, thereby ensuring the performance of the alignment operation. The second candidate image may be a feature image of the first image, or may be an image obtained from the first image after certain processing, and its specific form is not limited.
In step S302, the positions of the target objects in the second candidate graphs are aligned to obtain an aligned third candidate graph. In this embodiment, the alignment may be implemented by a deep learning manner, or may be implemented by a conventional manner, and the specific alignment manner is not limited as long as the alignment of the target object in the position of each third candidate graph can be ensured.
In step S303, image processing is performed on each third candidate image to obtain the target image, where the image processing includes at least fusion processing. Based on the fusion processing, the information in each third candidate image can be fused to one target image, and the target image is fused with the information quantity of a plurality of frames of images, so that the information quantity is high, and the image quality is improved.
It will be appreciated that when each third candidate image is subjected to image processing, other processing manners may be included in addition to the fusion processing, for example, including: denoising, deblurring, super resolution, dynamic range adjustment, etc., as much as possible, improving image quality, without limitation.
In one embodiment, the third processing module 103 may implement steps S301-S303 using a trained third neural network, including:
Inputting each first image into a trained third neural network, carrying out color processing on each first image input by a color processing layer of the third neural network to obtain second candidate images, aligning the positions of a target object in each second candidate image by a sequence alignment sub-network of the third neural network to obtain aligned third candidate images, and carrying out image processing on each third candidate image by a sequence processing sub-network of the third neural network to obtain the target image;
wherein the color processing layer is configured to perform at least one of the following color processes: the method comprises the steps of graying processing, color channel separation processing and color information recombination processing, wherein the color processing layer at least comprises a specified convolution layer, and the step length of convolution kernel movement of the specified convolution layer is an integer multiple of the minimum unit of the color arrangement mode of the video frame.
In one implementation, referring to fig. 10, the third processing module 103 may include a network invoking unit 1031, where the network invoking unit 1031 may invoke the trained third neural network to implement steps S301-S303 described above. The third neural network may be preset in the network invoking unit 1031, and the network invoking unit 1031 invokes locally when necessary; alternatively, the third neural network may be set in advance in another unit or another device, and the network calling unit 1031 calls from the outside when necessary.
In this manner, the network invoking unit 1031 may input each of the cached first images into the third neural network, perform color processing on the input first images by using the color processing layer of the third neural network to obtain second candidate images capable of performing alignment operation, align the positions of the target objects in each of the second candidate images by using the sequence alignment sub-network of the third neural network to obtain aligned third candidate images, and perform at least fusion processing on each of the third candidate images by using the sequence processing sub-network of the third neural network to obtain the target images, which are used as the output of the third processing module 103. The specific content of the color processing layer may be referred to the related description in the foregoing embodiments, and will not be repeated here.
The third neural network may implement other image processing in addition to the above processing, for example, including: image quality enhancement processing such as denoising, deblurring, super-resolution, dynamic range adjustment and the like is performed, so that the image quality is improved as much as possible; meanwhile, image preprocessing and post-processing may also be included.
The image preprocessing can include image correction processing modes such as dead pixel correction, black level correction, white balance correction and the like; the image post-processing includes curve mapping, color correction, format conversion and other image adjustment processing modes, and corrects and adjusts the color, brightness, format and the like of the image.
Of course, the image preprocessing and post-processing may be implemented in a conventional manner without being implemented in a neural network, and several different implementations are provided below based on the difference in the preprocessing and/or post-processing positions.
In another implementation, referring to fig. 11, the third processing module 103 may include a network invoking unit 1032 and a first processing unit 1033, where the network invoking unit 1032 may invoke the trained third neural network to implement the above steps S301-S303, and the first processing unit 1033 may implement the image post-processing. The third neural network may be preset in the network invoking unit 1032, and the network invoking unit 1032 invokes locally when necessary; alternatively, the third neural network may be preset in another unit or another device, and the network invoking unit 1032 invokes from the outside when necessary.
In this manner, the network invoking unit 1032 may input each of the cached first images into a third neural network, perform color processing on the input first images by using a color processing layer of the third neural network to obtain second candidate images capable of performing alignment operation, align positions of the target object in each of the second candidate images by using a sequence alignment sub-network of the third neural network to obtain aligned third candidate images, and perform at least fusion processing on each of the third candidate images by using a sequence processing sub-network of the third neural network to obtain the target image; the first processing unit 1033 performs post-processing on the target image output by the third neural network.
In this manner, the third neural network may implement other image processing in addition to the above processing, for example, including: image quality enhancement processing such as denoising, deblurring, super resolution, dynamic range adjustment and the like improves the image quality as much as possible. Of course, some image preprocessing may be implemented at the same time, and the image preprocessing may include image correction processing such as dead pixel correction, black level correction, white balance correction, and the like. When the image quality enhancement process is implemented in the third neural network, the first processing unit 1033 may mainly implement an image post-process including an image adjustment process such as curve mapping, color correction, format conversion, and the like, and correct and adjust the color, brightness, format, and the like of the image.
Of course, the image quality enhancement processing such as denoising, deblurring, super resolution, dynamic range adjustment, etc. may also be implemented in the first processing unit 1033, so as to achieve improvement of image quality, where the first processing unit 1033 includes image quality enhancement processing and image post-processing.
In yet another implementation, referring to fig. 12, the third processing module 103 may include a second processing unit 1034, a network invoking unit 1035, and a third processing unit 1036, where the second processing unit 1034 may implement image preprocessing, the network invoking unit 1035 may invoke the trained third neural network to implement steps S301-S303 described above, and the third processing unit 1036 may implement image post-processing. The third neural network may be preset in the network invoking unit 1035, and the network invoking unit 1035 invokes locally when necessary; alternatively, the third neural network may be set in advance in another unit or another device, and the network invoking unit 1035 invokes from the outside when necessary.
In this manner, the second processing unit 1034 may perform preprocessing on the buffered first images, and may include: white balance processing, dead pixel correction, black level correction and the like; the network invoking unit 1035 may input each of the first images after the preprocessing into a third neural network, perform color processing on the input first images by using a color processing layer of the third neural network to obtain second candidate images capable of performing alignment operation, align positions of a target object in each of the second candidate images by using a sequence alignment sub-network of the third neural network to obtain aligned third candidate images, and perform at least fusion processing on each of the third candidate images by using a sequence processing sub-network of the third neural network to obtain the target image; the third processing unit 1036 performs post-processing on the target image output by the third neural network, and may include: brightness adjustment, color correction, format conversion, etc.
In this manner, the third neural network may implement other image processing in addition to the above processing, for example, including: image quality enhancement processing such as denoising, deblurring, super resolution, dynamic range adjustment and the like improves the image quality as much as possible. When the image quality enhancement process is implemented in the third neural network, the third processing unit 1036 may mainly implement image post-processing including image adjustment processes such as curve mapping, color correction, format conversion, and the like, and correct and adjust the color, brightness, format, and the like of the image.
In yet another implementation, referring to fig. 13, the third processing module 103 may include a fourth processing unit 1037 and a network invoking unit 1038, where the fourth processing unit 1037 may implement preprocessing of the image, and the network invoking unit 1038 may invoke the trained third neural network to implement steps S301-S303 described above. The third neural network may be preset in the network invoking unit 1038, and the network invoking unit 1038 invokes locally when necessary; alternatively, the third neural network may be set in advance in another unit or another device, and the network calling unit 1038 calls from the outside when necessary.
In this manner, the fourth processing unit 1037 may perform preprocessing on each of the buffered first images, and may include: white balance processing, dead pixel correction, black level correction and the like; the network invoking unit 1038 may input each of the first images after the preprocessing into a third neural network, perform color processing on the input first images by using a color processing layer of the third neural network to obtain second candidate images capable of performing alignment operation, align positions of the target object in each of the second candidate images by using a sequence alignment sub-network of the third neural network to obtain aligned third candidate images, and perform at least fusion processing on each of the third candidate images by using a sequence processing sub-network of the third neural network to obtain the target image.
In this manner, the third neural network may implement other image processing in addition to the above processing, for example, including: image quality enhancement processing such as denoising, deblurring, super resolution, dynamic range adjustment and the like improves the image quality as much as possible. Of course, some image post-processing can be realized at the same time, the image post-processing includes curve mapping, color correction, format conversion and other image adjustment processing, and the correction and adjustment of the color, brightness, format and the like of the image can be performed.
As an implementation manner of the third neural network, the third neural network may include a color processing layer, a sequence alignment sub-network Align-Net and a sequence processing sub-network VSP-Net, which are sequentially connected, where the color processing layer performs color processing on each input first image to obtain second candidate images capable of performing alignment operation, the sequence alignment sub-network Align-Net aligns positions of the target object in each second candidate image to obtain aligned third candidate images, and the sequence processing sub-network VSP-Net performs color interpolation and fusion processing on each third candidate image to obtain the target image.
In this embodiment, the specific content of the color processing layer may refer to the related descriptions in the foregoing embodiments, which are not repeated here.
In one embodiment, in step S301, converting each first image into a second candidate image includes:
converting each first image into a second candidate image by adopting a preprocessing mode; the preprocessing mode at least comprises: color interpolation.
In the preprocessing method, other preprocessing may be included besides color interpolation, and the specific method is not limited as long as the first image can be converted into the second candidate image suitable for the alignment operation.
In one embodiment, in steps S302 and S303, aligning the positions of the target objects in the second candidate graphs to obtain aligned third candidate graphs, and performing image processing on the third candidate graphs to obtain the target image, where the steps include:
and inputting each second candidate graph into a trained fourth neural network, aligning the positions of the target objects in each second candidate graph by a sequence alignment sub-network of the fourth neural network to obtain aligned third candidate graphs, and carrying out image processing on each third candidate graph by a sequence processing sub-network of the fourth neural network to obtain the target image.
The data format of the second candidate map suitable for performing the alignment operation is a format suitable for input to the fourth neural network, and is not particularly limited.
In one manner, referring to fig. 13, the third processing module 103 may include a fourth processing unit 1037 and a network invoking unit 1038, where the fourth processing unit 1037 may implement the above-mentioned step S301, and the network invoking unit 1038 may invoke the trained fourth neural network to implement the above-mentioned steps S302-S303. The fourth neural network may be preset in the network invoking unit 1038, and the network invoking unit 1038 invokes it locally when necessary; alternatively, the fourth neural network may be set in advance in another unit or another device, and the network invoking unit 1038 calls it from the outside when necessary.
In this manner, the fourth processing unit 1037 may convert each of the buffered first images into a second candidate image capable of undergoing the alignment operation in a preprocessing manner that at least includes color interpolation; before the color interpolation, the fourth processing unit 1037 may also perform image preprocessing methods including dead pixel correction, black level correction, white balance correction and the like, and the data format of the second candidate image is the second data format. The network invoking unit 1038 may input each of the second candidate graphs output by the fourth processing unit 1037 into the fourth neural network, align the position of the target object in each of the second candidate graphs by the sequence alignment sub-network of the fourth neural network to obtain aligned third candidate graphs, and perform fusion processing on each of the third candidate graphs by the sequence processing sub-network of the fourth neural network to obtain the target image.
In this manner, the fourth neural network may implement other image processing in addition to the above processing, for example, including: image quality enhancement processing such as denoising, deblurring, super resolution, dynamic range adjustment and the like improves the image quality as much as possible. Meanwhile, the fourth neural network can also realize image post-processing, wherein the image post-processing comprises curve mapping, color correction, format conversion and other image adjustment processing, and the correction and adjustment of the color, brightness, format and the like of the image are carried out.
Of course, the image post-processing may also not be implemented in the fourth neural network, and several implementations are provided below in which the image post-processing is not implemented in the fourth neural network.
In another implementation, with continued reference to fig. 13, the image post-processing may also be implemented in the fourth processing unit 1037, where the image pre-processing and the image post-processing are combined and implemented in one unit, and the fourth processing unit 1037 may include processes of dead point correction, black level correction, white balance correction, color interpolation, curve mapping, color correction, format conversion, and the like, to correct and convert the color, brightness, format, and the like of the image.
In yet another implementation, like the third processing module 103 of fig. 12, the third processing module 103 may include a second processing unit 1034, a network invoking unit 1035, and a third processing unit 1036. The second processing unit 1034 may convert each of the buffered first images into a second candidate image capable of undergoing the alignment operation in a preprocessing manner that at least includes color interpolation; before the color interpolation, the second processing unit 1034 may also perform image preprocessing methods including dead pixel correction, black level correction, white balance correction, and the like. The network invoking unit 1035 may invoke the trained fourth neural network to implement steps S302-S303 described above, and the third processing unit 1036 may implement image post-processing. The fourth neural network may be preset in the network invoking unit 1035, and the network invoking unit 1035 invokes it locally when necessary; alternatively, the fourth neural network may be set in advance in another unit or another device, and the network invoking unit 1035 calls it from the outside when necessary.
In this manner, the fourth neural network may implement other image processing in addition to the above processing, for example, including: image quality enhancement processing such as denoising, deblurring, super resolution, dynamic range adjustment and the like improves the image quality as much as possible.
As an implementation manner of the fourth neural network, referring to fig. 14, the fourth neural network 500 may include a sequence alignment sub-network Align-Net 501 and a sequence processing sub-network VSP-Net 502, where the sequence alignment sub-network Align-Net 501 aligns the positions of the target objects in the second candidate graphs to obtain aligned third candidate graphs, and the sequence processing sub-network VSP-Net 502 performs fusion processing on the third candidate graphs to obtain the target image.
It will be appreciated that the above-described implementations are merely examples, and may be specifically adapted according to actual needs.
The sequence alignment sub-network can align the positions of the target objects in the second candidate graphs to obtain the aligned third candidate graphs in various ways, such as motion estimation and compensation, filter-kernel alignment, deformable convolution, etc., and one or more of these ways can be selected in practice.
In one embodiment, the aligning the positions of the target objects in the second candidate graphs by the sequence alignment sub-network to obtain an aligned third candidate graph includes:
The sequence alignment sub-network estimates a motion vector of the target object from the position in each non-reference candidate graph to the position of the reference candidate graph at least through a convolution layer, and performs motion compensation on the target object of each non-reference candidate graph according to the motion vector so as to align the target object in each non-reference candidate graph with the position in the reference candidate graph; the reference candidate graph is one candidate graph in all second candidate graphs, and the non-reference candidate graph is a candidate graph except the reference candidate graph in all second candidate graphs;
the sequence alignment sub-network determines the compensated non-reference candidate map and the reference candidate map as an aligned third candidate map.
The reference candidate map may be, for example, a candidate map of the latest time (T time) in the second candidate map (a candidate map converted from the first image captured in the video frame captured at the T time). The aligning operation is to align non-reference candidate pictures in the second candidate picture to the reference candidate picture.
As an implementation of the sequence alignment sub-network, referring to fig. 15, the sequence alignment sub-network Align-Net 501 may include a convolutional layer Conv, a pooling layer pool, a convolutional layer Conv, an upsampling layer UpSample, a convolutional layer Conv, an activation layer Tanh, and a transform layer Warp, which are sequentially connected. And performing motion compensation on the target object of each non-reference candidate graph according to the motion vector through a transformation layer Warp.
Taking the alignment of the second candidate graph at time T-1 (the candidate graph converted from the first image intercepted from the video frame acquired at time T-1) to the second candidate graph at time T as an example, the sequence alignment sub-network Align-Net 501 estimates, at least through the sequentially connected convolution layer Conv, pooling layer Pool, convolution layer Conv, up-sampling layer UpSample, convolution layer Conv and activation layer Tanh, the motion vector of the target object from its position in the second candidate graph at time T-1 to its position in the reference candidate graph, and performs motion compensation on the second candidate graph at time T-1 according to this vector through the transformation layer Warp, so as to align the target object in the second candidate graph at time T-1 with its position in the second candidate graph at time T, thereby obtaining the third candidate graph at time T-1.
Motion compensation is achieved through Warp, and the calculation formula can be as follows:
(x_p′, y_p′) = (x_p + u_p, y_p + v_p)
where (x_p, y_p) are the coordinates of a point p1 on the target object before adjustment, (x_p′, y_p′) are the coordinates of the point p1 after adjustment, and (u_p, v_p) is the motion vector.
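A sketch of this per-pixel warp is given below; for simplicity it uses the equivalent backward (inverse) sampling form with nearest-neighbor rounding and one motion vector per pixel, which are assumptions for illustration:

```python
import numpy as np

def warp(candidate, u, v):
    """Shift every pixel of a non-reference candidate map by its motion
    vector (u, v) so the target object lines up with the reference map;
    implemented as backward sampling: output(p) = input(p - motion)."""
    h, w = candidate.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.round(ys - v).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - u).astype(int), 0, w - 1)
    return candidate[src_y, src_x]

prev = np.random.rand(6, 6)
aligned = warp(prev, u=np.ones((6, 6)), v=np.zeros((6, 6)))
```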
It will be appreciated that the above-described structure of the sequence alignment sub-network is only a preferred embodiment, and that other ways are possible, and is not particularly limited to the above-described structure.
In one embodiment, the sequence processing sub-network performs image processing on each third candidate graph to obtain the target image, including:
And the sequence processing sub-network performs channel combination on the third candidate image at least through a combination layer to obtain a multi-channel image, and performs image processing on the multi-channel image through a convolution layer to obtain the target image.
As an implementation manner of the sequence processing subnetwork, referring to fig. 16, the sequence processing subnetwork VSP-Net502 includes a merging layer Concat and a convolution layer Conv, the merging layer Concat performs channel merging on the third candidate graph to obtain a multi-channel graph, and the convolution layer Conv performs image processing on the multi-channel graph to obtain the target image.
The sequence processing sub-network realizes the fusion of the multi-frame third candidate images through the merging layer Concat and the convolution layer Conv, and the processing extracts and integrates the complementary information among the multi-frame images, so that the information richness of the fused images is improved, and a high-quality target image is obtained.
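A sketch of this merge-then-convolve fusion using a 1x1 convolution, i.e., a per-pixel weighted combination of the stacked frames (the weights are illustrative, not trained values):

```python
import numpy as np

def fuse_frames(third_candidates, weights, bias=0.0):
    """Channel-concatenate the aligned third candidate maps and fuse them
    with a 1x1 convolution: each output pixel is a weighted sum of the
    co-located pixels across all frames."""
    stack = np.stack(third_candidates, axis=0)     # Concat: K x H x W
    w = np.asarray(weights).reshape(-1, 1, 1)      # one weight per frame
    return np.sum(stack * w, axis=0) + bias        # 1x1 Conv -> H x W

frames = [np.random.rand(8, 8) for _ in range(3)]
target = fuse_frames(frames, weights=[0.5, 0.3, 0.2])
```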
It will be appreciated that the above described structure of the sequence handling sub-network is only a preferred embodiment, and that other ways are of course possible. The specific structure of the sequence processing sub-network may be designed according to the actual restoration problem, but at least includes a layer structure for implementing the fusion processing, and is not particularly limited to the above structure.
In one embodiment, in step S303, performing image processing on the aligned image to obtain the target image, including:
Mapping each pixel value in each third candidate image into a designated image according to a preset mapping relation, and taking the designated image obtained after mapping as the target image, wherein the resolution of the designated image is larger than that of each aligned image;
or,
and calculating the statistic value of the pixel value at the same position in each third candidate graph, and generating the target image according to the statistic value at each position.
In this embodiment, steps S301-S303 may not be implemented by a neural network, and the third processing module 103 may include a fifth processing unit and a sixth processing unit (not shown in the figure). The fifth processing unit may convert each of the cached first images into a second candidate image capable of undergoing the alignment operation in a preprocessing manner that at least includes color interpolation; the fifth processing unit may also perform other processes such as white balance correction, dead pixel correction, black level correction, brightness adjustment, color correction, and format conversion. The sixth processing unit may implement steps S302 and S303 described above.
In the image processing of the present embodiment, other image processing manners may be implemented besides fusion processing, including, for example: image quality enhancement processing modes such as demosaicing, denoising, deblurring, super-resolution, dynamic range adjustment and the like can be used for improving the image quality as much as possible. The multi-frame fusion process may generally be performed in conjunction with the image quality enhancement process.
In one mode of jointly implementing multi-frame fusion processing and image quality enhancement processing in this embodiment, each pixel value in each third candidate image is mapped into a specified image according to a preset mapping relationship, and the specified image obtained after mapping is used as the target image, so that the resolution of the image can be improved in the fusion process, and a high-resolution target image is obtained.
In another mode of jointly implementing multi-frame fusion processing and image quality enhancement processing in this embodiment, a statistical value of pixel values at the same position in each third candidate image is calculated, the target image is generated according to the statistical value at each position, and denoising of the image can be implemented in the fusion process, so that a denoised target image is obtained. The statistical value is, for example, a mean value, a median value, or the like, and is not particularly limited.
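A sketch of this statistic-based fusion, with the mean and the median as example statistics:

```python
import numpy as np

def fuse_by_statistic(third_candidates, statistic="mean"):
    """Fuse the aligned candidate maps by computing, at every pixel position,
    a statistic of the co-located pixel values (mean or median shown)."""
    stack = np.stack(third_candidates, axis=0)
    if statistic == "mean":
        return stack.mean(axis=0)
    if statistic == "median":
        return np.median(stack, axis=0)
    raise ValueError("unsupported statistic: " + statistic)

frames = [np.random.rand(8, 8) for _ in range(5)]
denoised = fuse_by_statistic(frames, "median")
```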
In the mode, the complementary information among the multiple frames is extracted and integrated through a certain fusion strategy, so that the information richness of the fused image is improved, the image quality enhancement processing is realized, and the image quality is improved.
In one embodiment, in step S303, performing image processing on each third candidate image to obtain the target image, including:
Respectively carrying out enhancement processing on each third candidate image, and carrying out fusion processing on each third candidate image after the enhancement processing to obtain the target image;
or,
and carrying out fusion processing on each third candidate image to obtain a reference image, and carrying out enhancement processing on the reference image to obtain the target image.
In this embodiment, the target image may be obtained by first performing enhancement processing on each third candidate image and then fusing the enhanced images, or by first fusing the third candidate images into a reference image and then performing enhancement processing on the reference image.
The enhancement processing may be implemented by at least one of the following image quality enhancement processing modes, such as demosaicing, denoising, deblurring, super resolution, dynamic range adjustment, and the like, so as to improve the image quality as much as possible.
The fusion processing can extract and integrate complementary information among the multi-frame images, so that the information richness of the fused images is improved.
Therefore, after enhancement processing and fusion processing are carried out on all the third candidate images, a frame of target image in the second data format is obtained, the target image integrates the original information in each first image, the complementation of the inter-frame information is realized, the image information is richer, and the image quality is higher.
The present invention also provides an image processing apparatus applied to an imaging device, referring to fig. 2, the image processing apparatus 100 may include:
a first processing module 101, configured to obtain, from a first video sequence in an acquired first data format, position information of a same target object in at least two video frames of the video sequence;
the second processing module 102 is configured to intercept areas corresponding to each position information from each video frame, so as to obtain at least two first images;
and the third processing module 103 is used for performing image processing on all the first images in the first data format to obtain target images in a second data format, wherein the second data format is suitable for displaying and/or transmitting the second images.
In one embodiment, the first processing module is specifically configured to, when acquiring, from the first video sequence in the acquired first data format, position information of the same target object in at least two video frames of the video sequence:
converting each video frame in the first video sequence into a first candidate image, and performing target detection processing on each first candidate image to obtain the position information of a target object in each video frame of the first video sequence;
And selecting the position information of the target object in at least two video frames from the detected position information.
In one embodiment, converting each video frame in the first video sequence into a first candidate image, performing object detection processing on each first candidate image to obtain position information of a target object in each video frame of the first video sequence, including:
inputting the first video sequence into a trained first neural network, performing color processing on each video frame in the first video sequence by a color processing layer of the first neural network to obtain first candidate graphs, and performing target detection processing on each first candidate graph by at least one convolution layer of the first neural network to obtain position information of a target object in each video frame of the first video sequence;
the color processing layer is used for performing at least one of the following color processing: the method comprises the steps of graying processing, color channel separation processing and color information recombination processing, wherein the color processing layer at least comprises a specified convolution layer, and the step length of convolution kernel movement of the specified convolution layer is an integer multiple of the minimum unit of the color arrangement mode of the video frame.
In one embodiment, when the first processing module converts each video frame in the first video sequence into the first candidate image, the first processing module is specifically configured to:
Converting each video frame in the video sequence into a first candidate graph by adopting a preprocessing mode; the preprocessing mode at least comprises: color interpolation.
In one embodiment, when the first processing module performs the target detection processing on the first candidate map to obtain the position information of the target object in each video frame in the first video sequence, the first processing module is specifically configured to:
and inputting each first candidate graph into a trained second neural network, and performing target detection processing on each first candidate graph by at least one convolution layer of the second neural network to obtain the position information of a target object in each video frame of the first video sequence.
In one embodiment,
the location information includes: coordinates of a designated point on the target object, and a first dimension characterizing the size of the target object;
the second processing module is used for intercepting areas corresponding to the position information from each video frame respectively to obtain at least two first images, and is specifically used for:
for each video frame, determining a reference position required by interception according to the coordinates of the target object in the position information of the video frame and a first size, intercepting a region with a preset size in the video frame by taking the reference position as a reference, and determining the intercepted region as a first image;
Or,
for each video frame, taking the coordinates of the target object in the position information of the video frame as a reference, and cutting out a region with a first size in the video frame; and adjusting the cut-out area from the first size to the target size in a scaling or edge expansion mode, and determining the adjusted area as a first image.
In one embodiment, when the third processing module performs image processing on all the first images in the first data format to obtain the target image in the second data format, the third processing module is specifically configured to:
converting each first image into a second candidate image;
aligning the positions of the target objects in the second candidate graphs to obtain aligned third candidate graphs;
and carrying out image processing on each third candidate graph to obtain the target image, wherein the image processing at least comprises fusion processing.
In one embodiment, the third processing module converts each first image into a second candidate image, aligns a position of a target object in each second candidate image to obtain an aligned third candidate image, and when performing image processing on each third candidate image to obtain the target image, the third processing module is specifically configured to:
inputting each first image into a trained third neural network, carrying out color processing on each first image input by a color processing layer of the third neural network to obtain second candidate images, aligning the positions of a target object in each second candidate image by a sequence alignment sub-network of the third neural network to obtain aligned third candidate images, and carrying out image processing on each third candidate image by a sequence processing sub-network of the third neural network to obtain the target image;
Wherein the color processing layer is configured to perform at least one of the following color processes: the method comprises the steps of graying processing, color channel separation processing and color information recombination processing, wherein the color processing layer at least comprises a specified convolution layer, and the step length of convolution kernel movement of the specified convolution layer is an integer multiple of the minimum unit of the color arrangement mode of the video frame.
In one embodiment, the third processing module is specifically configured to, when converting each first image into the second candidate image:
converting each first image into a second candidate image by adopting a preprocessing mode; the preprocessing mode at least comprises: color interpolation.
In one embodiment, the third processing module aligns the position of the target object in each second candidate graph to obtain aligned third candidate graphs, and when performing image processing on each third candidate graph to obtain the target image, the third processing module is specifically configured to:
and inputting each second candidate graph into a trained fourth neural network, aligning the positions of the target objects in each second candidate graph by a sequence alignment sub-network of the fourth neural network to obtain aligned third candidate graphs, and carrying out image processing on each third candidate graph by a sequence processing sub-network of the fourth neural network to obtain the target image.
In one embodiment, the aligning the positions of the target objects in the second candidate graphs by the sequence alignment sub-network to obtain an aligned third candidate graph includes:
the sequence alignment sub-network estimates a motion vector of the target object from the position in each non-reference candidate graph to the position of the reference candidate graph at least through a convolution layer, and performs motion compensation on the target object of each non-reference candidate graph according to the motion vector so as to align the target object in each non-reference candidate graph with the position in the reference candidate graph; the reference candidate graph is one candidate graph in all second candidate graphs, and the non-reference candidate graph is a candidate graph except the reference candidate graph in all second candidate graphs;
the sequence alignment sub-network determines the compensated non-reference candidate map and the reference candidate map as an aligned third candidate map.
In one embodiment, the sequence processing sub-network performs image processing on each third candidate graph to obtain the target image, including:
and the sequence processing sub-network performs channel combination on the third candidate image at least through a combination layer to obtain a multi-channel image, and performs image processing on the multi-channel image through at least one convolution layer to obtain the target image.
In one embodiment, when the third processing module performs image processing on the aligned image to obtain the target image, the third processing module is specifically configured to:
mapping each pixel value in each third candidate image into a designated image according to a preset mapping relation, and taking the designated image obtained after mapping as the target image, wherein the resolution of the designated image is larger than that of each aligned image;
or,
and calculating the statistic value of the pixel value at the same position in each third candidate graph, and generating the target image according to the statistic value at each position.
In one embodiment, when the third processing module performs image processing on each third candidate image to obtain the target image, the third processing module is specifically configured to:
respectively carrying out enhancement processing on each third candidate image, and carrying out fusion processing on each third candidate image after the enhancement processing to obtain the target image;
or,
and carrying out fusion processing on each third candidate image to obtain a reference image, and carrying out enhancement processing on the reference image to obtain the target image.
The implementation processes of the functions and roles of the units in the above apparatus are detailed in the implementation processes of the corresponding steps in the above method, and are not repeated here.
Since the apparatus embodiments substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant parts. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units.
The invention also provides an electronic device, which comprises a processor and a memory; the memory stores a program that can be called by the processor; wherein the processor, when executing the program, implements the image processing method according to any one of the foregoing embodiments.
The embodiment of the image processing apparatus can be applied to an electronic device, and the electronic device can be a camera. Taking software implementation as an example, the apparatus in a logical sense is formed by the processor of the electronic device where the apparatus is located reading corresponding computer program instructions from a nonvolatile memory into memory and running them. In terms of hardware, fig. 17 is a hardware configuration diagram of the electronic device where the image processing apparatus 100 according to an exemplary embodiment of the present invention is located; in addition to the processor 610, the memory 630, the interface 620, and the nonvolatile memory 640 shown in fig. 17, the electronic device where the apparatus 100 is located may further include other hardware according to the actual function of the electronic device, which is not described here again.
The present invention also provides a machine-readable storage medium having stored thereon a program which, when executed by a processor, implements an image processing method as in any of the preceding embodiments.
The present invention may take the form of a computer program product embodied on one or more storage media (including, but not limited to, magnetic disk storage, CD-ROM, optical storage, etc.) having program code embodied therein. Machine-readable storage media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of machine-readable storage media include, but are not limited to: phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that may be used to store information accessible by a computing device.
The foregoing descriptions are merely preferred embodiments of the invention and are not intended to limit the invention; any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (15)

1. An image processing method, characterized by being applied to an imaging apparatus, comprising:
acquiring, from an acquired first video sequence in a first data format, position information of the same target object in at least two video frames of the first video sequence;
intercepting areas corresponding to the position information from each video frame respectively to obtain at least two first images;
performing image processing on all the first images in the first data format to obtain a target image in a second data format, wherein the second data format is suitable for displaying and/or transmitting the target image; the image processing of all the first images in the first data format to obtain the target image in the second data format includes: converting each first image into a second candidate image; aligning the position of the target object in each second candidate image to obtain aligned third candidate images; and performing image processing on each third candidate image to obtain the target image, wherein the image processing at least comprises fusion processing;
The step of converting each first image into a second candidate image, aligning the position of the target object in each second candidate image to obtain the aligned third candidate images, and performing image processing on each third candidate image to obtain the target image includes: inputting each first image into a trained third neural network, carrying out color processing on each input first image by a color processing layer of the third neural network to obtain the second candidate images, aligning the position of the target object in each second candidate image by a sequence alignment sub-network of the third neural network to obtain the aligned third candidate images, and carrying out image processing on each third candidate image by a sequence processing sub-network of the third neural network to obtain the target image;
wherein the color processing layer is configured to perform at least one of the following color processes: graying processing, color channel separation processing, and color information recombination processing; the color processing layer at least comprises a specified convolution layer, and the step length of convolution kernel movement of the specified convolution layer is an integer multiple of the minimum unit of the color arrangement mode of the video frame.
2. The image processing method according to claim 1, wherein acquiring, from the acquired first video sequence in the first data format, the position information of the same target object in at least two video frames of the first video sequence comprises:
Converting each video frame in the first video sequence into a first candidate image, and performing target detection processing on each first candidate image to obtain the position information of a target object in each video frame of the first video sequence;
and selecting the position information of the target object in at least two video frames from the detected position information.
3. The image processing method according to claim 2, wherein converting each video frame in the first video sequence into a first candidate image, and performing target detection processing on each first candidate image to obtain the position information of the target object in each video frame of the first video sequence, comprises:
inputting the first video sequence into a trained first neural network, performing color processing on each video frame in the first video sequence by a color processing layer of the first neural network to obtain first candidate images, and performing target detection processing on each first candidate image by at least one convolution layer of the first neural network to obtain the position information of the target object in each video frame of the first video sequence;
the color processing layer is used for performing at least one of the following color processing: graying processing, color channel separation processing, and color information recombination processing; the color processing layer at least comprises a specified convolution layer, and the step length of convolution kernel movement of the specified convolution layer is an integer multiple of the minimum unit of the color arrangement mode of the video frame.
4. The image processing method of claim 2, wherein converting each video frame in the first video sequence into a first candidate image comprises:
converting each video frame in the first video sequence into a first candidate image by adopting a preprocessing mode; the preprocessing mode at least comprises: color interpolation.
5. The image processing method according to claim 4, wherein performing target detection processing on each first candidate image to obtain the position information of the target object in each video frame of the first video sequence, comprises:
and inputting each first candidate image into a trained second neural network, and performing target detection processing on each first candidate image by at least one convolution layer of the second neural network to obtain the position information of the target object in each video frame of the first video sequence.
6. The image processing method according to claim 1, wherein,
the position information includes: coordinates of a designated point on the target object, and a first size characterizing the size of the target object;
intercepting the areas corresponding to the position information from each video frame respectively to obtain the at least two first images comprises:
for each video frame, determining a reference position required for interception according to the coordinates of the target object in the position information of the video frame and the first size, intercepting a region with a preset size in the video frame by taking the reference position as a reference, and determining the intercepted region as a first image;
or,
for each video frame, cutting out a region of the first size in the video frame by taking the coordinates of the target object in the position information of the video frame as a reference; adjusting the cut-out region from the first size to a target size by scaling or edge expansion, and determining the adjusted region as a first image.
7. The image processing method according to claim 1, wherein converting each first image into a second candidate image includes:
converting each first image into a second candidate image by adopting a preprocessing mode; the preprocessing mode at least comprises: color interpolation.
8. The image processing method according to claim 7, wherein aligning positions of the target object in each of the second candidate images to obtain aligned third candidate images, and performing image processing on each of the third candidate images to obtain the target image, comprises:
and inputting each second candidate image into a trained fourth neural network, aligning the position of the target object in each second candidate image by a sequence alignment sub-network of the fourth neural network to obtain aligned third candidate images, and carrying out image processing on each third candidate image by a sequence processing sub-network of the fourth neural network to obtain the target image.
9. The image processing method according to claim 1 or 8, wherein the aligning, by the sequence alignment sub-network, the position of the target object in each second candidate image to obtain the aligned third candidate images includes:
the sequence alignment sub-network estimates, at least through a convolution layer, a motion vector of the target object from its position in each non-reference candidate image to its position in the reference candidate image, and performs motion compensation on the target object of each non-reference candidate image according to the motion vector, so as to align the target object in each non-reference candidate image with its position in the reference candidate image; the reference candidate image is one of the second candidate images, and the non-reference candidate images are the second candidate images other than the reference candidate image;
the sequence alignment sub-network determines the compensated non-reference candidate images and the reference candidate image as the aligned third candidate images.
10. The image processing method according to claim 1 or 8, wherein the sequence processing sub-network performs image processing on each third candidate image to obtain the target image, comprising:
and the sequence processing sub-network performs channel combination on the third candidate images at least through a combination layer to obtain a multi-channel image, and performs image processing on the multi-channel image through at least one convolution layer to obtain the target image.
11. The image processing method according to claim 1, wherein performing image processing on the aligned images to obtain the target image comprises:
mapping each pixel value in each third candidate image into a designated image according to a preset mapping relation, and taking the designated image obtained after mapping as the target image, wherein the resolution of the designated image is larger than that of each aligned image;
or,
and calculating the statistic value of the pixel values at the same position in the third candidate images, and generating the target image according to the statistic value at each position.
12. The image processing method according to claim 1, wherein image processing each third candidate image to obtain the target image, comprises:
respectively carrying out enhancement processing on each third candidate image, and carrying out fusion processing on each third candidate image after the enhancement processing to obtain the target image;
or,
and carrying out fusion processing on each third candidate image to obtain a reference image, and carrying out enhancement processing on the reference image to obtain the target image.
13. An image processing apparatus, characterized by being applied to an imaging device, comprising:
the first processing module is used for acquiring, from an acquired first video sequence in a first data format, position information of the same target object in at least two video frames of the first video sequence;
the second processing module is used for respectively intercepting the areas corresponding to the position information from each video frame to obtain at least two first images;
the third processing module is used for carrying out image processing on all the first images in the first data format to obtain a target image in a second data format, and the second data format is suitable for displaying and/or transmitting the target image; the image processing of all the first images in the first data format to obtain the target image in the second data format includes: converting each first image into a second candidate image; aligning the position of the target object in each second candidate image to obtain aligned third candidate images; and performing image processing on each third candidate image to obtain the target image, wherein the image processing at least comprises fusion processing;
the step of converting each first image into a second candidate image, aligning the position of the target object in each second candidate image to obtain the aligned third candidate images, and performing image processing on each third candidate image to obtain the target image includes: inputting each first image into a trained third neural network, carrying out color processing on each input first image by a color processing layer of the third neural network to obtain the second candidate images, aligning the position of the target object in each second candidate image by a sequence alignment sub-network of the third neural network to obtain the aligned third candidate images, and carrying out image processing on each third candidate image by a sequence processing sub-network of the third neural network to obtain the target image;
wherein the color processing layer is configured to perform at least one of the following color processes: graying processing, color channel separation processing, and color information recombination processing; the color processing layer at least comprises a specified convolution layer, and the step length of convolution kernel movement of the specified convolution layer is an integer multiple of the minimum unit of the color arrangement mode of the video frame.
14. An electronic device, comprising a processor and a memory; the memory stores a program that can be called by the processor; wherein the processor, when executing the program, implements the image processing method according to any one of claims 1 to 12.
15. A machine readable storage medium having stored thereon a program which, when executed by a processor, implements the image processing method according to any of claims 1-12.
CN201910651715.6A 2019-07-18 2019-07-18 Image processing method, device and equipment and storage medium Active CN112241936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910651715.6A CN112241936B (en) 2019-07-18 2019-07-18 Image processing method, device and equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910651715.6A CN112241936B (en) 2019-07-18 2019-07-18 Image processing method, device and equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112241936A CN112241936A (en) 2021-01-19
CN112241936B true CN112241936B (en) 2023-08-25

Family

ID=74167950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910651715.6A Active CN112241936B (en) 2019-07-18 2019-07-18 Image processing method, device and equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112241936B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822278B (en) * 2021-11-22 2022-02-11 松立控股集团股份有限公司 License plate recognition method for unlimited scene

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104125470A (en) * 2014-08-07 2014-10-29 成都瑞博慧窗信息技术有限公司 Video data transmission method
CN105578042A (en) * 2015-12-18 2016-05-11 深圳市金立通信设备有限公司 Image data transmission method and terminal
WO2016071566A1 (en) * 2014-11-05 2016-05-12 Nokia Corporation Variable resolution image capture
CN107886074A (en) * 2017-11-13 2018-04-06 苏州科达科技股份有限公司 A kind of method for detecting human face and face detection system
CN109242802A (en) * 2018-09-28 2019-01-18 Oppo广东移动通信有限公司 Image processing method, device, electronic equipment and computer-readable medium
CN109583369A (en) * 2018-11-29 2019-04-05 北京邮电大学 A kind of target identification method and device based on target area segmentation network
CN109753929A (en) * 2019-01-03 2019-05-14 华东交通大学 A kind of united high-speed rail insulator inspection image-recognizing method of picture library
CN109886951A (en) * 2019-02-22 2019-06-14 北京旷视科技有限公司 Method for processing video frequency, device and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6824845B2 (en) * 2017-08-09 2021-02-03 キヤノン株式会社 Image processing systems, equipment, methods and programs

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104125470A (en) * 2014-08-07 2014-10-29 成都瑞博慧窗信息技术有限公司 Video data transmission method
WO2016071566A1 (en) * 2014-11-05 2016-05-12 Nokia Corporation Variable resolution image capture
CN105578042A (en) * 2015-12-18 2016-05-11 深圳市金立通信设备有限公司 Image data transmission method and terminal
CN107886074A (en) * 2017-11-13 2018-04-06 苏州科达科技股份有限公司 A kind of method for detecting human face and face detection system
CN109242802A (en) * 2018-09-28 2019-01-18 Oppo广东移动通信有限公司 Image processing method, device, electronic equipment and computer-readable medium
CN109583369A (en) * 2018-11-29 2019-04-05 北京邮电大学 A kind of target identification method and device based on target area segmentation network
CN109753929A (en) * 2019-01-03 2019-05-14 华东交通大学 A kind of united high-speed rail insulator inspection image-recognizing method of picture library
CN109886951A (en) * 2019-02-22 2019-06-14 北京旷视科技有限公司 Method for processing video frequency, device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Jie et al., "Basic Guide to Public Security Video Surveillance", Harbin Engineering University Press, 2017, pp. 27-28. *

Also Published As

Publication number Publication date
CN112241936A (en) 2021-01-19

Similar Documents

Publication Publication Date Title
CN110827200B (en) Image super-resolution reconstruction method, image super-resolution reconstruction device and mobile terminal
US11037278B2 (en) Systems and methods for transforming raw sensor data captured in low-light conditions to well-exposed images using neural network architectures
JP5543605B2 (en) Blur image correction using spatial image prior probability
KR100911890B1 (en) Method, system, program modules and computer program product for restoration of color components in an image model
JP4460839B2 (en) Digital image sharpening device
US9025871B2 (en) Image processing apparatus and method of providing high sensitive color images
CN103871041B (en) The image super-resolution reconstructing method built based on cognitive regularization parameter
CN111260580B (en) Image denoising method, computer device and computer readable storage medium
CN110555877B (en) Image processing method, device and equipment and readable medium
CN112037129A (en) Image super-resolution reconstruction method, device, equipment and storage medium
CN111784603A (en) RAW domain image denoising method, computer device and computer readable storage medium
CN111028165B (en) High-dynamic image recovery method for resisting camera shake based on RAW data
CN107958450B (en) Panchromatic multispectral image fusion method and system based on self-adaptive Gaussian filtering
CN112241668B (en) Image processing method, device and equipment
EP4139840A2 (en) Joint objects image signal processing in temporal domain
CN113628123A (en) Training method and device of image recovery model, electronic equipment and readable medium
CN115170435A (en) Image geometric distortion correction method based on Unet network
CN113379609B (en) Image processing method, storage medium and terminal equipment
CN112241936B (en) Image processing method, device and equipment and storage medium
CN106846250B (en) Super-resolution reconstruction method based on multi-scale filtering
CN111311498A (en) Image ghost eliminating method and device, storage medium and terminal
CN113379608A (en) Image processing method, storage medium and terminal equipment
CN112241670A (en) Image processing method and device
CN112241935A (en) Image processing method, device and equipment and storage medium
KR20220054044A (en) Demosaicing method and demosaicing device based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant