
CN108805874B - Multispectral image semantic cutting method based on convolutional neural network - Google Patents


Info

Publication number
CN108805874B
CN108805874B (application CN201810595762.9A)
Authority
CN
China
Prior art keywords
data
different
convolution
neural network
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810595762.9A
Other languages
Chinese (zh)
Other versions
CN108805874A (en)
Inventor
李含伦
戴玉成
张小博
张晓灿
唐文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Third Research Institute Of China Electronics Technology Group Corp
Original Assignee
Third Research Institute Of China Electronics Technology Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Third Research Institute Of China Electronics Technology Group Corp
Priority to CN201810595762.9A
Publication of CN108805874A
Application granted
Publication of CN108805874B
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10032Satellite or aerial image; Remote sensing
    • G06T2207/10036Multispectral image; Hyperspectral image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multispectral image semantic segmentation method based on a convolutional neural network. Through a network with multi-resolution input and channel-independent convolution, the invention overcomes the limitation that the standard U-NET network accepts only a single fixed-scale RGB or grayscale image, effectively improves the efficiency of multispectral image semantic segmentation, and preserves segmentation precision.

Description

Multispectral image semantic cutting method based on convolutional neural network
Technical Field
The invention relates to a multispectral image semantic segmentation method based on a convolutional neural network.
Background
Currently, state-of-the-art semantic segmentation frameworks for RGB images commonly employ end-to-end deep convolutional neural networks (DCNNs). The usual practice is to build on pre-trained classification models, most commonly VGG, ResNet and the like. A DCNN for semantic segmentation usually consists of two parts: a front end, which is a well-established classification DCNN, and a back end, which maps the feature maps to per-pixel labels. To economize on training samples, the front end directly adopts the pre-trained model parameters, and only the back-end parameters are fine-tuned.
A representative image semantic segmentation network is the fully convolutional network (FCN), whose initial version is based on VGG-16. The second half of VGG-16, designed for classification, is fully connected, and the fully connected operations discard the spatial information of the feature maps, making them unusable for semantic segmentation. The FCN therefore replaces the fully connected portion of VGG-16 with convolutions, recovers a per-pixel feature representation using upsampling and deconvolution, and then computes a class label for each pixel. The main disadvantage of this network is that the feature map, shrunk by five 2× poolings (a factor of 32), must be restored by upsampling; a large amount of spatial information lost during downsampling cannot be recovered during upsampling, so the segmentation result is very coarse. A common refinement is to post-process the FCN output with a conditional random field (CRF). This alleviates the coarseness of the FCN result to some extent, but further increases memory consumption and computation time.
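The coarseness problem can be seen in a few lines of NumPy. The sketch below is illustrative only, not the FCN implementation: it shrinks a feature map through five 2× max poolings, as in VGG-16, then restores its size with five 2× nearest-neighbour upsamplings standing in for learned deconvolution. The size comes back; the content does not.

```python
import numpy as np

def downsample2x(x):
    """2x2 max pooling, as in each of VGG-16's five pooling stages."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2x(x):
    """Nearest-neighbour 2x upsampling (a crude stand-in for deconvolution)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

rng = np.random.default_rng(0)
fmap = rng.random((32, 32))

# Five 2x poolings shrink the map by a factor of 2**5 = 32 ...
small = fmap
for _ in range(5):
    small = downsample2x(small)
assert small.shape == (1, 1)

# ... and five 2x upsamplings restore the size but not the content.
restored = small
for _ in range(5):
    restored = upsample2x(restored)
assert restored.shape == fmap.shape
err = np.abs(restored - fmap).mean()
assert err > 0.0  # the lost spatial detail is not recoverable -> coarse masks
```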
DeepLab is another influential image semantic segmentation network; it is built on a deep residual network (ResNet) DCNN. DeepLab addresses the downsampling problem by replacing conventional convolution kernels with atrous (dilated) convolution kernels. An atrous kernel inserts zero-weight taps at fixed intervals between the elements of a conventional kernel, so the receptive field of the kernel grows without any increase in trainable parameters; an image processed by a stack of such convolutions can therefore retain its original size. However, researchers have found that using atrous kernels throughout the entire network is very inefficient, so conventional and atrous kernels must be used together.
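The zero-insertion construction can be sketched directly. In the snippet below (a minimal illustration; `dilate_kernel` is a hypothetical helper name, not DeepLab code), a 3×3 kernel dilated at rate 2 covers a 5×5 receptive field while keeping its 9 trainable weights.

```python
import numpy as np

def dilate_kernel(k, rate):
    """Insert rate-1 zero-weight taps between kernel elements (atrous convolution)."""
    kh, kw = k.shape
    out = np.zeros(((kh - 1) * rate + 1, (kw - 1) * rate + 1), dtype=k.dtype)
    out[::rate, ::rate] = k  # original weights land on a strided grid
    return out

k3 = np.ones((3, 3))
k_dil = dilate_kernel(k3, rate=2)
assert k_dil.shape == (5, 5)      # receptive field grows from 3x3 to 5x5 ...
assert (k_dil != 0).sum() == 9    # ... with the same 9 trainable weights
```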
U-NET is a neural network originally designed for biomedical image segmentation. Because its downsampling and upsampling halves are essentially symmetric, the authors drew the architecture as a figure shaped like the letter U, hence the name. Broadly, U-NET is a type of FCN. Since the network won the 2015 ISBI cell tracking challenge and was widely reported, it has been highly influential in image semantic segmentation research, especially biomedical image segmentation. The U-NET design is ingenious: the left half is a resolution-contracting path, the right half a resolution-expanding path, and the two are mirror-symmetric. In the expanding path, the feature map produced by each upsampling step is fused by concatenation with the corresponding feature map from the contracting path, so the convolution kernels at each resolution of the expanding path see both the upsampled lower-level features and the same-resolution features from the left side; this minimizes the spatial information lost to feature-map rescaling.
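The skip fusion of the expanding path amounts to an upsample-then-concatenate step. A minimal NumPy sketch, with illustrative channel counts and sizes (channels-first layout):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
decoder_low = rng.random((128, 16, 16))   # decoder features one level down
encoder_skip = rng.random((64, 32, 32))   # symmetric encoder features

up = upsample2x(decoder_low)              # -> (128, 32, 32)
fused = np.concatenate([up, encoder_skip], axis=0)  # channel concatenation
assert fused.shape == (192, 32, 32)
```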
These image semantic segmentation results were designed for ordinary digital images or medical scans, and their basic assumption is that the images used for training and segmentation share a uniform specification. For example, they assume all training samples have the same number of channels (three-channel RGB or single-channel grayscale images). This makes them difficult to apply to multispectral imagery. First, multispectral images typically have many channels carrying very different amounts of information, yet in conventional convolution each kernel convolves all feature maps and accumulates the results, implicitly treating every channel as equally informative for classification. Second, a single satellite is often equipped with several types of multispectral sensors. For example, the WorldView-3 commercial remote sensing satellite simultaneously acquires a panchromatic image, multispectral data in the visible/near-infrared range beginning near 400 nm (coastal, blue, green, yellow, red, red edge, near-IR1 and near-IR2 bands), and shortwave-infrared data; the three products have 1, 8 and 8 bands respectively, with nadir resolutions of 0.31 m, 1.24 m and 7.5 m. Data collected by different sensor types therefore differ not only in band count and band type but also in resolution. If all low-resolution images are forcibly interpolated up to match the high-resolution images, some convolution operations on the low-resolution portions are wasted, which not only costs a large amount of computation time but may also interfere with the segmentation result. If instead the high-resolution images are downsampled to match the low-resolution scale, a large amount of their spatial information is lost.
Disclosure of Invention
The invention aims to provide a multispectral image semantic segmentation method based on a convolutional neural network that improves the efficiency of multispectral image semantic segmentation while preserving segmentation precision.
The technical scheme for realizing the purpose of the invention is as follows:
a multispectral image semantic cutting method based on a convolutional neural network is characterized by comprising the following steps: and independently convolving each data channel of the multispectral image by using a convolutional neural network, and then fusing the feature maps after independent convolution of each data channel.
Further, when independently convolving each data channel of the multispectral image, convolution kernels of different sizes and numbers are selected for different spectral bands.
Further, when independently convolving each data channel of the multispectral image, different numbers of convolution layers are selected for different spectral bands.
Further, data of different resolutions are input to the convolution layers of the corresponding scale levels.
Further, when data of different resolutions are input to the convolution layers of the corresponding scale levels, the input data are fused with the pooled feature map at that convolution layer.
Further, data of different resolutions are normalized to the highest resolution among them, concatenated, and input to the network in a single pass; the different categories of data are then separated inside the network and each processed back to its required size; and the data of different resolutions are input to the convolution layers of the corresponding scale levels.
The invention has the following beneficial effects:
aiming at the condition that the data difference between different channels of multispectral data is large, the method uses a convolution neural network to independently convolve each data channel of the multispectral image, and then fuses the feature maps after independent convolution of each data channel. When independently convolving each data channel of the multispectral image, different sizes and different numbers of convolution kernels can be selected according to different wave bands, and different numbers of convolution layers can be selected according to different wave bands.
For the case where multispectral data differ greatly in resolution, the U-NET network is transformed into a convolutional neural network supporting multi-resolution input: data of each resolution are input to the convolution layers of the corresponding scale level and fused with the feature map obtained by pooling the output of the preceding layer, where the preceding layer is the layer immediately above the convolution layer that receives the input. Through multi-resolution input and channel-independent convolution, the invention overcomes the limitation that the standard U-NET network accepts only a single fixed-scale RGB or grayscale image, effectively improves the efficiency of multispectral image semantic segmentation, and preserves segmentation precision.
For the case where a network model on most deep learning development platforms accepts only a single input, the invention normalizes data of different resolutions to the highest resolution among them, concatenates them, and feeds them to the network in one pass; the different categories of data are then separated inside the network and each processed back to its appropriate size, which effectively guarantees the reliability of multispectral image semantic segmentation.
Drawings
FIG. 1 is a schematic diagram of the independent convolution of multiple image channels of the present invention;
fig. 2 is a schematic diagram of the channel independent convolution and multi-scale input U-NET deep neural network of the present invention.
Detailed Description
Embodiment one:
For a multispectral image, the bands are first separated by wavelength, and an independent convolution operation is then performed on each band: a convolutional neural network convolves each data channel of the multispectral image independently, and the independently convolved feature maps of the channels are then fused (concatenated). When convolving each channel independently, convolution kernels of different sizes and numbers are selected for different spectral bands, as are different numbers of convolution layers. In this implementation, the convolutional neural network is a U-NET neural network.
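As a hedged illustration of this embodiment (shapes, band names and per-band kernel plans below are invented for the example, and a real network would learn the kernels rather than draw them at random), the NumPy sketch convolves each band with its own set of kernels — different sizes and counts per band — and fuses the results by channel concatenation:

```python
import numpy as np

def conv2d_same(img, k):
    """Naive single-channel 'same' convolution with zero padding."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    h, w = img.shape
    out = np.empty((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = (padded[i:i + kh, j:j + kw] * k).sum()
    return out

rng = np.random.default_rng(1)
bands = {"red": rng.random((16, 16)), "nir1": rng.random((16, 16))}

# Hypothetical per-band plans: (kernel_size, n_kernels) differ per band.
plans = {"red": (3, 4), "nir1": (5, 2)}

feature_maps = []
for name, img in bands.items():
    ksize, nk = plans[name]
    for _ in range(nk):  # these kernels see this band only: channel-independent
        k = rng.standard_normal((ksize, ksize))
        feature_maps.append(conv2d_same(img, k))

fused = np.stack(feature_maps)  # fusion of the per-channel feature maps
assert fused.shape == (6, 16, 16)  # 4 red maps + 2 nir1 maps
```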
Embodiment two:
when the multispectral image has multiple resolutions, on the basis of adopting the multi-channel independent convolution in the first embodiment, the multi-channel independent convolution and multi-resolution input network is adopted in the second embodiment. As shown in fig. 2, the U-NET network is modified to support a convolutional neural network with multiple resolution inputs. Similar to the traditional U-NET network, the network of the invention is composed of a scale contraction part and a scale expansion part, wherein the scale contraction part is composed of a classical convolution network, the image size is reduced along with the increase of the convolution pooling times along with the increase of the convolution hierarchy, and the number of convolution kernels is increased along with the increase of the pooling times. The scale expansion part is the same as that of the U-NET network, the scale is increased by two times in each up-sampling step of the scale expansion part, and the number of convolution kernels is reduced by half. After each up-sampling, the sampled feature map and the feature map with the same scale as the symmetric part (the contracted part) need to be subjected to merging operation (summation). Different from the traditional U-NET network, the data with different resolutions are input into the convolutional layers with corresponding different scale levels, the input data is fused with the characteristic diagram of the convolutional layer which is formed by pooling the previous layer, and the previous layer is the previous layer of the convolutional layer corresponding to the input data. In fig. 
2, the open arrows indicate channel independent convolution, the right thin arrows indicate channel duplication, the juxtaposition of solid and open rectangles indicates a fusion operation, the downward wide arrows indicate a lower pooling operation, the upward wide arrows indicate an upper pooling operation, and the rightward wide arrows indicate a classical convolution operation.
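The multi-resolution injection can be sketched as follows (a NumPy toy with illustrative shapes, not the patented network): the high-resolution band enters at the top level; after one pooling, the resulting feature map has the same scale as the lower-resolution bands, which are fused in at that level.

```python
import numpy as np

def maxpool2x(x):
    """2x2 max pooling of a (C, H, W) feature map."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

rng = np.random.default_rng(2)
pan = rng.random((1, 64, 64))   # high-resolution band, fed to the top level
ms = rng.random((8, 32, 32))    # lower-resolution bands, fed one level down

feat0 = pan                     # top-level features (convolutions omitted)
feat1 = maxpool2x(feat0)        # pooled features now at the 32x32 scale
fused = np.concatenate([feat1, ms], axis=0)  # inject ms where scales match
assert fused.shape == (9, 32, 32)
```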
For the case where a network model on most deep learning development platforms accepts only a single input, the invention normalizes data of different resolutions to the highest resolution among them, concatenates them, and feeds them to the network in one pass; the different categories of data are then separated inside the network and each processed back to its appropriate size, which effectively guarantees the reliability of multispectral image semantic segmentation.
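A minimal sketch of this single-input workaround, with invented shapes: everything is upsampled to the highest resolution and stacked for the one-shot input, then split inside the "network" and pooled back to the native scales.

```python
import numpy as np

def upsample_to(x, h, w):
    """Nearest-neighbour upsampling of a (C, H, W) map to (C, h, w)."""
    c, h0, w0 = x.shape
    return x.repeat(h // h0, axis=1).repeat(w // w0, axis=2)

def avgpool_to(x, h, w):
    """Average pooling of a (C, H, W) map back down to (C, h, w)."""
    c, h0, w0 = x.shape
    return x.reshape(c, h, h0 // h, w, w0 // w).mean(axis=(2, 4))

rng = np.random.default_rng(3)
pan = rng.random((1, 64, 64))   # highest-resolution band
ms = rng.random((8, 32, 32))    # coarser bands

# One-shot input: normalize everything to the highest resolution and stack.
packed = np.concatenate([pan, upsample_to(ms, 64, 64)], axis=0)
assert packed.shape == (9, 64, 64)

# Inside the network: split the categories, pool each back to its own scale.
pan_in, ms_in = packed[:1], avgpool_to(packed[1:], 32, 32)
assert pan_in.shape == (1, 64, 64) and ms_in.shape == (8, 32, 32)
```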

Claims (2)

1. A multispectral image semantic segmentation method based on a convolutional neural network, characterized in that: each data channel of the multispectral image is convolved independently by a convolutional neural network, and the independently convolved feature maps of the channels are then fused;
when independently convolving each data channel of the multispectral image, convolution kernels of different sizes and numbers are selected for different spectral bands;
when independently convolving each data channel of the multispectral image, different numbers of convolution layers are selected for different spectral bands;
the convolutional neural network is a U-NET neural network;
the U-NET neural network supports input of data at multiple resolutions, with data of each resolution input to the convolution layers of the corresponding scale level;
when data of different resolutions are input to the convolution layers of the corresponding scale levels, the input data are fused with the feature map obtained by pooling the preceding layer, where the preceding layer is the layer immediately above the convolution layer that receives the input data.
2. The convolutional neural network-based multispectral image semantic segmentation method of claim 1, characterized in that: data of different resolutions are normalized to the highest resolution among them, concatenated, and input to the network in a single pass; the different categories of data are then separated inside the network and each processed back to its required size; and the data of different resolutions are input to the convolution layers of the corresponding scale levels.
CN201810595762.9A 2018-06-11 2018-06-11 Multispectral image semantic cutting method based on convolutional neural network Active CN108805874B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810595762.9A CN108805874B (en) 2018-06-11 2018-06-11 Multispectral image semantic cutting method based on convolutional neural network


Publications (2)

Publication Number Publication Date
CN108805874A CN108805874A (en) 2018-11-13
CN108805874B true CN108805874B (en) 2022-04-22

Family

ID=64088190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810595762.9A Active CN108805874B (en) 2018-06-11 2018-06-11 Multispectral image semantic cutting method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN108805874B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113168684B (en) * 2018-11-26 2024-04-05 Oppo广东移动通信有限公司 Method, system and computer readable medium for improving quality of low brightness images
CN111382761B (en) * 2018-12-28 2023-04-07 展讯通信(天津)有限公司 CNN-based detector, image detection method and terminal
CN110009637B (en) * 2019-04-09 2021-04-16 北京化工大学 Remote sensing image segmentation network based on tree structure
CN110163852B (en) * 2019-05-13 2021-10-15 北京科技大学 Conveying belt real-time deviation detection method based on lightweight convolutional neural network
CN110969182A (en) * 2019-05-17 2020-04-07 丰疆智能科技股份有限公司 Convolutional neural network construction method and system based on farmland image
CN110852385B (en) * 2019-11-12 2022-07-12 北京百度网讯科技有限公司 Image processing method, device, equipment and storage medium
CN113034535A (en) * 2019-12-24 2021-06-25 无锡祥生医疗科技股份有限公司 Fetal head segmentation method, fetal head segmentation device and storage medium
CN111428781A (en) * 2020-03-20 2020-07-17 中国科学院深圳先进技术研究院 Remote sensing image ground object classification method and system
CN112184554B (en) * 2020-10-13 2022-08-23 重庆邮电大学 Remote sensing image fusion method based on residual mixed expansion convolution
CN112633171B (en) * 2020-12-23 2024-08-02 北京恒达时讯科技股份有限公司 Sea ice identification method and system based on multisource optical remote sensing image
CN113159038B (en) * 2020-12-30 2022-05-27 太原理工大学 Coal rock segmentation method based on multi-mode fusion

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101916435A (en) * 2010-08-30 2010-12-15 武汉大学 Method for fusing multi-scale spectrum projection remote sensing images
US9760980B1 (en) * 2015-03-25 2017-09-12 Amazon Technologies, Inc. Correcting moiré pattern effects
CN107993229A (en) * 2017-12-15 2018-05-04 西安中科微光影像技术有限公司 A kind of tissue classification procedure and device based on cardiovascular IVOCT images


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
AdapNet: Adaptive semantic segmentation in adverse environmental conditions; Abhinav Valada et al.; 2017 IEEE International Conference on Robotics and Automation (ICRA); 2017-07-24; full text *
Analysis of unstructured road surfaces based on multi-channel convolutional neural networks; Cui Wei et al.; Computer Applications and Software (《计算机应用与软件》); 2016-01-31; vol. 33, no. 1, section 2.2 *


Similar Documents

Publication Publication Date Title
CN108805874B (en) Multispectral image semantic cutting method based on convolutional neural network
WO2021184891A1 (en) Remotely-sensed image-based terrain classification method, and system
CN108717569B (en) Expansion full-convolution neural network device and construction method thereof
CN110232394B (en) Multi-scale image semantic segmentation method
CN108596330B (en) Parallel characteristic full-convolution neural network device and construction method thereof
US11151403B2 (en) Method and apparatus for segmenting sky area, and convolutional neural network
CN113222835B (en) Remote sensing full-color and multi-spectral image distributed fusion method based on residual error network
CN111259905B (en) Feature fusion remote sensing image semantic segmentation method based on downsampling
CN109102469B (en) Remote sensing image panchromatic sharpening method based on convolutional neural network
WO2020056791A1 (en) Method and apparatus for super-resolution reconstruction of multi-scale dilated convolution neural network
CN111768432A (en) Moving target segmentation method and system based on twin deep neural network
CN112418176A (en) Remote sensing image semantic segmentation method based on pyramid pooling multilevel feature fusion network
CN110428387A (en) EO-1 hyperion and panchromatic image fusion method based on deep learning and matrix decomposition
CN108038519B (en) Cervical image processing method and device based on dense feature pyramid network
CN110866879B (en) Image rain removing method based on multi-density rain print perception
CN112001928B (en) Retina blood vessel segmentation method and system
CN111401455B (en) Remote sensing image deep learning classification method and system based on Capsules-Unet model
CN112966580B (en) Remote sensing image green tide information extraction method based on deep learning and super-resolution
CN106910202B (en) Image segmentation method and system for ground object of remote sensing image
CN111008936A (en) Multispectral image panchromatic sharpening method
CN110717921B (en) Full convolution neural network semantic segmentation method of improved coding and decoding structure
CN116309070A (en) Super-resolution reconstruction method and device for hyperspectral remote sensing image and computer equipment
CN110930409A (en) Salt body semantic segmentation method based on deep learning and semantic segmentation model
CN110348411A (en) A kind of image processing method, device and equipment
CN111951164A (en) Image super-resolution reconstruction network structure and image reconstruction effect analysis method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant