CN116630388A - Thermal imaging image binocular parallax estimation method and system based on deep learning - Google Patents
- Publication number
- CN116630388A (application CN202310913387.9A)
- Authority
- CN
- China
- Prior art keywords
- feature map
- feature
- binocular
- image
- matching cost
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The application provides a thermal imaging image binocular parallax estimation method and system based on deep learning. After a binocular thermal imaging image is acquired, an initial feature map of the binocular thermal imaging image is extracted, a scale feature map of the initial feature map is acquired according to a preset scale, and the scale feature map and the initial feature map are combined to obtain a binocular feature map. A cross-correlation matching cost volume and a cascade matching cost volume of the binocular feature map are then constructed and combined to obtain a total matching cost volume; the total matching cost volume is encoded and decoded to obtain aggregate feature maps of several network depths, and a disparity map is generated from the aggregate feature maps of the several network depths. The method can effectively extract feature information from thermal imaging images taken from different viewing angles to generate a disparity map, and improves the accuracy and stability of acquiring depth information from thermal imaging images.
Description
Technical Field
The application relates to the fields of deep learning and computer vision, and in particular to a thermal imaging image binocular parallax estimation method and system based on deep learning.
Background
Thermal imaging is a non-contact measurement technique that acquires the temperature distribution of an object's surface by detecting the infrared radiation it emits. It is used in a wide variety of fields, such as industry, medicine, the military and security. Binocular disparity estimation is a computer vision perception technique that calculates the depth information of objects in a scene by analyzing images acquired from different angles. Binocular disparity estimation methods are typically based on computer vision perception algorithms, such as region matching, stereo matching and optical flow estimation, and use visible light images or thermal imaging images for binocular disparity estimation.
However, there are some differences between thermal imaging images and visible light images. Thermal imaging images are characterized by a narrow gray-scale range, low image resolution and unclear texture information, so binocular disparity estimation methods cannot fully exploit the characteristics of thermal imaging images, and their accuracy and stability are limited when processing them. Meanwhile, binocular disparity estimation for thermal imaging has mainly focused on preprocessing and feature extraction of thermal imaging images with conventional computer vision algorithms, which limits the accuracy and stability of depth information acquisition.
Disclosure of Invention
The application provides a thermal imaging image binocular parallax estimation method and system based on deep learning, which are used to solve the problem of low accuracy and stability in obtaining depth information from thermal imaging images.
In a first aspect, the present application provides a thermal imaging image binocular parallax estimation method based on deep learning, which is characterized by comprising:
acquiring a binocular thermal imaging image, wherein the binocular thermal imaging image comprises a first image and a second image;
extracting an initial feature map of the binocular thermal imaging image;
obtaining a scale feature map of the initial feature map according to a preset scale, and combining the scale feature map and the initial feature map to obtain a binocular feature map, wherein the binocular feature map comprises a first feature map and a second feature map;
constructing a cross-correlation matching cost volume and a cascade matching cost volume of the binocular feature map, and combining the cross-correlation matching cost volume and the cascade matching cost volume to obtain a total matching cost volume, wherein the cross-correlation matching cost volume is obtained based on feature cross-correlation operation, and the cascade matching cost volume is obtained based on feature cascade;
coding and decoding the total matching cost volume to obtain an aggregation feature map of various network depths;
and generating a disparity map according to the aggregate feature maps of the plurality of network depths.
In a second aspect, the application provides a thermal imaging image binocular parallax estimation system based on deep learning, which comprises an image acquisition module, a feature extraction module, a multi-level average pooling module, a matching cost volume construction module, a three-dimensional convolution aggregation module and a disparity map generation module, wherein:
the image acquisition module is used for acquiring a binocular thermal imaging image, wherein the binocular thermal imaging image comprises a first image and a second image;
the feature extraction module is used for extracting an initial feature map of the binocular thermal imaging image;
the multi-level average pooling module is used for acquiring a scale feature map of the initial feature map according to a preset scale, and combining the scale feature map with the initial feature map to obtain a binocular feature map, wherein the binocular feature map comprises a first feature map and a second feature map;
the matching cost volume construction module is used for constructing a cross-correlation matching cost volume and a cascade matching cost volume of the binocular feature map, combining the cross-correlation matching cost volume and the cascade matching cost volume to obtain a total matching cost volume, wherein the cross-correlation matching cost volume is obtained based on feature cross-correlation operation, and the cascade matching cost volume is obtained based on feature cascade;
the three-dimensional convolution aggregation module is used for encoding and decoding the total matching cost volume so as to obtain an aggregation feature map with various network depths;
and the disparity map generating module is used for generating a disparity map according to the aggregation feature maps of the plurality of network depths.
According to the technical scheme, the thermal imaging image binocular parallax estimation method and system based on deep learning can, after the binocular thermal imaging image is acquired, extract an initial feature map of the binocular thermal imaging image, acquire a scale feature map of the initial feature map according to a preset scale, and combine the scale feature map and the initial feature map to obtain the binocular feature map. A cross-correlation matching cost volume and a cascade matching cost volume of the binocular feature map are then constructed and combined to obtain a total matching cost volume; the total matching cost volume is encoded and decoded to obtain aggregate feature maps of several network depths, and a disparity map is generated from the aggregate feature maps of the several network depths. The method can effectively extract feature information from thermal imaging images taken from different viewing angles to generate a disparity map, and improves the accuracy and stability of acquiring depth information from thermal imaging images.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the application, and that a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a flow chart of a binocular disparity estimation method according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of obtaining a binocular feature map according to an embodiment of the present application;
Fig. 3 is a schematic flow chart of acquiring a total matching cost volume according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a three-dimensional convolution aggregation module according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments, examples of which are illustrated in the accompanying drawings. Where the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The embodiments described below do not represent all embodiments consistent with the application; they are merely examples of systems and methods consistent with aspects of the application as set forth in the claims.
Binocular disparity estimation is a computer vision perception technique that calculates the depth information of objects in a scene by analyzing images acquired from different angles. Binocular disparity estimation methods are typically based on computer vision perception algorithms, such as region matching, stereo matching and optical flow estimation, and use visible light images or thermal imaging images for binocular disparity estimation.
However, there are some differences between thermal imaging images and visible light images. Thermal imaging images are characterized by a narrow gray-scale range, low image resolution and unclear texture information, so binocular disparity estimation methods cannot fully exploit the characteristics of thermal imaging images, and their accuracy and stability are limited when processing them. Meanwhile, binocular disparity estimation for thermal imaging has mainly focused on preprocessing and feature extraction of thermal imaging images with conventional computer vision algorithms, which leaves the accuracy and stability of depth information acquisition low.
In order to solve the above-mentioned problems, some embodiments of the present application provide a thermal imaging image binocular disparity estimation method based on deep learning, applied to a binocular disparity estimation system comprising an image acquisition module, a feature extraction module, a multi-level average pooling module, a matching cost volume construction module, a three-dimensional convolution aggregation module and a disparity map generation module. Fig. 1 is a flow chart of the binocular disparity estimation method according to an embodiment of the present application. As shown in fig. 1, the method includes the following steps:
s100: a binocular thermographic image is acquired.
The binocular thermal imaging image comprises a first image and a second image, acquired from two viewing angles of the same scene. For example, the first image and the second image are images of the same scene acquired from two viewing angles, each of size 384×1248 (H×W), that is, 384 pixels high and 1248 pixels wide. It should be noted that the batch size (the number of samples per training step) of the binocular disparity estimation model provided by the application is 1, so the sample dimension can be ignored and only the four dimensions of feature, disparity, height and width need be considered.
In some embodiments, after the first image and the second image are acquired, epipolar registration may be performed on them so that homonymous points in the two images lie on the same horizontal line, i.e., differ only in their column coordinate. The epipolar-registered binocular thermal imaging image is then used as the input of the binocular disparity estimation model.
S200: an initial feature map of the binocular thermal imaging image is extracted.
After the binocular thermal imaging image is acquired, its feature information can be extracted by the feature extraction module. The feature extraction module comprises several convolution layers with kernels of a preset size and several groups of residual blocks: the binocular thermal imaging image is input into the convolution layers, and the resulting feature map is passed sequentially through the groups of residual blocks to obtain the initial feature map. The groups of residual blocks include residual blocks that deepen the feature extraction network and residual blocks that enlarge the receptive field with dilated convolution.
For example, the first image and the second image, both of size 384×1248 (H×W), are epipolar-registered, giving inputs of size 384×1248×3 (3 being the number of channels). Features are extracted from the registered first and second images to generate the initial feature map. First, the two images are passed through three convolution layers with 3×3 kernels to expand the feature dimension, outputting a feature map of size 192×624×32.
It can be understood that the feature extraction module provided by the application discards the large convolution kernel (e.g., 7×7) and adopts a deeper network of three 3×3 convolution layers, where the first layer may use stride 2 so that the height and width are halved. The stacked 3×3 layers obtain the same receptive field as a single 7×7 convolution layer while effectively reducing the number of parameters, increasing the network capacity, improving efficiency and expanding the feature output channels.
Then, the 192×624×32 feature map passes through four groups of residual blocks, conv1_3, conv2_16, conv3_3 and conv4_3, which further extract the feature information.
The first group of residual blocks conv1_3 stacks three residual blocks of two 3×3 convolutions with 32 channels, with stride 1 and output size 192×624×32;
the second group of residual blocks conv2_16 stacks sixteen residual blocks of two 3×3 convolutions with 64 channels, with stride 2, so the feature dimension is doubled and the height and width are halved, deepening the feature extraction network; the output size is 96×312×64;
the third group conv3_3 and the fourth group conv4_3 each stack three residual blocks of two 3×3 convolutions with 128 channels and apply dilated convolution, further increasing the receptive field.
The dilation rates are set to 2 and 4, respectively, and the output size is 96×312×128; that is, the feature extraction module yields an initial feature map of size 96×312×128.
It can be appreciated that dilated convolution lets the convolution output capture a wider range of neighborhood pixel information at the same computational cost, without changing the feature map size. In addition, the network parameters of the feature extraction module are shared between the first image and second image inputs.
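To make the structure above concrete, the following is a minimal PyTorch sketch of such a feature extraction module. The channel counts, strides and dilation rates follow the sizes reported above, while the class and function names, the two-convolution residual block design and the normalization choices are illustrative assumptions rather than the application's exact implementation.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_c, out_c, stride=1, dilation=1):
    # 3x3 convolution; padding = dilation keeps the spatial size for stride 1
    return nn.Sequential(
        nn.Conv2d(in_c, out_c, 3, stride, padding=dilation,
                  dilation=dilation, bias=False),
        nn.BatchNorm2d(out_c), nn.ReLU(inplace=True))

class BasicBlock(nn.Module):
    """Residual block of two 3x3 convolutions (an assumed design)."""
    def __init__(self, in_c, out_c, stride=1, dilation=1):
        super().__init__()
        self.conv1 = conv_bn_relu(in_c, out_c, stride, dilation)
        self.conv2 = nn.Sequential(
            nn.Conv2d(out_c, out_c, 3, 1, padding=dilation,
                      dilation=dilation, bias=False),
            nn.BatchNorm2d(out_c))
        self.shortcut = (nn.Conv2d(in_c, out_c, 1, stride, bias=False)
                         if stride != 1 or in_c != out_c else nn.Identity())

    def forward(self, x):
        return torch.relu(self.conv2(self.conv1(x)) + self.shortcut(x))

def make_group(in_c, out_c, blocks, stride=1, dilation=1):
    layers = [BasicBlock(in_c, out_c, stride, dilation)]
    layers += [BasicBlock(out_c, out_c, 1, dilation) for _ in range(blocks - 1)]
    return nn.Sequential(*layers)

class FeatureExtraction(nn.Module):
    def __init__(self):
        super().__init__()
        # three 3x3 convolutions in place of one 7x7; the first has stride 2
        self.stem = nn.Sequential(conv_bn_relu(3, 32, stride=2),
                                  conv_bn_relu(32, 32), conv_bn_relu(32, 32))
        self.conv1_3 = make_group(32, 32, 3)               # -> 192 x 624 x 32
        self.conv2_16 = make_group(32, 64, 16, stride=2)   # -> 96 x 312 x 64
        self.conv3_3 = make_group(64, 128, 3, dilation=2)  # dilated, -> 96 x 312 x 128
        self.conv4_3 = make_group(128, 128, 3, dilation=4) # dilated, -> 96 x 312 x 128

    def forward(self, x):
        x = self.conv1_3(self.stem(x))
        l2 = self.conv2_16(x)                  # kept for the later merging step
        l4 = self.conv4_3(self.conv3_3(l2))
        return l2, l4
```

The same module instance processes both the first and the second image, consistent with the parameter sharing noted above.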
S300: and obtaining a scale feature map of the initial feature map according to a preset scale, and combining the scale feature map and the initial feature map to obtain a binocular feature map.
Compared with images taken by ordinary cameras, thermal imaging images may exhibit blurred edges, indistinct layering, gradual colors and sparse textures. Therefore, after feature extraction, the application adds a multi-level average pooling module to better aggregate global and local context feature information.
The multi-level average pooling module comprises several adaptive average pooling layers of preset scales, each followed by a convolution layer of preset size that adjusts the number of feature channels. The initial feature map is first input into the average pooling layers of preset scales, the feature maps output by the pooling layers are then input into the convolution layers, and finally the outputs of the convolution layers are restored to the size of the initial feature map, yielding the scale feature maps. The scale feature maps are merged with the initial feature map to obtain the binocular feature map.
Illustratively, as shown in fig. 2, the multi-level average pooling module consists mainly of four adaptive average pooling layers at the four scales 64×64, 32×32, 16×16 and 8×8, each followed by a 3×3 convolution layer that adjusts the number of feature channels to 32. The initial feature map output by the feature extraction module has size 96×312×128; after passing through the four average pooling layers and their 3×3 convolution layers, the number of feature channels becomes 32.
Since the output feature maps of the different average pooling layers must later be merged, the feature maps output by the convolution layers are upsampled back to the size of the module input, giving four scale feature maps of size 96×312×32 that cover four scales and serve as global context information.
Finally, the four scale feature maps are merged with the initial feature map. During merging, the output of the residual blocks that deepen the feature extraction network and the output of the residual blocks that enlarge the receptive field with dilated convolution are obtained and merged with the scale feature maps to give the binocular feature map. That is, as shown in fig. 2, the four scale feature maps are merged with the output of the second group of residual blocks conv2_16 and the output of the fourth group of residual blocks conv4_3 in the feature extraction module, combining global and local context information. The final binocular feature map size is (96×312×32)×4 + 96×312×64 + 96×312×128 = 96×312×320, i.e., 32×4 + 64 + 128 = 320 channels, which makes the information in the subsequently constructed matching cost volume more complete.
The first feature map and the second feature map correspond to the first image and the second image captured by the binocular camera heads. Depending on how the heads are arranged, the first and second images may be referred to as the left and right images, the upper and lower images, or two images in another positional relationship, and the corresponding feature maps as the left and right feature maps, the upper and lower feature maps, and so on; the application does not particularly limit this.
It can be appreciated that the multi-level average pooling module provided by the application effectively mitigates inaccurate feature extraction in the ill-posed regions of thermal imaging images where edges are blurred and textures sparse. In addition, the application does not particularly limit the upsampling method; for example, bilinear interpolation may be used.
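A minimal sketch of the multi-level average pooling module as described above, assuming bilinear upsampling; the class and variable names and the batch-normalized branch convolutions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelAvgPooling(nn.Module):
    def __init__(self, in_channels=128, branch_channels=32,
                 scales=(64, 32, 16, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d((s, s)),            # pool to s x s
                nn.Conv2d(in_channels, branch_channels, 3,
                          padding=1, bias=False),        # adjust channels to 32
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True))
            for s in scales)

    def forward(self, conv2_out, conv4_out):
        h, w = conv4_out.shape[2:]                       # e.g. 96 x 312
        pooled = [F.interpolate(b(conv4_out), size=(h, w), mode='bilinear',
                                align_corners=False) for b in self.branches]
        # merge global (pooled) and local (conv2_16, conv4_3) context:
        # 4 * 32 + 64 + 128 = 320 channels
        return torch.cat(pooled + [conv2_out, conv4_out], dim=1)

# usage with the feature extractor sketched earlier:
# l2, l4 = FeatureExtraction()(torch.randn(1, 3, 384, 1248))
# fmap = MultiLevelAvgPooling()(l2, l4)   # 1 x 320 x 96 x 312
```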
S400: and calculating a cross-correlation matching cost volume and a cascade matching cost volume of the binocular feature map, and combining the cross-correlation matching cost volume and the cascade matching cost volume to obtain a total matching cost volume.
The cross-correlation matching cost volume is obtained based on feature cross-correlation operation, and the cascade matching cost volume is obtained based on feature cascade.
It can be understood that the matching cost volume is essentially a 4-dimensional matrix: compared with the output binocular feature maps (the first feature map and the second feature map), a disparity dimension is added, because the network ultimately needs to learn the matching degree of the binocular feature maps at different disparity levels, i.e., the disparity d between each pair of homonymous pixels under each feature, in order to generate an accurate disparity map. Assume the maximum disparity of a homonymous point pair of the binocular feature maps is D_max; this is equivalent to enumerating all possible disparity values in the interval [0, D_max]. Let D denote the size of the final disparity dimension. Since the feature map height and width after the multi-level average pooling module are 1/4 of the original input image size and are restored to the original size only by the final upsampling operation, the disparity dimension of the matching cost volume in this module is D/4.
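As a worked example (the concrete value of D here is an assumption for illustration, not fixed by the application): with D = 192 disparity levels enumerated at full resolution, the cost volumes built on the 96×312 feature maps span D/4 = 192/4 = 48 levels, and the full D×384×1248 range is recovered by the final upsampling.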
It can be understood that there are two main ways of constructing a matching cost volume. The first measures feature similarity by full correlation: at each disparity value, the first and second feature maps undergo a full-feature correlation operation. However, since this produces only a single-channel correlation map per disparity level, information is lost. The other way directly concatenates the feature channels of the first and second feature maps at each disparity level, i.e., without any feature similarity information; this way, however, requires more parameters in the subsequent aggregation network, which must learn the similarity measure from scratch.
Therefore, the matching cost volume construction module provided by the application consists of two parts: a cascade matching cost volume formed by concatenating the first and second feature maps after reducing their number of feature channels, and a cross-correlation matching cost volume obtained by group-wise cross-correlation of the first and second feature maps followed by averaging. The two parts are combined as the complete total matching cost volume of the application.
In some embodiments, when constructing the cross-correlation matching cost volume, the feature channels of the first feature map and the second feature map are each divided into several feature groups; at each disparity level, the inner product between the corresponding feature groups of the first and second feature maps is calculated, and the inner products within each feature group are averaged to obtain the cross-correlation matching cost volume.
Illustratively, as shown in fig. 3, the first and second feature maps obtained after multi-level average pooling have size 96×312×320. The 320 feature channels are divided into 40 groups of 8 channels each, corresponding to a size of 96×312×8 per group. At each disparity level the first and second feature maps perform a group-wise inner product, averaging over the 8 feature channels of each group to compress the feature dimension, yielding a matrix of size D/4×96×312×40 (one correlation channel per group) as the cross-correlation matching cost volume.
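The group-wise construction can be sketched as follows, using the PyTorch layout [B, C, D, H, W] with channels first rather than the D/4×H×W×C notation above; the function name and signature are illustrative.

```python
import torch

def groupwise_correlation_volume(left, right, max_disp_q, num_groups=40):
    """left, right: [B, C, H, W] feature maps; max_disp_q: D/4 disparity levels."""
    b, c, h, w = left.shape
    ch_per_group = c // num_groups                     # 320 / 40 = 8 channels
    lg = left.view(b, num_groups, ch_per_group, h, w)
    rg = right.view(b, num_groups, ch_per_group, h, w)
    volume = left.new_zeros(b, num_groups, max_disp_q, h, w)
    for d in range(max_disp_q):
        if d == 0:
            volume[:, :, d] = (lg * rg).mean(dim=2)    # inner product averaged per group
        else:
            # only the overlapping H x (W - d) parts are compared; the rest stays 0
            volume[:, :, d, :, d:] = (lg[..., d:] * rg[..., :-d]).mean(dim=2)
    return volume                                      # [B, 40, D/4, H, W]
```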
In some embodiments, when constructing the cascade matching cost volume, the first feature map and the second feature map are each input into convolution layers that compress the feature channels, and the compressed first and second feature maps are concatenated in the feature dimension to obtain the cascade matching cost volume.
It will be appreciated that the first and second feature maps are concatenated in the feature dimension at each disparity level; but doing only this would require excessive parameters in the subsequent aggregation network, so the feature channels are compressed by two convolutions before concatenation.
Illustratively, as shown in fig. 3, the first and second feature maps first pass through a convolution layer with 128 output channels, giving outputs of size 96×312×128, and then through a convolution layer with 12 output channels, giving matrices of size 96×312×12 as the compressed first and second feature maps. Finally the two are concatenated in the feature dimension and the disparity dimension is added, giving a matrix of size D/4×96×312×24 as the cascade matching cost volume.
After the cross-correlation matching cost volume and the cascade matching cost volume are combined, a matrix of size D/4×96×312×64 is obtained as the total matching cost volume finally constructed by the application.
In addition, "at each disparity level" in the construction process above means: when the disparity is d, for both the group-wise feature cross-correlation inner product and the direct concatenation, only the overlapping H×(W-d) portions of the first and second feature maps are used, and the remainder is filled with 0.
It will be appreciated that in the application the cross-correlation matching cost volume essentially divides the full feature channels into groups and then computes their dot products, i.e., the matching costs, group by group. Grouping reduces parameters, and because each group contains several channels, no information is lost. For the cascade matching cost volume, compressing the feature channels avoids excessive parameters in the subsequent aggregation network. The cross-correlation matching cost volume provides the similarity of feature vector matching, while the cascade matching cost volume provides semantic information as a complement; the two are complementary.
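Under the same assumptions, the cascade matching cost volume and the merged total volume can be sketched as follows; the two compression convolutions (320 to 128 to 12 channels) follow the text, while the kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

# two convolutions compressing the 320 feature channels before concatenation
compress = nn.Sequential(
    nn.Conv2d(320, 128, 3, padding=1, bias=False), nn.ReLU(inplace=True),
    nn.Conv2d(128, 12, 3, padding=1, bias=False))

def concat_volume(left, right, max_disp_q):
    b, c, h, w = left.shape                            # c = 12 after compression
    volume = left.new_zeros(b, 2 * c, max_disp_q, h, w)
    for d in range(max_disp_q):
        if d == 0:
            volume[:, :c, d] = left
            volume[:, c:, d] = right
        else:
            volume[:, :c, d, :, d:] = left[..., d:]
            volume[:, c:, d, :, d:] = right[..., :-d]
    return volume                                      # [B, 24, D/4, H, W]

# total matching cost volume: 40 correlation channels + 24 concatenated = 64
# gwc = groupwise_correlation_volume(l_feat, r_feat, max_disp_q=48)
# cat = concat_volume(compress(l_feat), compress(r_feat), max_disp_q=48)
# total = torch.cat([gwc, cat], dim=1)                 # [B, 64, D/4, H, W]
```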
S500: and encoding and decoding the total matching cost volume to obtain an aggregate feature map of various network depths.
The three-dimensional convolution aggregation module provided by the application aggregates feature information along the disparity and spatial dimensions through three-dimensional convolution. The module comprises a convolution layer with a residual structure and three stacked encoding-decoding networks with the same structure. The residual-structure convolution layer comprises two three-dimensional convolution layers, namely a first and a second three-dimensional convolution layer. Each encoding-decoding network comprises two encoding layers and two decoding layers, namely a first encoding layer, a second encoding layer, a first decoding layer and a second decoding layer. The first and second encoding layers double the number of feature channels and halve the height and width of the feature map; the first and second decoding layers are transposed convolutions that halve the number of feature channels and double the height and width of the feature map. The encoding-decoding structure is thus a fine-to-coarse-to-fine process with intermediate supervision.
After the total matching cost volume is obtained, it is passed sequentially through the two three-dimensional convolution layers, and the output of the second three-dimensional convolution layer is combined with the output of the first three-dimensional convolution layer to obtain a first aggregate feature map. The first aggregate feature map is then passed sequentially through the three encoding-decoding networks; in each network, the output of the first decoding layer is combined with the output of the first encoding layer, and the output of the second decoding layer is combined with the output of the second three-dimensional convolution layer, yielding the second, third and fourth aggregate feature maps output by the three networks respectively. Four aggregate feature maps of different network depths are thus obtained.
Illustratively, as shown in fig. 4, the total matching cost volume of size D/4×96×312×64 constructed by the matching cost volume construction module is input into the three-dimensional convolution aggregation module for parameter learning. It first passes through the convolution layer with the residual structure, i.e., the first three-dimensional convolution layer 3Dconv0a and the second three-dimensional convolution layer 3Dconv0b, each built from 3×3×3 convolution kernels. 3Dconv0a reduces the feature dimension to 32, and the output of 3Dconv0b is combined with the output of 3Dconv0a to form the residual structure, giving a first aggregate feature map of size D/4×96×312×32.
It then passes through the three structurally identical encoding-decoding networks, each comprising two encoding layers and two decoding layers. The first encoding layer comprises a third three-dimensional convolution layer 3Dconv1a and a fourth three-dimensional convolution layer 3Dconv1b with 3×3×3 kernels; 3Dconv1a doubles the number of feature channels and halves the height and width, outputting a feature map of size D/8×48×156×64. The second encoding layer comprises a fifth three-dimensional convolution layer 3Dconv2a and a sixth three-dimensional convolution layer 3Dconv2b; 3Dconv2a again doubles the number of feature channels and halves the height and width, outputting a feature map of size D/16×24×78×128. The first decoding layer 3Ddeconv1 is a 3×3×3 transposed convolution that upsamples, halving the number of feature channels and doubling the height and width; its output is combined with the output of the fourth three-dimensional convolution layer 3Dconv1b in the first encoding layer, giving a feature map of size D/8×48×156×64. The second decoding layer 3Ddeconv2 is likewise a 3×3×3 transposed convolution that upsamples, halving the number of feature channels and doubling the height and width; its output is combined with the output of the second three-dimensional convolution layer 3Dconv0b in the dimension-reduction part, giving a feature map of size D/4×96×312×32. Three aggregate feature maps from networks of different depths, all of size D/4×96×312×32, are thus obtained through the three encoding-decoding networks.
In this embodiment, the three-dimensional convolution aggregation module uses residual connections repeatedly. On the one hand this exploits the ability of residual networks to avoid vanishing gradients; on the other hand, because the module performs repeated down- and up-sampling, the deep outputs in the encoding-decoding networks are combined with the shallow outputs to avoid losing the detail information of the feature maps, merging deep and shallow feature information. That is, the module iteratively processes the matching cost volume, exploiting as much semantic information as possible at a global, multi-scale level.
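A condensed sketch of such an aggregation module follows. The skip "cascades" are implemented here as element-wise additions, since the reported output sizes keep the channel counts unchanged; this reading, the mapping of layer names, and the normalization choices are assumptions.

```python
import torch
import torch.nn as nn

def conv3d(in_c, out_c, stride=1):
    return nn.Sequential(
        nn.Conv3d(in_c, out_c, 3, stride, padding=1, bias=False),
        nn.BatchNorm3d(out_c), nn.ReLU(inplace=True))

class Hourglass(nn.Module):
    """One encoding-decoding network: channels double / halve, sizes halve / double."""
    def __init__(self, c=32):
        super().__init__()
        self.enc1 = nn.Sequential(conv3d(c, 2 * c, stride=2), conv3d(2 * c, 2 * c))      # 3Dconv1a/1b
        self.enc2 = nn.Sequential(conv3d(2 * c, 4 * c, stride=2), conv3d(4 * c, 4 * c))  # 3Dconv2a/2b
        self.dec1 = nn.ConvTranspose3d(4 * c, 2 * c, 3, stride=2, padding=1,
                                       output_padding=1, bias=False)                     # 3Ddeconv1
        self.dec2 = nn.ConvTranspose3d(2 * c, c, 3, stride=2, padding=1,
                                       output_padding=1, bias=False)                     # 3Ddeconv2

    def forward(self, x, skip_shallow):
        e1 = self.enc1(x)                                # D/8 x 48 x 156, 64 channels
        e2 = self.enc2(e1)                               # D/16 x 24 x 78, 128 channels
        d1 = torch.relu(self.dec1(e2) + e1)              # skip from the first encoding layer
        return torch.relu(self.dec2(d1) + skip_shallow)  # skip from the 3Dconv0b output

class Aggregation(nn.Module):
    def __init__(self, in_c=64, c=32):
        super().__init__()
        self.conv0a = conv3d(in_c, c)                    # 3Dconv0a: reduce features to 32
        self.conv0b = conv3d(c, c)                       # 3Dconv0b
        self.hourglasses = nn.ModuleList(Hourglass(c) for _ in range(3))

    def forward(self, cost):                             # cost: [B, 64, D/4, H, W]
        a0 = self.conv0a(cost)
        x = self.conv0b(a0) + a0                         # residual structure
        outs = [x]                                       # first aggregate feature map
        for hg in self.hourglasses:
            x = hg(x, outs[0])
            outs.append(x)
        return outs                                      # four maps, all [B, 32, D/4, H, W]
```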
S600: and generating a disparity map according to the aggregate feature maps of the plurality of network depths.
After the aggregate feature maps are obtained, their feature channels are merged, the merged aggregate feature maps are upsampled so that the disparity dimension and feature map size are restored to the image size of the binocular thermal imaging image, and the aggregate feature maps are regressed with a preset activation function to obtain the disparity map.
For example, in the three-dimensional convolution aggregation module above, the dimension-reduction part and the three encoding-decoding networks each output an aggregate feature map of size D/4×96×312×32. These serve as the inputs of the disparity map generation module and represent different network depths entering the computation of the disparity maps and the loss function. First, each aggregate feature map passes through two three-dimensional convolution layers with 3×3×3 kernels that merge the number of feature channels down to 1, giving outputs of size D/4×96×312×1. The aggregate feature maps are then upsampled, restoring the disparity dimension and feature map size to the original size of the binocular thermal imaging image, giving aggregate feature maps of size D×384×1248×1. A softmax function converts the disparity dimension into a probability matrix, and a soft-argmin function computes the disparity estimate for each pixel, producing four disparity maps of size 384×1248×1.
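The regression step can be sketched as follows, assuming trilinear upsampling and the usual soft-argmin formulation (softmax over negated costs followed by an expectation over disparity values); the maximum disparity of 192 is illustrative.

```python
import torch
import torch.nn.functional as F

def regress_disparity(agg, full_size=(384, 1248), max_disp=192):
    """agg: [B, 1, D/4, H/4, W/4] aggregate map after merging channels to 1."""
    # restore the disparity dimension and feature map size to the original image size
    up = F.interpolate(agg, size=(max_disp, *full_size),
                       mode='trilinear', align_corners=False).squeeze(1)  # [B, D, H, W]
    prob = F.softmax(-up, dim=1)                    # cost -> probability (soft argmin)
    disp_values = torch.arange(max_disp, dtype=up.dtype,
                               device=up.device).view(1, max_disp, 1, 1)
    return (prob * disp_values).sum(dim=1)          # [B, H, W] expected disparity
```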
In addition, the loss function of the whole network is obtained by computing the smooth L1 loss between each of the four disparity maps and the true disparity map and taking a weighted sum. The module computes the loss as a weighted combination of disparity maps output at four different network depths, fully integrating the context information of a multi-level network. Meanwhile, compared with the L1 loss, the smooth L1 loss smooths the gradient near 0, and compared with the L2 loss it is less sensitive to outliers.
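A sketch of the weighted loss over the four disparity maps; the weights shown are illustrative assumptions, since the application does not list concrete values.

```python
import torch.nn.functional as F

def total_loss(disp_preds, disp_gt, weights=(0.5, 0.5, 0.7, 1.0)):
    """disp_preds: four [B, H, W] disparity maps from different network depths."""
    return sum(w * F.smooth_l1_loss(pred, disp_gt)
               for w, pred in zip(weights, disp_preds))
```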
In this embodiment, after the binocular thermal imaging image is obtained, its feature information is extracted by the feature extraction module and the feature dimension is expanded; the multi-level average pooling module then obtains global context information of the feature map at multiple scales; a group-wise feature cross-correlation matching cost volume is built by computing inner products over the feature maps and averaging within groups along the feature dimension to obtain additional group-wise feature correlation information, which is concatenated in the feature dimension with the matching cost volume formed by directly concatenating the feature maps, producing the total matching cost volume; the three-dimensional convolution aggregation module encodes and decodes the matching cost volume, combining feature maps of different network depths to avoid losing detailed feature information; finally the feature dimensions are merged, the maps are upsampled back to the input image size, a function regression over the disparity dimension yields the disparity maps, the loss function is computed as a weighted sum, and gradients are propagated back to train the model.
Based on the binocular disparity estimation method described above, the application also provides a thermal imaging image binocular disparity estimation system based on deep learning, comprising an image acquisition module, a feature extraction module, a multi-level average pooling module, a matching cost volume construction module, a three-dimensional convolution aggregation module and a disparity map generation module.
The image acquisition module is used for acquiring a binocular thermal imaging image, and the binocular thermal imaging image comprises a first image and a second image.
And the feature extraction module is used for extracting an initial feature map of the binocular thermal imaging image.
And the multi-level average pooling module is used for acquiring a scale feature map of the initial feature map according to a preset scale, and combining the scale feature map and the initial feature map to obtain a binocular feature map, wherein the binocular feature map comprises a first feature map and a second feature map.
And the matching cost volume construction module is used for constructing the cross-correlation matching cost volume and the cascade matching cost volume of the binocular feature map, and combining the cross-correlation matching cost volume and the cascade matching cost volume to obtain a total matching cost volume. The cross-correlation matching cost volume is obtained based on feature cross-correlation operation, and the cascade matching cost volume is obtained based on feature cascade.
And the three-dimensional convolution aggregation module is used for encoding and decoding the total matching cost volume so as to obtain an aggregation feature map with various network depths.
And the disparity map generation module is used for generating a disparity map according to the aggregate feature maps of various network depths.
According to the technical scheme, the thermal imaging image binocular parallax estimation method and system based on deep learning can, after the binocular thermal imaging image is acquired, extract an initial feature map of the binocular thermal imaging image, acquire a scale feature map of the initial feature map according to a preset scale, and combine the scale feature map and the initial feature map to obtain the binocular feature map. A cross-correlation matching cost volume and a cascade matching cost volume of the binocular feature map are then constructed and combined to obtain a total matching cost volume; the total matching cost volume is encoded and decoded to obtain aggregate feature maps of several network depths, and a disparity map is generated from the aggregate feature maps of the several network depths. The method can effectively extract feature information from thermal imaging images taken from different viewing angles to generate a disparity map, and improves the accuracy and stability of acquiring depth information from thermal imaging images.
The detailed description provided above presents merely a few examples under the general inventive concept and does not limit the scope of protection of the application. Any other embodiments extended from the solution of the application without inventive effort by a person skilled in the art fall within its scope of protection.
Claims (10)
1. A thermal imaging image binocular parallax estimation method based on deep learning, characterized by comprising the following steps:
acquiring a binocular thermal imaging image, wherein the binocular thermal imaging image comprises a first image and a second image;
extracting an initial feature map of the binocular thermal imaging image;
obtaining a scale feature map of the initial feature map according to a preset scale, and combining the scale feature map and the initial feature map to obtain a binocular feature map, wherein the binocular feature map comprises a first feature map and a second feature map;
constructing a cross-correlation matching cost volume and a cascade matching cost volume of the binocular feature map, and combining the cross-correlation matching cost volume and the cascade matching cost volume to obtain a total matching cost volume, wherein the cross-correlation matching cost volume is obtained based on feature cross-correlation operation, and the cascade matching cost volume is obtained based on feature cascade;
coding and decoding the total matching cost volume to obtain an aggregation feature map of various network depths;
and generating a disparity map according to the aggregate feature maps of the plurality of network depths.
2. The deep learning based thermal imaging image binocular disparity estimation method according to claim 1, further comprising:
performing epipolar registration on the first image and the second image so that homonymous points on the first image and the second image lie on the same horizontal line;
and carrying out feature extraction on the first image and the second image after polar line registration to generate an initial feature map.
3. The deep learning based binocular disparity estimation method of a thermal imaging image according to claim 1, wherein the step of extracting an initial feature map of the binocular thermal imaging image comprises:
inputting the binocular thermal imaging image into a plurality of layers of convolution kernels with preset sizes;
and sequentially inputting the feature maps output by the multiple layers of preset-size convolution kernels into several groups of residual blocks to obtain an initial feature map, wherein the groups of residual blocks comprise residual blocks for deepening the depth of the feature extraction network and residual blocks for enlarging the receptive field by applying dilated convolution.
4. The deep learning-based thermal imaging image binocular disparity estimation method according to claim 1, wherein the step of acquiring the scale feature map of the initial feature map according to a preset scale comprises:
inputting the initial feature map into an average pooling layer with a preset scale;
inputting the characteristic diagram output by the average pooling layer into a convolution layer with a preset size, wherein the convolution layer is used for adjusting the number of characteristic channels;
and restoring the size of the feature map output by the convolution layer to the feature map size of the initial feature map to obtain a scale feature map.
5. A thermal imaging image binocular disparity estimation method based on deep learning according to claim 3, wherein the step of merging the scale feature map with the initial feature map comprises:
obtaining an output image of the residual blocks for deepening the depth of the feature extraction network, and obtaining an output image of the residual blocks for applying dilated convolution to enlarge the receptive field;
and combining the output image with the scale feature map to obtain a binocular feature map.
6. The deep learning based thermal imaging image binocular disparity estimation method according to claim 1, wherein the step of constructing a cross-correlation matching cost volume of the binocular feature map comprises:
dividing the feature channels of the first feature map and the second feature map into a plurality of feature groups respectively;
and calculating, at each disparity level, the inner products between the corresponding feature groups of the first feature map and the second feature map, and averaging the inner products within each feature group to obtain a cross-correlation matching cost volume.
7. The deep learning based thermal imaging image binocular disparity estimation method according to claim 1, wherein the step of constructing a cascade matching cost volume of the binocular feature map comprises:
inputting the first feature map into a convolution layer, and inputting the second feature map into a convolution layer, the convolution layer being used to compress feature channels;
and cascading the first feature map and the second feature map after compressing the feature channels in the feature dimension to obtain a cascade matching cost volume.
8. The deep learning-based thermal imaging image binocular disparity estimation method according to claim 1, wherein the step of encoding and decoding the total matching cost volume to obtain an aggregate feature map of a plurality of network depths comprises:
sequentially inputting the total matching cost roll into two three-dimensional convolution layers, and cascading the output of the second three-dimensional convolution layer with the output of the first three-dimensional convolution layer to obtain a first aggregation feature map;
sequentially inputting the first aggregation feature map into a three-layer coding and decoding network to obtain a second aggregation feature map, a third aggregation feature map and a fourth aggregation feature map respectively output by the three coding and decoding networks, wherein each coding and decoding network comprises a first coding layer, a second coding layer, a first decoding layer and a second decoding layer; the first coding layer and the second coding layer are used for doubling the number of feature channels and halving the height and width of the feature map; the first decoding layer and the second decoding layer are transposed convolutions used for halving the number of feature channels and doubling the height and width of the feature map; the output of the first decoding layer is cascaded with the output of the first coding layer, and the output of the second decoding layer is cascaded with the output of the second three-dimensional convolution layer.
9. The deep learning based thermal imaging image binocular disparity estimation method according to claim 1, wherein the step of generating a disparity map from the aggregate feature maps of the plurality of network depths comprises:
combining feature channels of the aggregate feature map;
upsampling the aggregated feature map after merging the feature channels to restore the disparity dimension and feature map size of the aggregated feature map to the image size of the binocular thermal imaging image;
and regressing the aggregate feature map based on a preset activation function to obtain the disparity map.
10. A thermal imaging image binocular disparity estimation system based on deep learning, comprising:
the image acquisition module is used for acquiring a binocular thermal imaging image, wherein the binocular thermal imaging image comprises a first image and a second image;
the feature extraction module is used for extracting an initial feature map of the binocular thermal imaging image;
the multi-level average pooling module is used for acquiring a scale feature map of the initial feature map according to a preset scale, and combining the scale feature map with the initial feature map to obtain a binocular feature map, wherein the binocular feature map comprises a first feature map and a second feature map;
the matching cost volume construction module is used for constructing a cross-correlation matching cost volume and a cascade matching cost volume of the binocular feature map, combining the cross-correlation matching cost volume and the cascade matching cost volume to obtain a total matching cost volume, wherein the cross-correlation matching cost volume is obtained based on feature cross-correlation operation, and the cascade matching cost volume is obtained based on feature cascade;
the three-dimensional convolution aggregation module is used for encoding and decoding the total matching cost volume so as to obtain an aggregation feature map with various network depths;
and the disparity map generating module is used for generating a disparity map according to the aggregation feature maps of the plurality of network depths.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310913387.9A CN116630388A (en) | 2023-07-25 | 2023-07-25 | Thermal imaging image binocular parallax estimation method and system based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310913387.9A CN116630388A (en) | 2023-07-25 | 2023-07-25 | Thermal imaging image binocular parallax estimation method and system based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116630388A true CN116630388A (en) | 2023-08-22 |
Family
ID=87603071
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310913387.9A Pending CN116630388A (en) | 2023-07-25 | 2023-07-25 | Thermal imaging image binocular parallax estimation method and system based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116630388A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101908230A (en) * | 2010-07-23 | 2010-12-08 | 东南大学 | Regional depth edge detection and binocular stereo matching-based three-dimensional reconstruction method |
CN105698767A (en) * | 2015-12-30 | 2016-06-22 | 哈尔滨工业大学深圳研究生院 | Underwater measuring method based on vision |
CN113592026A (en) * | 2021-08-13 | 2021-11-02 | 大连大学 | Binocular vision stereo matching method based on void volume and cascade cost volume |
CN114617527A (en) * | 2022-03-15 | 2022-06-14 | 商丘市第一人民医院 | Laparoscope three-dimensional imaging method and system |
Non-Patent Citations (2)
Title |
---|
JIA-REN CHANG et al.: "Pyramid Stereo Matching Network", arXiv:1803.08669v1, pages 1-9 * |
XIAOYANG GUO et al.: "Group-wise Correlation Stereo Network", arXiv:1903.04025v1, pages 1-10 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117078984A (en) * | 2023-10-17 | 2023-11-17 | 腾讯科技(深圳)有限公司 | Binocular image processing method and device, electronic equipment and storage medium |
CN117078984B (en) * | 2023-10-17 | 2024-02-02 | 腾讯科技(深圳)有限公司 | Binocular image processing method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20230822