
CN116957999A - Depth map optimization method, device, equipment and storage medium - Google Patents

Depth map optimization method, device, equipment and storage medium

Info

Publication number
CN116957999A
CN116957999A
Authority
CN
China
Prior art keywords
image
feature
depth
local neighborhood
pixel point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310460332.7A
Other languages
Chinese (zh)
Inventor
张喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310460332.7A
Publication of CN116957999A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 7/33 - Determination of transform parameters for the alignment of images (image registration) using feature-based methods
    • G06T 7/55 - Depth or shape recovery from multiple images
    • G06T 2207/10024 - Color image
    • G06T 2207/10028 - Range image; Depth image; 3D point clouds
    • G06T 2207/20221 - Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a depth map optimization method, device, equipment and storage medium, belonging to the technical field of image processing. The method comprises the following steps: extracting features of a first image and a second image corresponding to the first image to obtain a first original depth image feature and a second original depth image feature; respectively performing non-local neighborhood matching on the first image and the second image to obtain a first non-local neighborhood sensing feature and a second non-local neighborhood sensing feature; performing feature optimization on the first original depth image feature by using the first non-local neighborhood sensing feature, and performing feature optimization on the second original depth image feature by using the second non-local neighborhood sensing feature, to obtain a first optimized depth image feature and a second optimized depth image feature; and generating a depth map corresponding to the first image based on the first optimized depth image feature and the second optimized depth image feature.

Description

Depth map optimization method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of image processing, in particular to a depth map optimization method, a depth map optimization device, depth map optimization equipment and a storage medium.
Background
Multi-View Stereo (MVS) algorithms aim to reconstruct a three-dimensional model of a scene or object from a series of captured unordered images.
In deep-learning-based multi-view geometric depth estimation, the related art uses a convolutional neural network with a fixed convolution kernel size to extract features from the original images, and then performs depth estimation and three-dimensional reconstruction based on the extracted image features.
When multiple objects are present in the same scene, the depth of the neighborhood pixels around a given pixel is often strongly correlated with that pixel. However, the depth image features extracted with a fixed convolution kernel generally only consider the influence of the pixels immediately adjacent to each pixel, so severe erroneous depth fusion can occur at foreground-background boundaries, edge areas and the like, making the three-dimensional reconstruction result inaccurate.
Disclosure of Invention
The embodiment of the application provides a depth map optimization method, a device, equipment and a storage medium, which can improve the accuracy of multi-view depth estimation. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides a depth map optimization method, where the method includes:
Extracting features of a first image and a second image corresponding to the first image to obtain a first original depth image feature corresponding to the first image and a second original depth image feature corresponding to the second image, wherein the first image and the second image are images obtained by shooting the same three-dimensional reconstruction object under different view angles;
respectively carrying out non-local neighborhood matching on the first image and the second image to obtain a first non-local neighborhood sensing characteristic corresponding to each pixel point in the first image and a second non-local neighborhood sensing characteristic corresponding to each pixel point in the second image, wherein the non-local neighborhood sensing characteristic comprises pixel point coordinates of at least two non-local neighborhood pixel points matched with the current pixel point in a non-local neighborhood range and influence weights, the influence weights represent the characteristic influence degree of the non-local neighborhood pixel points on the current pixel point, and the non-local neighborhood range is larger than the adjacent pixel point range of the current pixel point;
performing feature optimization on the first original depth image feature by using the first non-local neighborhood sensing feature, and performing feature optimization on the second original depth image feature by using the second non-local neighborhood sensing feature to obtain a first optimized depth image feature and a second optimized depth image feature;
And generating a depth map corresponding to the first image based on the first optimized depth image feature and the second optimized depth image feature.
In another aspect, an embodiment of the present application provides a depth map optimization apparatus, where the apparatus includes:
the feature extraction module is used for carrying out feature extraction on a first image and a second image corresponding to the first image to obtain a first original depth image feature corresponding to the first image and a second original depth image feature corresponding to the second image, wherein the first image and the second image are images obtained by shooting the same three-dimensional reconstruction object under different view angles;
the neighborhood matching module is used for carrying out non-local neighborhood matching on the first image and the second image respectively to obtain a first non-local neighborhood sensing characteristic corresponding to each pixel point in the first image and a second non-local neighborhood sensing characteristic corresponding to each pixel point in the second image, wherein the non-local neighborhood sensing characteristic comprises pixel point coordinates of at least two non-local neighborhood pixel points matched with the current pixel point in a non-local neighborhood range and influence weights, and the influence weights represent the characteristic influence degree of the non-local neighborhood pixel points on the current pixel point, and the non-local neighborhood range is larger than the adjacent pixel point range of the current pixel point;
The feature optimization module is used for carrying out feature optimization on the first original depth image feature by utilizing the first non-local neighborhood sensing feature, and carrying out feature optimization on the second original depth image feature by utilizing the second non-local neighborhood sensing feature to obtain a first optimized depth image feature and a second optimized depth image feature;
and the first image generation module is used for generating a depth map corresponding to the first image based on the first optimized depth image feature and the second optimized depth image feature.
In another aspect, embodiments of the present application provide a computer device including a processor and a memory having at least one instruction stored therein, the at least one instruction being loaded and executed by the processor to implement a depth map optimization method as described in the above aspects.
In another aspect, embodiments of the present application provide a computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement a depth map optimization method as described in the above aspects.
In another aspect, embodiments of the present application provide a computer program product comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the depth map optimization method provided in the above aspect.
In the embodiment of the application, after feature extraction is performed on the first image and the second image corresponding to the first image to obtain the first original depth image feature and the second original depth image feature, the depth map corresponding to the first image is not generated directly from these original features. Instead, non-local neighborhood matching is first performed on the first image and the second image to obtain the first non-local neighborhood sensing feature and the second non-local neighborhood sensing feature; the first original depth image feature is feature-optimized with the first non-local neighborhood sensing feature, and the second original depth image feature is feature-optimized with the second non-local neighborhood sensing feature, yielding the first optimized depth image feature and the second optimized depth image feature. The depth map corresponding to the first image is then generated based on the first and second optimized depth image features, which improves the optimization efficiency of the depth map for three-dimensional reconstruction and the accuracy of multi-view depth estimation.
Drawings
FIG. 1 illustrates a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 2 illustrates a flow chart of a depth map optimization method provided by an exemplary embodiment of the present application;
FIG. 3 illustrates an image edge pixel schematic provided by an exemplary embodiment of the present application;
FIG. 4 illustrates a flow chart of a depth map optimization method provided by an exemplary embodiment of the present application;
FIG. 5 illustrates a non-local neighborhood matching flow chart provided by an exemplary embodiment of the present application;
FIG. 6 illustrates a three-dimensional non-local neighborhood matching schematic provided by an exemplary embodiment of the present application;
FIG. 7 illustrates a flow chart of a depth map optimization method provided by an exemplary embodiment of the present application;
FIG. 8 illustrates a flow chart of a depth map optimization method provided by an exemplary embodiment of the present application;
FIG. 9 illustrates a schematic diagram of depth map optimization using image edge information provided by an exemplary embodiment of the present application;
FIG. 10 illustrates a flow chart of a depth map optimization method provided by an exemplary embodiment of the present application;
FIG. 11 illustrates depth estimation results and three-dimensional reconstruction results for scene number 11 and scene number 13 on a DTU dataset provided by an exemplary embodiment of the application;
FIG. 12 illustrates a block diagram of a depth map optimizing apparatus according to an exemplary embodiment of the present application;
Fig. 13 is a schematic diagram showing a structure of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason and make decisions.
Artificial intelligence is a comprehensive discipline that spans a wide range of fields, covering both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is the science of how to make machines "see": it uses cameras and computers in place of human eyes to identify and measure targets, and performs further graphic processing so that the processed images are better suited for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
With research and advancement of artificial intelligence technology, it is being researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, autonomous driving, unmanned aerial vehicles, robots, smart healthcare and smart customer service. It is believed that with the development of technology, artificial intelligence will be applied in more fields and play an increasingly important role.
The scheme provided by the embodiment of the application relates to the technologies such as computer vision technology of artificial intelligence, and the like, and is specifically described by the following embodiment.
Referring to FIG. 1, a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application is shown. The implementation environment includes a terminal 120 and a server 140. The data communication between the terminal 120 and the server 140 is performed through a communication network, alternatively, the communication network may be a wired network or a wireless network, and the communication network may be at least one of a local area network, a metropolitan area network, and a wide area network.
The terminal 120 is an electronic device in which an application program having a depth map generation function is installed. The depth map generating function may be a function of an original application in the terminal, or a function of a third party application; the electronic device may be a smart phone, a tablet computer, a personal computer, a wearable device, a vehicle-mounted terminal, or the like, and in fig. 1, the terminal 120 is taken as an example of a personal computer, but the present application is not limited thereto.
The server 140 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), basic cloud computing services such as big data and artificial intelligence platforms, and the like. In the embodiment of the present application, the server 140 may be a background server of an application having a depth map generating function.
In one possible implementation, as shown in fig. 1, there is data interaction between the server 140 and the terminal 120. After determining the first image and the second image corresponding to the first image from the unordered image group corresponding to the three-dimensional reconstruction object, the terminal 120 sends the first image and the second image to the server 140. The server 140 performs feature extraction on the first image and the second image to obtain a first original depth image feature and a second original depth image feature, and performs non-local neighborhood matching on the first image and the second image to obtain a first non-local neighborhood sensing feature and a second non-local neighborhood sensing feature. The server 140 then performs feature optimization on the first original depth image feature using the first non-local neighborhood sensing feature, and on the second original depth image feature using the second non-local neighborhood sensing feature, to obtain a first optimized depth image feature and a second optimized depth image feature, generates a depth map corresponding to the first image based on them, and sends the depth map to the terminal 120.
Referring to fig. 2, a flowchart of a depth map optimization method according to an exemplary embodiment of the present application is shown, where the method is used for a computer device (including a terminal 120 and/or a server 140) as an example, and the method includes the following steps:
step 201, extracting features of the first image and a second image corresponding to the first image to obtain a first original depth image feature corresponding to the first image and a second original depth image feature corresponding to the second image, where the first image and the second image are images obtained by photographing the same three-dimensional reconstruction object under different viewing angles.
In the process of performing three-dimensional reconstruction on a target object based on depth estimation, in order to improve the integrity and accuracy of a three-dimensional reconstruction result, multi-view shooting is generally required to be performed on the target object to obtain a plurality of images under different view angles, so that the three-dimensional reconstruction on the target object is completed according to depth information corresponding to the images under different view angles.
In order to reconstruct a target object in three dimensions, a computer device first acquires a set of images of the target object taken from multiple perspectives; optionally, the set is an unordered image group. In one possible implementation, to determine the depth map corresponding to the first image in the unordered image group, the computer device may determine, from the unordered image group, a second image paired with the first image, where the matching degree between the second image and the first image is higher than that of the other images in the unordered image group. Optionally, the matching degree between images may be expressed as the matching degree between the spatial points to which the images correspond.
In one possible implementation, in order to determine the depth value corresponding to each pixel in the first image more accurately, the computer device may select at least two second images from the unordered image group, so as to determine the depth value corresponding to each pixel in the first image by using the second images with multiple different perspectives.
Optionally, the first image may also be referred to as the reference image, denoted I_0; the second image may also be referred to as the source image, denoted I_i.
In one possible implementation manner, the computer device performs feature extraction on the first image and at least two second images respectively, so as to obtain a first original depth image feature corresponding to the first image and a second original depth image feature corresponding to the second image. Optionally, the image dimensions of the first image and the second image are h×w×3 dimensions, that is, the first image and the second image may be RGB three-channel images with height H and width W, and the image dimensions of the first original depth image feature obtained by feature extraction and the second original depth image feature are h×w×16 dimensions, that is, feature images with height H and width W and 16 channels.
Optionally, the computer device may perform feature extraction on the first image and the second image through a convolutional neural network (Convolutional Neural Network, CNN), and may also perform feature extraction on the first image and the second image through a feature pyramid network (Feature Pyramid Networks, FPN).
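For illustration, the following is a minimal PyTorch-style sketch of this step (the framework choice and layer layout are assumptions for illustration, not the patented network), extracting H × W × 16 depth image features from an H × W × 3 RGB image:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Toy CNN extractor: (B, 3, H, W) RGB image -> (B, 16, H, W) features."""
    def __init__(self, out_channels: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.net(image)

extractor = FeatureExtractor()
first_image = torch.rand(1, 3, 480, 640)    # reference image I_0
second_image = torch.rand(1, 3, 480, 640)   # one second (source) image
f0 = extractor(first_image)                 # first original depth image feature
f1 = extractor(second_image)                # second original depth image feature
```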
Step 202, performing non-local neighborhood matching on the first image and the second image respectively to obtain a first non-local neighborhood sensing feature corresponding to each pixel point in the first image and a second non-local neighborhood sensing feature corresponding to each pixel point in the second image, wherein the non-local neighborhood sensing feature comprises pixel point coordinates and influence weights of at least two non-local neighborhood pixel points matched with the current pixel point in a non-local neighborhood range, and the influence weights represent the feature influence degree of the non-local neighborhood pixel point on the current pixel point, and the non-local neighborhood range is larger than the adjacent pixel point range of the current pixel point.
In the process of extracting features for the current pixel through a feature extraction network, the network generally only considers the feature influence of the surrounding pixels adjacent to the current pixel, i.e., it extracts only isotropic depth image features and cannot extract anisotropic ones. However, for pixels at the foreground-background boundary or in edge areas, considering only the feature influence of adjacent surrounding pixels often reduces the accuracy of depth estimation for those pixels. For example, for a pixel at the foreground-background boundary that belongs to the foreground, the pixel depths of the foreground object positively guide its depth estimate, while the pixel depths of other objects in the scene and of the background should have no influence on it.
Illustratively, as shown in fig. 3, there is a first pixel point 302 located at a foreground-background junction on the foreground object 301, and the computer device may obtain a non-local neighborhood pixel point corresponding to the first pixel point 302 by performing adaptive non-local neighborhood matching on the first pixel point 302, where the non-local neighborhood pixel point includes, but is not limited to, a pixel point adjacent to the first pixel point 302.
In the embodiment of the application, before generating a depth map based on the depth image features, the computer equipment can perform feature optimization on the original depth image features so as to improve the accuracy of depth estimation of each pixel point.
In one possible implementation manner, in order to improve the anisotropy of feature extraction, the computer device performs non-local neighborhood matching on the first image and the second image, so as to obtain a first non-local neighborhood sensing feature corresponding to each pixel point in the first image and a second non-local neighborhood sensing feature corresponding to each pixel point in the second image. Optionally, the non-local neighborhood matching process may be adaptive, that is, the computer device may automatically adjust the neighborhood matching method, sequence, etc. according to the image features of each pixel, so that different pixels correspond to different non-local neighborhood sensing features.
Optionally, the computer device performs adaptive non-local neighborhood matching on the current pixel in a non-local neighborhood range corresponding to the current pixel, where the non-local neighborhood range is greater than a neighboring pixel range of the current pixel, that is, the non-local neighborhood pixel obtained through non-local neighborhood matching includes, but is not limited to, a pixel surrounding the current pixel and neighboring the current pixel in the image. For example, a pixel point located in the center of the foreground object may be a neighboring pixel point around the pixel point obtained by performing non-local neighborhood matching, and a pixel point located at the edge of the foreground object may be a pixel point in the image, where one side of the foreground object is matched with the pixel point, obtained by performing non-local neighborhood matching.
Optionally, the non-local neighborhood aware feature includes pixel point coordinates and impact weights of at least two non-local neighborhood pixel points within the non-local neighborhood range that match the current pixel point. The pixel point coordinates of the non-local neighborhood pixel point can be expressed as an offset between the non-local neighborhood pixel point and the current pixel point, and the influence weight represents the characteristic influence degree of the non-local neighborhood pixel point on the current pixel point.
And 203, performing feature optimization on the first original depth image feature by using the first non-local neighborhood sensing feature, and performing feature optimization on the second original depth image feature by using the second non-local neighborhood sensing feature to obtain a first optimized depth image feature and a second optimized depth image feature.
Further, after obtaining a first non-local neighborhood sensing feature corresponding to the first image and a second non-local neighborhood sensing feature corresponding to the second image, the computer device performs feature optimization on the first original depth image feature by using the first non-local neighborhood sensing feature, and performs feature optimization on the second original depth image feature by using the second non-local neighborhood sensing feature, thereby obtaining a first optimized depth image feature and a second optimized depth image feature.
In one possible implementation, the computer device adds the original depth image features corresponding to each pixel point to the non-local neighborhood aware features, thereby obtaining optimized depth image features corresponding to each pixel point, where the adding process refers to an adding process between elements at a pixel level.
In one possible implementation manner, in order to improve the efficiency of feature optimization, after obtaining the first non-local neighborhood perceptual feature and the second non-local neighborhood perceptual feature, the computer device may further perform image dimension matching on the non-local neighborhood perceptual feature and the original depth image feature, so that the non-local neighborhood perceptual feature and the original depth image feature remain consistent in image dimension, and then perform feature optimization.
Step 204, generating a depth map corresponding to the first image based on the first optimized depth image feature and the second optimized depth image feature.
In one possible implementation manner, the computer device estimates the depth value of each pixel point on the first image according to the first optimized depth image feature corresponding to the first image and the second optimized depth image features corresponding to the at least two second images, so as to generate a depth map corresponding to the first image.
In one possible implementation manner, the computer device sequentially performs depth estimation on each image in the unordered image group corresponding to the three-dimensional reconstruction object through the method, so as to obtain a depth map corresponding to each image, and further performs three-dimensional reconstruction based on multiple depth maps, so as to obtain a three-dimensional reconstruction result.
In summary, in the embodiment of the present application, after the first image and the second image corresponding to the first image are extracted to obtain the first original depth image feature and the second original depth image feature, the depth map corresponding to the first image is not generated directly based on the first original depth image feature and the second original depth image feature, but the first image and the second image are first subjected to non-local neighborhood matching to obtain the first non-local neighborhood sensing feature and the second non-local neighborhood sensing feature, the first original depth image feature is respectively subjected to feature optimization by the first non-local neighborhood sensing feature, and the second original depth image feature is subjected to feature optimization by the second non-local neighborhood sensing feature to obtain the first optimized depth image feature and the second optimized depth image feature, so that the depth map corresponding to the first image is generated based on the first optimized depth image feature and the second optimized depth image feature, the optimization efficiency of the depth map for three-dimensional reconstruction is improved, and the accuracy of multi-view depth estimation is increased.
In one possible implementation manner, in order to perform non-local neighborhood matching on the pixel points within the scope of the receptive field as large as possible, so as to improve accuracy of non-local neighborhood sensing characteristics, the computer device may perform image processing on the first image and the second image before performing non-local neighborhood matching on the first image and the second image, so that the processed images are subjected to non-local neighborhood matching.
Referring to fig. 4, a flowchart of a depth map optimization method according to an exemplary embodiment of the present application is shown, where the method is used for a computer device (including a terminal 120 and/or a server 140) as an example, and the method includes the following steps:
step 401, extracting features of the first image and a second image corresponding to the first image to obtain a first original depth image feature corresponding to the first image and a second original depth image feature corresponding to the second image, where the first image and the second image are images obtained by shooting the same three-dimensional reconstruction object under different viewing angles.
For the specific implementation of step 401, reference may be made to step 201, and this embodiment is not described herein.
Step 402, performing color channel weighted combination on the first image and the second image to obtain a first weighted combined image and a second weighted combined image, where the first weighted combined image and the second weighted combined image are single-channel images.
Optionally, the first image and the second image are RGB images, and have three channels of red, green and blue. In one possible implementation, the computer device may assign different weights to the different color channels in consideration of the robustness of the different color channels to the anisotropic feature, and perform color channel weighted combination on the first image and the second image based on the weight values, thereby improving the robustness of the non-local neighborhood aware feature.
Optionally, since the blue channel of the image is more robust to anisotropic features, the computer device may increase the weight corresponding to the blue channel, such as setting the weight of the blue channel to 1.0, the weight of the green channel to 0.3, and the weight of the red channel to 0.1, during the color channel weighted combining process.
In one possible implementation manner, the computer device weights the red, green and blue color channels of the first image and the second image based on the weight values corresponding to the color channels, so as to obtain a first weighted combined image and a second weighted combined image, where the first weighted combined image and the second weighted combined image obtained by weighted combination are all single-channel images, that is, the dimension of the image after weighted combination is changed from the dimension of H×W×3 to the dimension of H×W×1.
Optionally, the blue channel image may be denoted I_B, the green channel image I_G and the red channel image I_R; the blue channel weight may be denoted w_B, the green channel weight w_G and the red channel weight w_R. The channel weighted combination can thus be expressed as I = w_B · I_B + w_G · I_G + w_R · I_R.
Illustratively, as shown in fig. 5, the first image corresponds to a blue channel 501, a green channel 502 and a red channel 503, where the weight of the blue channel 501 is 1.0, the weight of the green channel 502 is 0.3 and the weight of the red channel 503 is 0.1; the computer device weights and combines the three color channels accordingly.
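A minimal sketch of this weighted combination, assuming a PyTorch tensor in RGB channel order and the example weights above (blue 1.0, green 0.3, red 0.1):

```python
import torch

def weighted_combine(rgb: torch.Tensor) -> torch.Tensor:
    # rgb: (B, 3, H, W) in RGB channel order
    w_r, w_g, w_b = 0.1, 0.3, 1.0
    i_r, i_g, i_b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    # H x W x 3 -> H x W x 1 single-channel weighted combined image
    return w_b * i_b + w_g * i_g + w_r * i_r

combined = weighted_combine(torch.rand(1, 3, 480, 640))  # (1, 1, 480, 640)
```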
Step 403, downsampling the first weighted combined image and the second weighted combined image based on the sampling interval to obtain a first sampled image and a second sampled image.
In one possible implementation, in order to expand the receptive field of the non-local neighborhood matching, so that the non-local neighborhood pixels of the current pixel can be determined in as large a non-local neighborhood range as possible, and to share weights among different channels, the computer device obtains a sampling interval and then downsamples the first weighted combined image and the second weighted combined image based on the sampling interval, obtaining a first sampled image and a second sampled image.
Optionally, with sampling interval r, the image dimension of the sampled image obtained by downsampling can be expressed as H/r × W/r × r², and the sampled image is more compact than the weighted combined image.
Schematically, as shown in fig. 5, the sampling interval is 3 pixels, and the computer device downsamples the weighted combined image at this interval to obtain the compact H/3 × W/3 × 3² sampled image 504.
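The dimensions above (H × W × 1 to H/r × W/r × r²) match a pixel-unshuffle rearrangement; whether the patent uses exactly this operator is an assumption, but it reproduces the described compact sampled image:

```python
import torch
import torch.nn.functional as F

combined = torch.rand(1, 1, 480, 639)     # H x W x 1 weighted combined image
sampled = F.pixel_unshuffle(combined, 3)  # r = 3 -> (1, 9, 160, 213), i.e. H/3 x W/3 x 3^2
```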
Step 404, determining a first non-local neighborhood range based on the image dimension corresponding to the first sampled image.
In one possible implementation, the computer device determines a first non-local neighborhood range in the non-local neighborhood matching process according to the image dimension corresponding to the first sampling image, and a number of non-local neighborhood pixels determined in the first non-local neighborhood range.
Optionally, the first non-local neighborhood range may be equal to or smaller than an image dimension corresponding to the first sampled image, which is not limited by the embodiment of the present application.
Optionally, the first non-local neighborhood range and the number of non-local neighborhood pixels may be set by an image processor according to needs, and the computer device directly obtains and applies the first non-local neighborhood range and the number of non-local neighborhood pixels.
Step 405, performing non-local neighborhood matching on the first sampled image based on the first non-local neighborhood range, to obtain a first non-local neighborhood sensing feature.
Further, for each pixel point in the first sampling image, the computer equipment performs self-adaptive non-local neighborhood search matching on the pixel point in a first non-local neighborhood range corresponding to each pixel point, so as to obtain a first non-local neighborhood sensing characteristic corresponding to each pixel point. Optionally, the current pixel point is located at a center position of the first non-local neighborhood range; alternatively, in the case that the current pixel point is located at the image edge position, the current pixel point may also be located at the edge position in the first non-local neighborhood range.
In one possible implementation manner, the computer device may input the first sampled image into a first coordinate determination network and a first weight determination network, obtain, by using the first coordinate determination network, pixel point coordinates of non-local neighborhood pixel points corresponding to each pixel point in the first image, and obtain, by using the first weight determination network, an influence weight of the non-local neighborhood pixel points corresponding to each pixel point in the first image.
Schematically, as shown in fig. 5, the computer device performs non-local neighborhood matching on the first sampled image, obtains the pixel point coordinates of the non-local neighborhood pixel points corresponding to each pixel point through the first coordinate determining network 505, and obtains the influence weights of the non-local neighborhood pixel points corresponding to each pixel point through the first weight determining network 506, thereby obtaining the first non-local neighborhood sensing feature corresponding to each pixel point, and the non-local neighborhood sensing feature shown in fig. 5 is the non-local neighborhood sensing feature of the first pixel point 507.
Optionally, the network structures of the first coordinate determination network G and the first weight determination network K may both be U-Net structures. The coordinates of the non-local neighborhood pixels may be represented as O_p ∈ R^(2k²×1) and the influence weights of the non-local neighborhood pixels as W_p ∈ R^(k²×1). Determining the coordinates O_p through the first coordinate determination network G and the influence weights W_p through the first weight determination network K can then be expressed respectively as
O_p = G(Ĩ_r)(p(x, y)), W_p = K(Ĩ_r)(p(x, y)),
where Ĩ_r denotes the sampled image (downsampled at interval r) input to the network, and p(x, y) is the pixel coordinate of the current pixel in the image.
In one possible implementation manner, the computer device determines, through a first weight determining network, an influence weight of each pixel point in a first non-local neighborhood range of the current pixel point on the current pixel point, so as to sort the influence weights corresponding to each pixel point in a descending order, determine, based on the number of the non-local neighborhood pixel points, a non-local neighborhood pixel point corresponding to the current pixel point, and determine, through a first coordinate determining network, a pixel point coordinate of the non-local neighborhood pixel point. For example, the number of non-local neighborhood pixels is 9, and the computer device determines the pixel with the influence weight of the current pixel ranked 9 as the non-local neighborhood pixel, so that the pixel coordinates and the influence weights of the 9 non-local neighborhood pixels are determined as the first non-local neighborhood sensing feature of the current pixel, and further the first non-local neighborhood sensing feature corresponding to the first image is obtained.
Step 406, determining a second non-local neighborhood range based on the image dimension corresponding to the second sampled image.
In one possible implementation, the computer device determines a second non-local neighborhood range in the non-local neighborhood matching process according to the image dimension corresponding to the second sampled image, and the number of non-local neighborhood pixels determined in the second non-local neighborhood range.
Optionally, the second non-local neighborhood range may be equal to or smaller than the image dimension corresponding to the second sampled image, which is not limited by the embodiment of the present application.
Optionally, the second non-local neighborhood range and the number of non-local neighborhood pixels may be set by an image processor according to needs, and the computer device directly obtains and applies the second non-local neighborhood range and the number of non-local neighborhood pixels.
Step 407, performing non-local neighborhood matching on the second sampled image based on the second non-local neighborhood range to obtain a second non-local neighborhood sensing feature.
Further, for each pixel point in the second sampling image, the computer device performs self-adaptive non-local neighborhood search matching on the pixel point in a second non-local neighborhood range corresponding to each pixel point, so as to obtain second non-local neighborhood sensing characteristics corresponding to each pixel point. Optionally, the current pixel is located at a center position of the second non-local neighborhood range; alternatively, in the case that the current pixel point is located at the image edge position, the current pixel point may also be located at the edge position in the second non-local neighborhood range.
In one possible implementation manner, the computer device may input the second sampled image into the first coordinate determining network and the first weight determining network, obtain, by using the first coordinate determining network, pixel point coordinates of non-local neighborhood pixels corresponding to each pixel point in the second image, and obtain, by using the first weight determining network, an influence weight of the non-local neighborhood pixels corresponding to each pixel point in the second image, thereby obtaining a second non-local neighborhood perceptual feature corresponding to the second image.
Step 408, performing image dimension matching on the first non-local neighborhood sensing feature based on the image dimension of the first original depth image feature to obtain a first non-local neighborhood sensing feature with the dimension matched, wherein the image dimension of the first non-local neighborhood sensing feature with the dimension matched is the same as the image dimension of the first original depth image feature.
In one possible implementation manner, since the first non-local neighborhood perceptual feature is formed by pixel coordinates of a non-local neighborhood pixel point corresponding to each pixel point in the image and an influence weight, that is, the first non-local neighborhood perceptual feature belongs to a feature domain, before performing feature optimization, the computer device may perform image dimension matching on the first non-local neighborhood perceptual feature according to an image dimension of the first original depth image feature, and convert the first non-local neighborhood perceptual feature from the feature domain to a spatial domain, so that an image dimension of the first non-local neighborhood perceptual feature after dimension matching is the same as an image dimension of the first original depth image feature.
In one possible implementation, the computer device may perform image dimension matching on the first non-local neighborhood sensing feature via Pixel-Shuffle (PS), thereby transforming it from the feature domain to the spatial domain.
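A short sketch of the Pixel-Shuffle dimension matching, assuming r = 3 and 16 feature channels as in the dimensions described earlier:

```python
import torch
import torch.nn.functional as F

# feature-domain tensor: r^2 * C channels at H/r x W/r resolution
neighborhood_feat = torch.rand(1, 9 * 16, 160, 213)
matched = F.pixel_shuffle(neighborhood_feat, 3)  # -> (1, 16, 480, 639), same dims as the original feature
```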
Step 409, performing image dimension matching on the second non-local neighborhood sensing feature based on the image dimension of the second original depth image feature to obtain a second non-local neighborhood sensing feature with the dimension matched, wherein the image dimension of the second non-local neighborhood sensing feature with the dimension matched is the same as the image dimension of the second original depth image feature.
In one possible implementation manner, since the second non-local neighborhood perceptual feature is formed by pixel coordinates of a non-local neighborhood pixel point corresponding to each pixel point in the image and an influence weight, that is, the second non-local neighborhood perceptual feature belongs to a feature domain, in order to improve efficiency of feature optimization on the second original depth image feature, before performing feature optimization, the computer device may perform image dimension matching on the second non-local neighborhood perceptual feature according to an image dimension of the second original depth image feature, and convert the second non-local neighborhood perceptual feature from the feature domain to a spatial domain, so that an image dimension of the second non-local neighborhood perceptual feature after dimension matching is the same as an image dimension of the second original depth image feature.
In one possible implementation, the computer device may likewise perform image dimension matching on the second non-local neighborhood sensing feature via Pixel-Shuffle (PS), thereby transforming it from the feature domain to the spatial domain.
And 410, performing feature optimization on the first original depth image feature by using the first non-local neighborhood sensing feature after dimension matching to obtain a first optimized depth image feature.
In one possible implementation, after performing image dimension matching on the first non-local neighborhood aware feature, the computer device performs feature optimization on the first original depth image feature using the dimension-matched first non-local neighborhood aware feature to obtain a first optimized depth image feature.
Optionally, the computer device may perform pixel-level element addition on the original depth image features corresponding to each pixel point and the non-local neighborhood perceptual features, so as to obtain optimized depth image features corresponding to each pixel point.
In one possible implementation, the feature optimization process may be expressed as
F_refined(p) = F_base(p) + Σ_{i=1}^{n} W_p^i · F_base(x + Δx_i^p, y + Δy_i^p),
where n = k² is the number of non-local neighborhood pixels corresponding to each pixel, p(x, y) is the current pixel, Δx_i^p is the offset of the i-th non-local neighborhood pixel relative to the current pixel in the horizontal dimension, Δy_i^p is its offset in the vertical dimension, W_p^i is the influence weight of the non-local neighborhood pixel, and F_base is the original depth image feature.
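A minimal sketch of this formula, assuming a dense per-pixel offset/weight layout and using bilinear grid_sample for the differentiable gather (a deformable-convolution-style realization, not necessarily the patented one):

```python
import torch
import torch.nn.functional as F

def optimize_features(feat, offsets, weights):
    # feat:    (B, C, H, W)   original depth image features F_base
    # offsets: (B, n, 2, H, W) per-pixel (dx, dy) of n neighborhood pixels
    # weights: (B, n, H, W)   influence weights W_p
    b, c, h, w = feat.shape
    n = weights.shape[1]
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()      # (2, H, W) pixel grid
    out = feat.clone()
    for i in range(n):
        pos = base + offsets[:, i]                   # absolute sample positions
        gx = 2.0 * pos[:, 0] / (w - 1) - 1.0         # normalize to [-1, 1]
        gy = 2.0 * pos[:, 1] / (h - 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1)         # (B, H, W, 2)
        gathered = F.grid_sample(feat, grid, align_corners=True)
        out = out + weights[:, i : i + 1] * gathered
    return out
```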
And step 411, performing feature optimization on the second original depth image feature by using the second non-local neighborhood sensing feature after dimension matching to obtain a second optimized depth image feature.
In one possible implementation, after performing image dimension matching on the second non-local neighborhood aware feature, the computer device performs feature optimization on the second original depth image feature using the dimension-matched second non-local neighborhood aware feature to obtain a second optimized depth image feature.
Optionally, the computer device may perform pixel-level element addition on the original depth image features corresponding to each pixel point and the non-local neighborhood perceptual features, so as to obtain optimized depth image features corresponding to each pixel point.
Step 412, obtaining a first camera parameter and a first camera viewpoint corresponding to the first image, and a second camera parameter corresponding to the second image.
In one possible implementation, to improve the accuracy of depth estimation of each pixel in the first image, the computer device may first make a depth assumption on the depth value of each pixel based on the depth image feature, so that the computer device first needs to obtain a first camera parameter corresponding to the first image, and optionally, a second camera parameter corresponding to the second image, where the camera parameters include a camera intrinsic parameter and a camera extrinsic parameter, and the camera extrinsic parameter includes a rotation matrix and a translation vector corresponding to the original depth image feature.
In one possible implementation, considering that the first image is captured by the first camera, and the depth value of each pixel point in the first image characterizes the distance from the camera viewpoint to the surface of the object in the space, the computer device needs to acquire the first camera viewpoint corresponding to the first image.
Step 413, performing differential projection transformation on the first optimized depth image feature and the second optimized depth image feature along the first camera viewpoint direction based on the first camera parameter and the second camera parameter to obtain a first feature corresponding to the first image and a second feature corresponding to the second image, wherein the feature is a conical three-dimensional space along the first camera viewpoint direction, and represents the image features of each pixel point in different depth hypothesis layers.
In one possible implementation manner, after the first camera parameter and the second camera parameter are acquired, in order to perform depth assumption on each pixel point in the first image, the computer device performs differential projection transformation on the first optimized depth image feature and the second optimized depth image feature along the first camera viewpoint direction, that is, the first optimized depth image feature and the second optimized depth image feature are projected onto a parallel projection plane in front of the first camera view cone, so as to obtain a first feature corresponding to the first image and a second feature corresponding to the second image, so that the first feature and the second feature overlap at the same spatial position.
In one possible implementation, considering that the second image is formed along the second camera viewpoint direction, the computer device further needs to determine the transformation relationship between the first image and the second image in order to improve the accuracy of the differentiable projective transformation of the second optimized depth image feature along the first camera viewpoint direction. Optionally, the computer device may determine a pair of well-matched feature pixels p_1 and p_2 from the first optimized depth image feature and the second optimized depth image feature, derive a homography matrix from the transformation between p_1 and p_2, and then perform the differentiable projective transformation on the second optimized depth image feature along the first camera viewpoint direction based on the homography matrix to obtain the second feature body.
Optionally, the differentiable projective transformation of the optimized depth image features may be expressed as
p_i(d) ∼ K_i · T_i · T_0⁻¹ · (d · K_0⁻¹ · p),
where K_i and T_i are the camera intrinsics and extrinsics of the second camera, K_0 and T_0 are the camera intrinsics and extrinsics of the first camera respectively, and d is the depth hypothesis, obtained by uniform sampling from [d_min, d_max].
Optionally, the feature body is a conical stereoscopic space along the first camera viewpoint direction and represents the image features of each pixel at different depth hypothesis layers. Illustratively, the dimension of the optimized depth image feature is H × W × C, and the dimension of the feature body obtained by depth assumption through the differentiable projective transformation is H × W × C × D, where W is the width of the first image, H is the height of the first image, C is the number of feature channels, and D is the number of depth samples.
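The following is a hedged sketch of this differentiable projective transformation for one source view, following the standard plane-sweep warping; the names K_0, T_0, K_i, T_i and the [d_min, d_max] sampling follow the text, while the exact patented formulation may differ:

```python
import torch
import torch.nn.functional as F

def build_feature_volume(src_feat, K0, T0, Ki, Ti, depth_hyps):
    # src_feat: (B, C, H, W); K*: (B, 3, 3); T*: (B, 4, 4) world-to-camera
    # depth_hyps: 1-D tensor, e.g. torch.linspace(d_min, d_max, D)
    b, c, h, w = src_feat.shape
    rel = Ti @ torch.inverse(T0)                 # reference -> source pose
    R, t = rel[:, :3, :3], rel[:, :3, 3:]
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack((xs, ys, ones), dim=0).reshape(3, -1)  # (3, H*W)
    rays = torch.inverse(K0) @ pix.unsqueeze(0)              # back-projected rays
    volume = []
    for d in depth_hyps:                                     # D depth hypothesis layers
        cam = R @ (rays * d) + t                             # points in the source frame
        proj = Ki @ cam                                      # (B, 3, H*W)
        uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
        gx = 2.0 * uv[:, 0] / (w - 1) - 1.0
        gy = 2.0 * uv[:, 1] / (h - 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1).reshape(b, h, w, 2)
        volume.append(F.grid_sample(src_feat, grid, align_corners=True))
    return torch.stack(volume, dim=-1)                       # (B, C, H, W, D)
```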
In step 414, cost measurement is performed on the first feature and the second feature, so as to obtain a cost body corresponding to the first image, where the cost body is a three-dimensional structure formed by connecting the cost graphs in the depth direction, and each pixel point on the cost graphs represents a matching cost of a corresponding pixel point on the first image and a corresponding pixel point on the second image in the depth value.
Further, the computer device performs cost measurement on the first feature body and the second feature body to obtain a cost body corresponding to the first image, namely, performs cost measurement on similarity of a plurality of feature bodies in the same spatial position to form the cost body.
Optionally, the cost body is a three-dimensional structure formed by connecting cost maps along the depth direction, and each pixel point on a cost map represents the matching cost, at the corresponding depth value, between the corresponding pixel point on the first image and the corresponding pixel point on the second image. The computer device obtains matching costs by performing cost measurement on the image features of each pixel point at the different depth hypothesis layers of the feature bodies, so that the matching costs of all pixel points across the depth hypothesis layers form the cost body, whose dimensions are H×W×C×D.
In one possible implementation, the computer device may adopt a variance-based measurement to assess the similarity of the feature bodies at the same spatial positions, thereby obtaining the cost body corresponding to the first image. A variance-based measurement can perform cost measurement on any number of feature bodies and explicitly measures the differences between the depth image features.
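A variance-based measurement of this kind can be sketched in a few lines; the tensor layout is assumed to match the feature bodies above, and the function works for any number of views.

```python
import torch

def variance_cost_volume(volumes):
    """volumes: list of N tensors, each (B, C, H, W, D), e.g. the first
    feature body plus one or more warped second feature bodies."""
    stacked = torch.stack(volumes, dim=0)        # (N, B, C, H, W, D)
    mean = stacked.mean(dim=0)
    # Element-wise variance across views; low cost means consistent features.
    return ((stacked - mean) ** 2).mean(dim=0)   # (B, C, H, W, D)
```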
In one possible implementation, to improve the accuracy of the cost measurement, before performing cost measurement on the first feature body and the second feature body to obtain the cost body, the computer device may also perform non-local neighborhood matching on the first feature body in three-dimensional space and optimize the first feature body with the resulting non-local neighborhood sensing features.
In one possible implementation, the computer device performs non-local neighborhood matching on the first feature body in three-dimensional space to obtain a third non-local neighborhood sensing feature corresponding to each voxel in the first feature body, where the third non-local neighborhood sensing feature comprises the voxel coordinates and influence weights of at least two non-local neighborhood voxels matched with the current voxel. Similar to the non-local neighborhood matching performed on the images in two-dimensional space, the computer device first determines a third non-local neighborhood range corresponding to the first feature body; based on this range, it obtains the voxel coordinates of the non-local neighborhood voxels for each voxel in the first feature body through a second coordinate determination network, and obtains the influence weights of those non-local neighborhood voxels through a second weight determination network, thereby obtaining the third non-local neighborhood sensing feature. Schematically, fig. 6 shows the non-local neighborhood sensing feature corresponding to a first voxel 601.
Optionally, the second coordinate determination network G and the second weight determination network K may both adopt a U-Net structure; through the three-dimensional networks G and K, the computer device can learn the position offsets $O_q \in \mathbb{R}^{3k^3 \times 1}$ of the three-dimensional non-local neighborhood voxels and their influence weights $W_q \in \mathbb{R}^{k^3 \times 1 \times 1}$.
Further, the computer device performs feature optimization on the first feature body through the third non-local neighborhood sensing feature to obtain a first optimized feature body, and then performs cost measurement on the first optimized feature body and the second feature body to obtain the cost body corresponding to the first image. The feature optimization that produces the first optimized feature body can be expressed as

$$V_{opt}(x, y, d) = V_{base}(x, y, d) \oplus \sum_{q=1}^{N} W_q \otimes V_{base}\left(x + \Delta x_q,\; y + \Delta y_q,\; d + \Delta d_q\right)$$

where $N = k^3$ is the number of non-local neighborhood voxels associated with each voxel, $q(x, y, d)$ indexes a non-local neighborhood voxel, $\Delta x_q$, $\Delta y_q$ and $\Delta d_q$ are that voxel's offsets relative to the current voxel in the horizontal, vertical and depth dimensions, $W_q$ is its influence weight, $\oplus$ and $\otimes$ denote voxel-level element-wise addition and multiplication, and $V_{base}$ is the first feature body.
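Under the assumption that the offsets come from the second coordinate determination network G and the weights from the second weight determination network K, the aggregation above can be sketched as follows; the tensor layouts and the trilinear gather via grid_sample are illustrative choices, not the patent's prescribed implementation.

```python
import torch
import torch.nn.functional as F

def aggregate_non_local_3d(v_base, offsets, weights):
    """v_base: (B, C, D, H, W); offsets: (B, N, 3, D, H, W) in voxel units
    with component order (dx, dy, dd), N = k**3; weights: (B, N, D, H, W)."""
    B, C, D, H, W = v_base.shape
    N = weights.shape[1]
    # Base sampling grid in normalized [-1, 1] coordinates, (x, y, d) order.
    gd, gy, gx = torch.meshgrid(torch.linspace(-1, 1, D),
                                torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing="ij")
    base = torch.stack([gx, gy, gd], dim=-1)                 # (D, H, W, 3)
    scale = torch.tensor([2.0 / max(W - 1, 1),               # voxel -> normalized
                          2.0 / max(H - 1, 1),
                          2.0 / max(D - 1, 1)])
    out = torch.zeros_like(v_base)
    for q in range(N):
        off = offsets[:, q].permute(0, 2, 3, 4, 1)           # (B, D, H, W, 3)
        grid = base.unsqueeze(0) + off * scale               # shift per voxel
        sampled = F.grid_sample(v_base, grid, align_corners=True)
        out = out + weights[:, q, None] * sampled            # W_q (x) V_base
    return v_base + out                                      # residual fusion (+)
```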
Step 415, regularizing the cost body and regressing it along the depth direction to obtain a probability body, wherein the probability body represents the depth probability of each pixel point at each depth hypothesis layer.
In one possible implementation, after obtaining the cost body through cost measurement, the computer device can determine the depth probability of each pixel point at each depth hypothesis layer from the matching costs of the corresponding depth hypothesis layers in the cost body: the smaller the matching cost, the higher the similarity of the corresponding pixel points of the feature bodies at the same spatial position, and therefore the larger the depth probability of the pixel point at the corresponding depth hypothesis layer.
In one possible implementation, the computer device may regularize the cost body through a 3D convolutional neural network (Convolutional Neural Networks, CNN) to make it more robust, and regress the matching costs of the pixel points at each depth hypothesis layer along the depth direction to obtain the depth probabilities of each pixel point at each depth hypothesis layer, yielding the probability body. The probability body represents the depth probability of each pixel point at each depth hypothesis layer; optionally, its dimensions are H×W×1×D.
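A minimal sketch of this regularization-and-regression step is given below; the layer count, channel widths and the sign convention that maps low matching cost to high probability are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CostRegularizer(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels, 8, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(8, 8, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(8, 1, 3, padding=1),       # collapse feature channels
        )

    def forward(self, cost):                     # cost: (B, C, D, H, W)
        logits = self.net(cost).squeeze(1)       # (B, D, H, W)
        # Lower matching cost should mean higher probability, hence the minus.
        return torch.softmax(-logits, dim=1)     # probability body
```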
Step 416, determining depth values corresponding to the pixel points based on the probability body, and obtaining a depth map corresponding to the first image.
Further, the computer device may determine the depth value corresponding to each pixel point in the first image according to the depth probabilities, represented by the probability body, of each pixel point at each depth hypothesis layer, so as to obtain the depth map corresponding to the first image.
In one possible implementation, the computer device may adopt a Winner Take All (WTA) policy: from the probability body, the depth hypothesis layer with the largest depth probability among the depth hypothesis layers corresponding to each pixel point is selected as the target depth hypothesis layer, and the depth value corresponding to that target depth hypothesis layer is taken as the target depth value, yielding the depth map corresponding to the first image.
In another possible implementation, the computer device computes, for each pixel point, the weighted sum of the depth values of the depth hypothesis layers weighted by their depth probabilities, and takes the resulting value as the target depth value, thereby obtaining the depth map corresponding to the first image.
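Both read-outs can be sketched together; prob and depth_values are assumed to follow the probability-body layout described above.

```python
import torch

def depth_from_probability(prob, depth_values, mode="soft"):
    """prob: (B, D, H, W); depth_values: (D,) hypothesis depths."""
    if mode == "wta":                          # Winner Take All
        idx = prob.argmax(dim=1)               # (B, H, W) best hypothesis layer
        return depth_values[idx]
    # Expectation over hypothesis layers: sum_d p(d) * d (soft-argmax).
    return (prob * depth_values.view(1, -1, 1, 1)).sum(dim=1)
```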
In the above embodiment, before non-local neighborhood matching is performed on the first image and the second image, the differing robustness of the color channels to anisotropic features is taken into account: different weight values are set for the different color channels and weighted channel merging is performed to obtain weighted merged images. To perform non-local neighborhood matching over a larger receptive field, the weighted merged images can additionally be sampled at a sampling interval, so that non-local neighborhood matching is performed on the sampled images. This improves the accuracy of the non-local neighborhood sensing features and thereby enhances the effect of depth map optimization.
In one possible implementation, the computer device may perform depth estimation on the first image by constructing a multi-view depth estimation network. Optionally, the multi-view depth estimation network comprises a feature pyramid network, a non-local neighborhood sensing network, a feature optimization network and a depth map generation network: the feature pyramid network extracts features from the first image and the second image to obtain the first original depth image feature and the second original depth image feature; the non-local neighborhood sensing network performs non-local neighborhood matching on the first image and the second image to obtain the first non-local neighborhood sensing feature and the second non-local neighborhood sensing feature; the feature optimization network optimizes the original depth image features with the non-local neighborhood sensing features to obtain the optimized depth image features; and the depth map generation network generates the depth map corresponding to the first image from the first optimized depth image feature and the second optimized depth image feature.
In one possible implementation, to improve the depth estimation accuracy of the multi-view depth estimation network, the computer device may also train and optimize the multi-view depth estimation network with a depth loss.
Referring to fig. 7, a flowchart of a depth map optimization method according to an exemplary embodiment of the present application is shown, where the method is used for a computer device (including a terminal 120 and/or a server 140) as an example, and the method includes the following steps:
in step 701, feature extraction is performed on a first sample image and a second sample image corresponding to the first sample image through a feature pyramid network in the multi-view depth estimation network, so as to obtain a first sample original depth image feature corresponding to the first sample image and a second sample original depth image feature corresponding to the second sample image, where the first sample image and the second sample image are images obtained by photographing the same sample three-dimensional reconstruction object under different viewing angles.
In one possible implementation, to three-dimensionally reconstruct a sample three-dimensional reconstruction object, the computer device acquires an unordered group of sample images captured of that object under different viewing angles. To perform depth estimation on a first sample image, the computer device determines at least two second sample images matched with the first sample image from the unordered sample image group, and then extracts features from the first sample image and the second sample images through the feature pyramid network in the multi-view depth estimation network, obtaining the first sample original depth image features and the second sample original depth image features.
Step 702, performing non-local neighborhood matching on the first sample image and the second sample image through a non-local neighborhood sensing network in the multi-view depth estimation network, so as to obtain a first sample non-local neighborhood sensing characteristic corresponding to each sample pixel point in the first sample image and a second sample non-local neighborhood sensing characteristic corresponding to each sample pixel point in the second sample image.
In one possible implementation, in order to fully account for the sample non-local neighborhood sensing features that may exist for the current sample pixel point in the first sample image, that is, to determine the influence weights on the current sample pixel point of non-local neighborhood sample pixel points other than its adjacent sample pixel points, and to determine the pixel point coordinates of those non-local neighborhood sample pixel points, the computer device performs non-local neighborhood matching on the first sample image and the second sample image through the non-local neighborhood sensing network in the multi-view depth estimation network, obtaining the first sample non-local neighborhood sensing features and the second sample non-local neighborhood sensing features.
Optionally, the non-local neighborhood sensing network may be an Adaptive Non-local Neighbors Matching (ANNM) network. The computer device may input a sample image into the ANNM network, which performs color channel merging and sampling on the sample image and then performs adaptive non-local neighborhood matching on the sampled image, obtaining the non-local neighborhood sensing features corresponding to the sample image.
Step 703, performing feature optimization on the first sample original depth image feature by using the first sample non-local neighborhood perceptual feature and performing feature optimization on the second sample original depth image feature by using the second sample non-local neighborhood perceptual feature through a feature optimization network in the multi-view depth estimation network, so as to obtain a first sample optimized depth image feature and a second sample optimized depth image feature.
Further, the computer device performs feature optimization on the first sample original depth image feature by using the first sample non-local neighborhood sensing feature and performs feature optimization on the second sample original depth image feature by using the second sample non-local neighborhood sensing feature through a feature optimization network in the multi-view depth estimation network, so as to obtain the first sample optimized depth image feature and the second sample optimized depth image feature.
Optionally, the computer device may perform image dimension matching on the sample non-local neighborhood perceptual feature through the feature optimization network, and perform pixel-level element addition on the sample non-local neighborhood perceptual feature and the sample original depth image feature, so as to obtain a sample optimized depth image feature.
Step 704, generating, by a depth map generation network in the multi-view depth estimation network, a sample depth map corresponding to the first sample image based on the first sample optimized depth image feature and the second sample optimized depth image feature.
In one possible implementation, after obtaining the first sample-optimized depth image feature and the second sample-optimized depth image feature, the computer device generates a sample depth map corresponding to the first sample image through a depth map generation network in a multi-view depth estimation network.
Optionally, the depth map generation network may comprise multiple sub-networks, such as a 3D ANNM network and a 3D CNN network. In one possible implementation, the computer device performs the differentiable projective transformation on the first sample optimized depth image feature and the second sample optimized depth image feature through the depth map generation network to obtain a first sample feature body and a second sample feature body; determines the sample non-local neighborhood sensing features corresponding to the first sample feature body through the 3D ANNM network and optimizes the first sample feature body with them to obtain a first sample optimized feature body; performs variance-based cost measurement on the first sample optimized feature body and the second sample feature body to obtain a sample cost body; regularizes the sample cost body through the 3D CNN network and regresses it along the depth direction to obtain a sample probability body; and finally determines, through the WTA policy, the depth value corresponding to each sample pixel point in the first sample image based on the sample probability body, obtaining the sample depth map.
Step 705, determining a target depth loss based on the real depth map corresponding to the first sample image and the sample depth map.
In one possible implementation manner, after obtaining the sample depth map corresponding to the first sample image through the multi-view depth estimation network, in order to perform training optimization on the multi-view depth estimation network, the computer device may determine the target depth loss according to the real depth map corresponding to the first sample image and the sample depth map.
In one possible implementation, to improve network training efficiency and bring the depth distribution of the sample depth map as close as possible to that of the real depth map, the computer device may compute the cross entropy between the depth distribution corresponding to the real depth map and the depth distribution corresponding to the sample depth map through a cross-entropy loss function, obtaining the first depth loss.
Alternatively, the determination of the first depth loss through the cross-entropy calculation can be expressed as

$$Loss_{CE} = -\sum_{p \in \Omega} P_{GT}(p) \log P(p)$$

where $Loss_{CE}$ is the first depth loss, $P_{GT}$ is the depth distribution corresponding to the real depth map, $P$ is the depth distribution corresponding to the sample depth map, $\Omega$ denotes the valid pixel points in the real depth distribution, and $p$ is a sample pixel point.
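A sketch of this cross-entropy computation, with a boolean validity mask standing in for the set $\Omega$ of valid pixel points:

```python
import torch

def depth_cross_entropy(p_gt, p_pred, valid_mask, eps=1e-8):
    """p_gt, p_pred: (B, D, H, W) distributions over hypothesis layers;
    valid_mask: (B, H, W) boolean mask of pixels with ground-truth depth."""
    ce = -(p_gt * torch.log(p_pred + eps)).sum(dim=1)   # (B, H, W)
    return ce[valid_mask].mean()                        # average over Omega
```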
In one possible implementation, to improve the decorrelation between the pixel points in the sample depth map so that its energy is concentrated in the low-frequency components of the spectral space, the computer device may further process the sample depth map through a discrete cosine transform (Discrete Cosine Transform, DCT), and determine the second depth loss based on the depth distribution corresponding to the real depth map and the depth distribution corresponding to the DCT-transformed sample depth map.
In one possible implementation, the computer device divides the sample depth map into image blocks of different frequency components through the DCT, quantizes the image blocks, discarding the high-frequency components and retaining the low-frequency components during quantization, and composes a new image from the quantization-compressed image blocks.
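The block splitting and quantization table are simplified away in the following sketch, which only illustrates the core frequency-domain step: an orthonormal DCT-II applied to a depth map, high-frequency coefficients zeroed, and the inverse transform recovering a low-frequency image. The keep parameter is an illustrative assumption.

```python
import math
import torch

def dct_matrix(n):
    k = torch.arange(n).float()
    # Row f, column s: cos(pi/n * (s + 0.5) * f), orthonormal DCT-II basis.
    basis = torch.cos(math.pi / n * (k[None, :] + 0.5) * k[:, None])
    basis[0] *= 1.0 / math.sqrt(2.0)
    return basis * math.sqrt(2.0 / n)          # (n, n), rows are frequencies

def keep_low_frequencies(depth_map, keep=16):
    """depth_map: (H, W). Retains only the top-left keep x keep (low
    frequency) DCT coefficients before inverting the transform."""
    H, W = depth_map.shape
    Ch, Cw = dct_matrix(H), dct_matrix(W)
    coeff = Ch @ depth_map @ Cw.t()            # separable 2D DCT
    mask = torch.zeros_like(coeff)
    mask[:keep, :keep] = 1.0                   # discard high frequencies
    return Ch.t() @ (coeff * mask) @ Cw        # inverse via orthonormality
```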
Alternatively, the determination of the second depth loss based on the DCT can be expressed as

$$Loss_{EA} = -\sum_{p \in \Omega} P_{GT}(p) \log \widetilde{P}(p)$$

where $Loss_{EA}$ is the second depth loss, $P_{GT}$ is the depth distribution corresponding to the real depth map, $\widetilde{P}$ is the depth distribution corresponding to the DCT-transformed sample depth map, $\Omega$ denotes the valid pixel points in the real depth distribution, and $p$ is a sample pixel point.
Step 706, optimizing the multi-view depth estimation network with a target depth penalty.
Furthermore, the computer device performs optimization training on the multi-view depth estimation network with the target depth loss, improving the accuracy of the network's image depth estimation.
In one possible implementation, after deriving the first depth penalty and the second depth penalty, the computer device determines a target depth penalty based on the penalty weights corresponding to the respective depth penalties, thereby optimizing the multi-view depth estimation network.
Optionally, the loss weights corresponding to the first depth loss and the second depth loss may be the same, or may be determined according to the actual features of the image, or optionally, the loss weights may be set by the image processor, which is not limited in the embodiment of the present application.
Alternatively, the determination of the target depth loss based on the first depth loss and the second depth loss can be expressed as

$$Loss = \lambda_1 \, Loss_{CE} + \lambda_2 \, Loss_{EA}$$

where $Loss_{CE}$ is the first depth loss, $\lambda_1$ is the loss weight corresponding to the first depth loss, $Loss_{EA}$ is the second depth loss, and $\lambda_2$ is the loss weight corresponding to the second depth loss.
Alternatively, when depth maps are estimated at m layers, the target depth loss can be expressed as

$$Loss_{total} = \sum_{l=1}^{m} \left( \lambda_1 \, Loss_{CE}^{(l)} + \lambda_2 \, Loss_{EA}^{(l)} \right)$$

where the superscript $(l)$ indexes the depth losses of the l-th layer's depth map estimate.
In the above embodiment, when determining the depth loss from the depth distribution corresponding to the real depth map and that corresponding to the sample depth map, the first depth loss is obtained through the cross-entropy calculation, and in addition a discrete cosine transform is applied to the depth distribution corresponding to the sample depth map so that the second depth loss is determined from the transformed distribution and the distribution corresponding to the real depth map. Optimizing the multi-view depth estimation network with both the first depth loss and the second depth loss improves training efficiency, brings the depth distribution of the sample depth map as close as possible to that of the real depth map, and improves the decorrelation between the pixel points in the sample depth map so that its energy is concentrated in the low-frequency components of the spectral space.
With the method provided by the embodiments of the present application, only an unordered group of images of the three-dimensional reconstruction object captured under different viewing angles is required: the corresponding depth maps are obtained through the end-to-end multi-view depth estimation network, and the three-dimensional reconstruction result is then derived from them.
In one possible implementation, when performing the differentiable projective transformation on the first optimized depth image feature and the second optimized depth image feature and making depth hypotheses for each pixel point, directly hypothesizing depths for every pixel point in the full-resolution first image may introduce large hypothesis errors. The computer device may therefore first downsample the first image and then, during upsampling, perform depth estimation on each layer of sampled images from coarse to fine.
Referring to fig. 8, a flowchart of a depth map optimization method according to an exemplary embodiment of the present application is shown, where the method is used for a computer device (including a terminal 120 and/or a server 140) as an example, and the method includes the following steps:
Step 801, extracting features of the first image and the second image through a feature pyramid network to obtain m layers of first original depth image features corresponding to the first image and m layers of second original depth image features corresponding to the second image.
In one possible implementation, when the computer device extracts features from the first image and the second image through the m-layer feature pyramid network, the network first downsamples the first image and the second image to obtain m layers of first sampled images and m layers of second sampled images; then, starting the upsampling from the first-layer first sampled image and the first-layer second sampled image, it sequentially extracts features from the m layers of first sampled images and m layers of second sampled images, obtaining the m layers of first original depth image features and m layers of second original depth image features.
In an illustrative example, the feature pyramid network has 4 layers from coarse to fine and extracts image features layer by layer to obtain 4 layers of original depth image features: the first-layer original depth image features correspond to the first-layer sampled image, which is a 1/8-size image; the second-layer original depth image features correspond to the second-layer sampled image, which is a 1/4-size image; the third-layer original depth image features correspond to the third-layer sampled image, which is a 1/2-size image; and the fourth-layer original depth image features correspond to the fourth-layer sampled image, which has the same size as the original image.
Step 802, performing non-local neighborhood matching on the first image and the second image respectively to obtain m layers of first non-local neighborhood sensing features corresponding to each pixel point in the first image and m layers of second non-local neighborhood sensing features corresponding to each pixel point in the second image.
In one possible implementation manner, in order to perform feature optimization on m layers of original depth image features, the computer device needs to obtain m layers of non-local neighborhood perception features respectively, which are in one-to-one correspondence with the m layers of original depth image features.
In one possible implementation manner, the computer device resamples the first image and the second image to obtain m-layer first sampled images and m-layer second sampled images, so as to perform non-local neighborhood matching on the m-layer first sampled images and the m-layer second sampled images respectively, and obtain m-layer first non-local neighborhood sensing features corresponding to the m-layer first sampled images and m-layer second non-local neighborhood sensing features corresponding to the m-layer second sampled images.
And 803, performing feature optimization on the first original depth image feature of the ith layer by using the first non-local neighborhood sensing feature of the ith layer, and performing feature optimization on the second original depth image feature of the ith layer by using the second non-local neighborhood sensing feature of the ith layer to obtain the first optimized depth image feature of the ith layer and the second optimized depth image feature of the ith layer, wherein i is more than or equal to 1 and less than or equal to m.
In one possible implementation, after obtaining the m layers of original depth image features and the m layers of non-local neighborhood sensing features, the computer device performs feature optimization on each layer of original depth image features in turn, starting from the first layer. Optionally, the computer device optimizes the i-th layer original depth image features with the i-th layer non-local neighborhood sensing features to obtain the i-th layer first optimized depth image feature and the i-th layer second optimized depth image feature, where i is an integer and 1 ≤ i ≤ m.
Step 804, generating an mth depth map corresponding to the first image based on the mth layer first optimized depth image feature and the mth layer second optimized depth image feature.
In one possible implementation, to improve the accuracy of the depth map corresponding to the first image, the computer device may perform depth estimation on each layer of sampled images from coarse to fine, using the coarser i-th depth map for depth hypothesis guidance when generating the (i+1)-th depth map. The computer device then generates the m-th depth map corresponding to the first image based on the m-th layer first optimized depth image feature and the m-th layer second optimized depth image feature, the m-th depth map being the target depth map corresponding to the first image.
In one possible implementation, when performing the differentiable projective transformation on the (i+1)-th layer first optimized depth image feature and the (i+1)-th layer second optimized depth image feature along the first camera viewpoint direction based on the first camera parameter and the second camera parameter, the computer device uses the i-th depth map for depth hypothesis guidance to obtain the (i+1)-th layer first feature body corresponding to the first image and the (i+1)-th layer second feature body corresponding to the second image, and then generates the (i+1)-th depth map corresponding to the first image based on the (i+1)-th layer first feature body and the (i+1)-th layer second feature body.
In one possible implementation, to improve depth estimation accuracy while reducing computation, the computer device may make as many depth hypotheses as possible for each pixel point when generating the first depth map, that is, determine a large number of depth hypothesis layers, so that when the first depth map is used for depth hypothesis guidance in generating the second depth map, more accurate depth values can be determined with relatively few depth hypothesis layers.
In one possible implementation, A depth hypothesis layers may be determined when making depth hypotheses for each pixel point in the first-layer first sampled image, and B depth hypothesis layers may be determined when making depth hypotheses for each pixel point in the i-th-layer first sampled image, where 2 ≤ i ≤ m and A ≥ B. Illustratively, each pixel point in the first-layer first sampled image may correspond to 48 depth hypothesis layers, while each pixel point in the i-th-layer first sampled image may correspond to 8 depth hypothesis layers.
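A sketch of such a hypothesis schedule is given below; the counts follow the 48/8 example above, while the shrinking search radius around the previous layer's depth map is an assumption rather than the patent's prescription.

```python
import torch

def depth_hypotheses(level, d_min, d_max, prev_depth=None,
                     coarse_layers=48, fine_layers=8):
    if level == 0 or prev_depth is None:
        # (D,) uniform hypotheses shared by every pixel at the coarsest level.
        return torch.linspace(d_min, d_max, coarse_layers)
    # Per-pixel hypotheses centered on the upsampled previous depth map.
    radius = (d_max - d_min) / (4 ** level)             # narrower each level
    steps = torch.linspace(-1.0, 1.0, fine_layers)      # (D,)
    hyp = prev_depth.unsqueeze(-1) + steps * radius     # (..., D)
    return hyp.clamp(d_min, d_max)
```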
In an illustrative example, the computer device performs the differentiable projective transformation on the first-layer first optimized depth image feature and the first-layer second optimized depth image feature along the first camera viewpoint direction based on the first camera parameter and the second camera parameter, obtaining a first-layer first feature body and a first-layer second feature body. The dimensions of the first-layer first feature body may be 1/8·H × 1/8·W × C × 48; that is, it corresponds to the first-layer first sampled image at 1/8 of the first image's size, with 48 depth hypothesis layers per pixel point. The computer device then performs cost measurement on the first-layer first feature body and the first-layer second feature body to obtain a first-layer cost body, regularizes it, and regresses it along the depth direction to obtain a first-layer probability body, from which it determines the depth value of each pixel point in the first-layer first sampled image, obtaining the first depth map.
Further, when performing the differentiable projective transformation on the second-layer first optimized depth image feature and the second-layer second optimized depth image feature along the first camera viewpoint direction based on the first camera parameter and the second camera parameter, the computer device exploits the fact that the second-layer first sampled image, at 1/4 of the first image's size, is obtained by upsampling the first-layer first sampled image: when making depth hypotheses for each pixel point in the second-layer first sampled image, it can hypothesize depths directly according to the depth values of the corresponding pixel points in the first depth map. This yields a second-layer first feature body and a second-layer second feature body, where the dimensions of the second-layer first feature body may be 1/4·H × 1/4·W × C × 8, that is, 8 depth hypothesis layers per pixel point. The computer device then performs cost measurement on the two second-layer feature bodies to obtain a second-layer cost body, regularizes it and regresses it along the depth direction to obtain a second-layer probability body, determines the depth value of each pixel point in the second-layer first sampled image to obtain the second depth map, and proceeds in this way layer by layer until the m-th depth map is generated.
In one possible implementation, when the i-th depth map is used for depth hypothesis guidance of the (i+1)-th layer first sampled image, the depth values of the pixel points in the i-th depth map may deviate considerably from the true depth values, especially at image edges. To avoid accumulating such errors during depth hypothesis guidance and to improve the accuracy of the depth values in the target depth map, the computer device may perform depth optimization on the i-th depth map before using it for guidance, obtaining an optimized i-th depth map, and then use the optimized i-th depth map for depth hypothesis guidance of the (i+1)-th layer first sampled image.
In one possible implementation, before the i-th depth map is used to guide the (i+1)-th layer first sampled image, the computer device first upsamples the i-th depth map so that it has the same size as the (i+1)-th layer first sampled image, and extracts features from the upsampled i-th depth map to obtain a first image feature. Meanwhile, the computer device extracts features from the (i+1)-th layer first sampled image to obtain a second image feature and the image edge information of that sampled image. It then performs depth optimization on the upsampled i-th depth map based on the first image feature, the second image feature and the image edge information, obtaining the optimized upsampled i-th depth map.
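A simplified stand-in for this refinement step is sketched below as a small residual CNN over the concatenated upsampled depth, image features and edge information. The patent's own operator is the DCT-Net formulation described below, so this residual form only approximates the data flow; channel widths and the residual design are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthRefiner(nn.Module):
    def __init__(self, feat_channels, edge_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1 + feat_channels + edge_channels, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, coarse_depth, image_feat, edge_feat):
        """coarse_depth: (B, 1, h, w); image_feat/edge_feat: (B, *, H, W)."""
        up = F.interpolate(coarse_depth, size=image_feat.shape[-2:],
                           mode="bilinear", align_corners=True)
        residual = self.net(torch.cat([up, image_feat, edge_feat], dim=1))
        return up + residual   # optimized depth map for hypothesis guidance
```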
Schematically, as shown in fig. 9, the computer device may perform depth optimization on the coarse depth map 902 using the image edge information 901 of the first sampled image to obtain an optimized depth map 903, use the optimized depth map 903 for depth hypothesis guidance to finally obtain the depth map corresponding to the first image, and perform three-dimensional reconstruction from the multi-view depth maps to obtain the three-dimensional reconstruction result 904.
Alternatively, the computer device may perform depth optimization on the i-th depth map based on the first image feature, the second image feature and the image edge information through a DCT-Net network, which can be expressed as

$$\hat{D}_i = \mathrm{DCT}\left(\Phi_D, \Phi_I, \Phi_E\right)$$

where $\hat{D}_i$ is the optimized i-th depth map, $\Phi_D$ is the first image feature, $\Phi_I$ is the second image feature, and $\Phi_E$ is the image edge information. In the detailed form of the $\mathrm{DCT}(\cdot,\cdot,\cdot)$ operator, $\Theta$ is a set of parameters learnable by the neural network, $L$ is the Laplace filter, $F$ and $F^{-1}$ are the forward and inverse DCT respectively, $I$ is the identity matrix, $\oslash$ denotes pixel-level division, and $K$ is a set of 2D basis images.
In the above embodiment, after features are extracted from the first image and the second image through the feature pyramid network to obtain the m layers of first original depth image features and m layers of second original depth image features, the first image and the second image are resampled into m layers of first sampled images and m layers of second sampled images so that each layer of original depth image features can be optimized separately. Processing the sampled images layer by layer yields the m layers of first non-local neighborhood sensing features and m layers of second non-local neighborhood sensing features, and m depth maps are then obtained from the m layers of first optimized depth image features and m layers of second optimized depth image features, with the i-th depth map providing depth hypothesis guidance when the (i+1)-th depth map is generated. This improves depth estimation accuracy, reduces the computation required for depth estimation, and improves depth estimation efficiency.
In addition, when the i-th depth map is used for depth hypothesis guidance, in order to reduce the accumulated depth errors caused by the coarse depth map, features are extracted from the (i+1)-th layer first sampled image to obtain image edge information, the upsampled i-th depth map is optimized with that edge information to obtain an optimized depth map, and the optimized depth map is then used for depth hypothesis guidance, further improving the efficiency and accuracy of depth estimation.
Referring to fig. 10, a flowchart of a depth map optimization method according to an exemplary embodiment of the present application is shown.
First, the computer device extracts features from a first image 1001 and a second image 1002 through a feature pyramid network 1003 to obtain m layers of original depth image features. When depth estimation is performed with the first-layer original depth image features, the computer device resamples the first image 1001 and the second image 1002 to obtain a first-layer first sampled image and a first-layer second sampled image at 1/8 of the first image's size, and performs non-local neighborhood matching on them through a non-local neighborhood matching network 1004 to obtain non-local neighborhood sensing features 1005, with which it optimizes the first-layer original depth image features to obtain the first-layer optimized depth image features.
The optimized depth image features corresponding to the first-layer first sampled image and the first-layer second sampled image are then warped by the differentiable projective transformation along the first camera viewpoint direction, that is, depth hypotheses are made for each pixel point of the two sampled images, yielding a first-layer first feature body 1006 and a first-layer second feature body 1007. Three-dimensional non-local neighborhood matching is performed on the first-layer first feature body 1006 through a three-dimensional non-local neighborhood matching network 1008 to obtain a first-layer first optimized feature body; variance-based cost measurement 1009 is performed on the first-layer first optimized feature body and the first-layer second feature body to obtain a first-layer cost body; the first-layer cost body is regularized through a 3D CNN network 1010 and regressed along the depth direction to obtain a probability body; and the depth value of each pixel point is determined through the WTA policy, yielding a first depth map 1011. The computer device then performs depth optimization on the first depth map 1011 according to the first image 1001 and uses the optimized map for depth hypothesis guidance, proceeding layer by layer in the same way.
In addition, when the multi-view depth estimation network is optimized with sample images, after the sample depth map of each layer is obtained, the computer device may determine a per-layer depth loss based on that layer's sample depth map and the real depth map, and optimize the multi-view depth estimation network with those losses. For example, the computer device may determine a first depth loss based on the first depth map 1011 and a first real depth map 1012, and optimize the multi-view depth estimation network with that first depth loss.
In one possible implementation, the computer device applies the depth map optimization method provided by the embodiments of the present application to perform depth estimation on images from the Technical University of Denmark (DTU) dataset and the Tanks and Temples dataset, obtaining the corresponding depth maps. Table 1 compares the quantitative results of the depth maps generated by the embodiments of the present application with those of the related art.
TABLE 1
Please refer to fig. 11, which illustrates the depth map estimation results and three-dimensional reconstruction results obtained by applying the depth map optimization method provided by the present application to scene 11 and scene 13 of the DTU dataset.
Referring to fig. 12, a block diagram of a depth map optimizing apparatus according to an exemplary embodiment of the present application is shown, where the apparatus includes:
the feature extraction module 1201 is configured to perform feature extraction on a first image and a second image corresponding to the first image, so as to obtain a first original depth image feature corresponding to the first image and a second original depth image feature corresponding to the second image, where the first image and the second image are images obtained by capturing the same three-dimensional reconstruction object under different viewing angles;
the neighborhood matching module 1202 is configured to perform non-local neighborhood matching on the first image and the second image respectively to obtain a first non-local neighborhood sensing feature corresponding to each pixel in the first image and a second non-local neighborhood sensing feature corresponding to each pixel in the second image, where the non-local neighborhood sensing feature includes pixel coordinates of at least two non-local neighborhood pixels matched with a current pixel in a non-local neighborhood range and an influence weight, and the influence weight characterizes a feature influence degree of the non-local neighborhood pixel on the current pixel, and the non-local neighborhood range is greater than an adjacent pixel range of the current pixel;
The feature optimization module 1203 is configured to perform feature optimization on the first original depth image feature by using the first non-local neighborhood sensing feature, and perform feature optimization on the second original depth image feature by using the second non-local neighborhood sensing feature, so as to obtain a first optimized depth image feature and a second optimized depth image feature;
the first image generating module 1204 is configured to generate a depth map corresponding to the first image based on the first optimized depth image feature and the second optimized depth image feature.
Optionally, the neighborhood matching module 1202 includes:
the channel merging unit is used for carrying out color channel weighted merging on the first image and the second image respectively to obtain a first weighted merged image and a second weighted merged image, wherein the first weighted merged image and the second weighted merged image are single-channel images;
the image sampling unit is used for downsampling the first weighted combined image and the second weighted combined image based on a sampling interval to obtain a first sampling image and a second sampling image;
the first neighborhood matching unit is used for determining a first non-local neighborhood range based on the image dimension corresponding to the first sampling image; based on the first non-local neighborhood range, performing non-local neighborhood matching on the first sampling image to obtain the first non-local neighborhood sensing characteristic;
The second neighborhood matching unit is used for determining a second non-local neighborhood range based on the image dimension corresponding to the second sampling image; and carrying out non-local neighborhood matching on the second sampling image based on the second non-local neighborhood range to obtain the second non-local neighborhood sensing characteristic.
Optionally, the first neighborhood matching unit is configured to:
based on the first non-local neighborhood range, obtaining, through a first coordinate determination network, the pixel point coordinates of the non-local neighborhood pixel points corresponding to each pixel point in the first image;
obtaining influence weights of non-local neighborhood pixel points corresponding to each pixel point in the first image through a first weight determining network;
the second neighborhood matching unit is configured to:
based on the second non-local neighborhood range, obtaining, through the first coordinate determination network, the pixel point coordinates of the non-local neighborhood pixel points corresponding to each pixel point in the second image;
and obtaining the influence weight of the non-local neighborhood pixel point corresponding to each pixel point in the second image through the first weight determining network.
Optionally, the feature optimization module 1203 is configured to:
performing image dimension matching on the first non-local neighborhood sensing feature based on the image dimension of the first original depth image feature to obtain a first non-local neighborhood sensing feature with the dimension matched, wherein the image dimension of the first non-local neighborhood sensing feature with the dimension matched is the same as the image dimension of the first original depth image feature;
Performing image dimension matching on the second non-local neighborhood sensing feature based on the image dimension of the second original depth image feature to obtain a second non-local neighborhood sensing feature with the dimension matched, wherein the image dimension of the second non-local neighborhood sensing feature with the dimension matched is the same as the image dimension of the second original depth image feature;
performing feature optimization on the first original depth image feature by using the first non-local neighborhood perception feature after dimension matching to obtain the first optimized depth image feature;
and performing feature optimization on the second original depth image feature by using the second non-local neighborhood sensing feature after dimension matching to obtain the second optimized depth image feature.
Optionally, the first image generating module 1204 includes:
the parameter acquisition unit is used for acquiring a first camera parameter corresponding to the first image, a first camera viewpoint and a second camera parameter corresponding to the second image;
the differentiable projective transformation unit is used for performing a differentiable projective transformation on the first optimized depth image feature and the second optimized depth image feature along the first camera viewpoint direction based on the first camera parameter and the second camera parameter to obtain a first feature body corresponding to the first image and a second feature body corresponding to the second image, wherein a feature body is a frustum-shaped three-dimensional space along the first camera viewpoint direction and represents the image features of each pixel point at different depth hypothesis layers;
the cost measurement unit is used for performing cost measurement on the first feature body and the second feature body to obtain a cost body corresponding to the first image, wherein the cost body is a three-dimensional structure formed by connecting cost maps along the depth direction, and each pixel point on a cost map represents the matching cost, at the corresponding depth value, between the corresponding pixel point on the first image and the corresponding pixel point on the second image;
the cost body processing unit is used for regularizing the cost body and regressing the cost body along the depth direction to obtain a probability body, wherein the probability body represents the depth probability of each pixel point in each depth hypothesis layer;
and the depth value determining unit is used for determining the depth value corresponding to each pixel point based on the probability body to obtain a depth map corresponding to the first image.
Optionally, the cost measurement unit is configured to:
carrying out non-local neighborhood matching on the first feature body to obtain a third non-local neighborhood sensing feature corresponding to each voxel in the first feature body, wherein the third non-local neighborhood sensing feature comprises voxel coordinates of at least two non-local neighborhood voxels matched with the current voxel and influence weights;
Performing feature optimization on the first feature body by using the third non-local neighborhood sensing feature to obtain a first optimized feature body;
and carrying out cost measurement on the first optimized feature body and the second feature body to obtain the cost body corresponding to the first image.
Optionally, the cost measurement unit is further configured to:
determining a third non-local neighborhood range corresponding to the first feature body;

in the third non-local neighborhood range, obtaining, through a second coordinate determination network, the voxel coordinates of the non-local neighborhood voxels corresponding to each voxel in the first feature body;

and obtaining, through a second weight determination network, the influence weights of the non-local neighborhood voxels corresponding to each voxel in the first feature body.
Optionally, the depth value determining unit is configured to:
based on the probability body, determining a depth hypothesis layer with the maximum depth probability in each depth hypothesis layer corresponding to each pixel point as a target depth hypothesis layer; determining a depth value corresponding to the target depth hypothesis layer as a target depth value, and obtaining a depth map corresponding to the first image; or,
and weighting the depth probability and the depth value of each depth hypothesis layer corresponding to each pixel point based on the probability body, and determining the target depth value to obtain a depth map corresponding to the first image.
Optionally, the apparatus further includes:
the sample feature extraction module is used for extracting features of a first sample image and a second sample image corresponding to the first sample image through a feature pyramid network in the multi-view depth estimation network to obtain first sample original depth image features corresponding to the first sample image and second sample original depth image features corresponding to the second sample image, wherein the first sample image and the second sample image are images obtained by shooting the same sample three-dimensional reconstruction object under different view angles;
the sample neighborhood matching module is used for respectively carrying out non-local neighborhood matching on the first sample image and the second sample image through a non-local neighborhood sensing network in the multi-view depth estimation network to obtain first sample non-local neighborhood sensing characteristics corresponding to each sample pixel point in the first sample image and second sample non-local neighborhood sensing characteristics corresponding to each sample pixel point in the second sample image;
the sample feature optimization module is used for performing feature optimization on the first sample original depth image feature by using the first sample non-local neighborhood sensing feature and performing feature optimization on the second sample original depth image feature by using the second sample non-local neighborhood sensing feature through a feature optimization network in the multi-view depth estimation network to obtain a first sample optimized depth image feature and a second sample optimized depth image feature;
The sample image generation module is used for generating a sample depth image corresponding to the first sample image based on the first sample optimized depth image characteristic and the second sample optimized depth image characteristic through a depth image generation network in the multi-view depth estimation network;
the loss determination module is used for determining target depth loss based on a real depth map corresponding to the first sample image and the sample depth map;
and the network optimization module is used for optimizing the multi-view depth estimation network with the target depth loss.
Optionally, the loss determination module is configured to:
determining a first depth loss through cross entropy based on the depth distribution corresponding to the real depth map and the depth distribution corresponding to the sample depth map;
determining a second depth loss based on the depth distribution corresponding to the real depth map and the depth distribution corresponding to the sample depth map subjected to discrete cosine transform;
the network optimization module is used for:
the multi-view depth estimation network is optimized based on the first depth penalty and the second depth penalty.
Optionally, the feature extraction module 1201 is configured to:
Extracting features of the first image and the second image through a feature pyramid network to obtain m layers of first original depth image features corresponding to the first image and m layers of second original depth image features corresponding to the second image;
the neighborhood matching module 1202 is configured to:
respectively carrying out non-local neighborhood matching on the first image and the second image to obtain m layers of first non-local neighborhood sensing characteristics corresponding to each pixel point in the first image and m layers of second non-local neighborhood sensing characteristics corresponding to each pixel point in the second image;
the feature optimization module 1203 is configured to:
performing feature optimization on the first original depth image feature of the ith layer by using the first non-local neighborhood sensing feature of the ith layer, and performing feature optimization on the second original depth image feature of the ith layer by using the second non-local neighborhood sensing feature of the ith layer to obtain the first optimized depth image feature of the ith layer and the second optimized depth image feature of the ith layer, wherein i is more than or equal to 1 and less than or equal to m;
the first image generating module 1204 is configured to:
and generating an mth depth map corresponding to the first image based on the mth layer first optimized depth image feature and the mth layer second optimized depth image feature.
Optionally, the apparatus further includes:
the presumption guiding module is used for carrying out depth presumption guiding by utilizing the ith depth image in the process of carrying out differential projection transformation on the ith layer +1 first optimized depth image feature and the ith layer +1 second optimized depth image feature along the direction of the first camera viewpoint based on the first camera parameter and the second camera parameter to obtain the ith layer +1 first feature corresponding to the first image and the ith layer +1 second feature corresponding to the second image;
and the second image generation module is used for generating an i+1 depth map corresponding to the first image based on the i+1 th layer first feature and the i+1 th layer second feature.
Optionally, the hypothesis guidance module is configured to:
performing depth optimization on the ith depth map by using the first image to obtain an optimized ith depth map;
and performing depth hypothesis guidance by using the optimized ith depth map to obtain the first feature body of the (i+1) th layer and the second feature body of the (i+1) th layer.
Optionally, the hypothesis guidance module is further configured to:
upsampling the ith depth map, and extracting features of the upsampled ith depth map to obtain a first image feature of the ith depth map;
Extracting features of the first image to obtain second image features of the first image and image edge information;
and performing depth optimization on the ith depth map based on the first image feature, the second image feature and the image edge information to obtain the optimized ith depth map.
In summary, in the embodiments of the present application, after feature extraction is performed on the first image and its corresponding second image to obtain the first original depth image feature and the second original depth image feature, the depth map corresponding to the first image is not generated directly from these original features. Instead, non-local neighborhood matching is first performed on the first image and the second image to obtain the first non-local neighborhood sensing feature and the second non-local neighborhood sensing feature, and these are used to perform feature optimization on the first original depth image feature and the second original depth image feature respectively, yielding the first optimized depth image feature and the second optimized depth image feature. The depth map corresponding to the first image is then generated based on the optimized features, which improves the efficiency of depth map optimization for three-dimensional reconstruction and increases the accuracy of multi-view depth estimation.
It should be noted that the division into the functional modules described above is merely illustrative for the apparatus provided in the foregoing embodiment; in practical applications, the functions may be allocated to different functional modules as required, that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for the detailed implementation process, refer to the method embodiments, which is not repeated herein.
Referring to fig. 13, a schematic structural diagram of a computer device according to an exemplary embodiment of the present application is shown. The computer device 1300 includes a central processing unit (Central Processing Unit, CPU) 1301, a system memory 1304 including a random access memory 1302 and a read-only memory 1303, and a system bus 1305 connecting the system memory 1304 and the central processing unit 1301. The computer device 1300 also includes a basic input/output system (I/O system) 1306 to facilitate the transfer of information between the devices within the computer, and a mass storage device 1307 for storing an operating system 1313, application programs 1314, and other program modules 1315.
The basic input/output system 1306 includes a display 1308 for displaying information and an input device 1309, such as a mouse or keyboard, through which a user inputs information. The display 1308 and the input device 1309 are both connected to the central processing unit 1301 through an input/output controller 1310 connected to the system bus 1305. The input/output controller 1310 may also receive and process input from a keyboard, mouse, electronic stylus, or other devices, and similarly provides output to a display screen, a printer, or another type of output device.
The mass storage device 1307 is connected to the central processing unit 1301 through a mass storage controller (not shown) connected to the system bus 1305. The mass storage device 1307 and its associated computer-readable media provide non-volatile storage for the computer device 1300. That is, the mass storage device 1307 may include a computer-readable medium (not shown), such as a hard disk or an optical drive.
The computer readable medium may include computer storage media and communication media without loss of generality. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media include random access memory (RAM, Random Access Memory), read-only memory (ROM, Read Only Memory), flash memory or other solid-state memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the above. The system memory 1304 and the mass storage device 1307 described above may be collectively referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1301, the one or more programs containing instructions for implementing the methods described above, the central processing unit 1301 executing the one or more programs to implement the methods provided by the various method embodiments described above.
According to various embodiments of the present application, the computer device 1300 may also run while connected through a network, such as the Internet, to a remote computer. That is, the computer device 1300 may be connected to the network 1311 through a network interface unit 1312 coupled to the system bus 1305, or the network interface unit 1312 may be used to connect to other types of networks or remote computer systems (not shown).
The embodiment of the application also provides a computer readable storage medium, wherein at least one instruction is stored in the readable storage medium, and the at least one instruction is loaded and executed by a processor to realize the depth map optimization method described in the above embodiment.
Alternatively, the computer-readable storage medium may include: ROM, RAM, a solid state drive (SSD, Solid State Drive), an optical disk, or the like. The RAM may include resistive random access memory (ReRAM, Resistance Random Access Memory) and dynamic random access memory (DRAM, Dynamic Random Access Memory).
Embodiments of the present application provide a computer program product comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the depth map optimization method described in the above embodiment.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the present application is not intended to limit the application, but is intended to cover all modifications, equivalents, alternatives, and improvements falling within the spirit and principles of the application.

Claims (18)

1. A depth map optimization method, the method comprising:
extracting features of a first image and a second image corresponding to the first image to obtain a first original depth image feature corresponding to the first image and a second original depth image feature corresponding to the second image, wherein the first image and the second image are images obtained by shooting the same three-dimensional reconstruction object under different view angles;
respectively performing non-local neighborhood matching on the first image and the second image to obtain a first non-local neighborhood sensing feature corresponding to each pixel point in the first image and a second non-local neighborhood sensing feature corresponding to each pixel point in the second image, wherein the non-local neighborhood sensing features comprise pixel point coordinates and influence weights of at least two non-local neighborhood pixel points matched with the current pixel point in a non-local neighborhood range, the influence weights represent the degree of feature influence of the non-local neighborhood pixel points on the current pixel point, and the non-local neighborhood range is larger than the adjacent pixel point range of the current pixel point;
performing feature optimization on the first original depth image feature by using the first non-local neighborhood sensing feature, and performing feature optimization on the second original depth image feature by using the second non-local neighborhood sensing feature to obtain a first optimized depth image feature and a second optimized depth image feature;
and generating a depth map corresponding to the first image based on the first optimized depth image feature and the second optimized depth image feature.
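A minimal sketch of the feature-optimization step of claim 1 follows: given, for each pixel point, the coordinates of K matched non-local neighborhood pixel points and their influence weights, the original depth image feature is refined by a weighted aggregation of the neighbors. How the coordinates and weights are predicted is left abstract here, and the residual connection is an assumption of this example.

```python
# A minimal sketch (assumptions: PyTorch; K neighbors per pixel point).
import torch

def optimize_features(feat, coords, weights):
    """feat:    (B, C, H, W) original depth image features
       coords:  (B, K, H, W, 2) long tensor of (y, x) of matched
                non-local neighborhood pixel points
       weights: (B, K, H, W) influence weight of each matched pixel point"""
    b, c, h, w = feat.shape
    k = coords.shape[1]
    flat = feat.view(b, c, h * w)                      # flatten spatial dims
    idx = (coords[..., 0] * w + coords[..., 1]).view(b, 1, k * h * w)
    gathered = torch.gather(flat, 2, idx.expand(b, c, k * h * w))
    gathered = gathered.view(b, c, k, h, w)
    # Weighted aggregation of the K non-local neighbors, then a residual
    # connection so the original feature is preserved.
    agg = (gathered * weights.unsqueeze(1)).sum(dim=2)
    return feat + agg
```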
2. The method of claim 1, wherein the performing non-local neighborhood matching on the first image and the second image to obtain a first non-local neighborhood sensing feature corresponding to each pixel point in the first image and a second non-local neighborhood sensing feature corresponding to each pixel point in the second image includes:
Respectively carrying out color channel weighted combination on the first image and the second image to obtain a first weighted combined image and a second weighted combined image, wherein the first weighted combined image and the second weighted combined image are single-channel images;
downsampling the first weighted combined image and the second weighted combined image based on a sampling interval to obtain a first sampled image and a second sampled image;
determining a first non-local neighborhood range based on an image dimension corresponding to the first sampled image; based on the first non-local neighborhood range, performing non-local neighborhood matching on the first sampled image to obtain the first non-local neighborhood sensing feature;
determining a second non-local neighborhood range based on the image dimension corresponding to the second sampled image; and performing non-local neighborhood matching on the second sampled image based on the second non-local neighborhood range to obtain the second non-local neighborhood sensing feature.
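The pre-processing of claim 2 might look like the following sketch: a weighted combination of the color channels yields a single-channel image, which is then downsampled at a sampling interval; the non-local neighborhood range is derived from the image dimension. The specific channel weights, interval, and range rule here are illustrative assumptions.

```python
# A minimal sketch (assumptions: PyTorch; `img` is (B, 3, H, W); the
# luminance-style weights and the interval of 4 are arbitrary choices).
import torch

def to_sampled_image(img, w_rgb=(0.299, 0.587, 0.114), interval=4):
    # Weighted combination of the color channels -> single-channel image.
    w = torch.tensor(w_rgb, device=img.device).view(1, 3, 1, 1)
    single = (img * w).sum(dim=1, keepdim=True)
    # Downsample at the given sampling interval.
    return single[:, :, ::interval, ::interval]

def non_local_range(sampled):
    # Neighborhood range grows with the image dimension (one choice).
    return max(sampled.shape[-2:]) // 4
```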
3. The method according to claim 2, wherein the performing non-local neighborhood matching on the first sampled image based on the first non-local neighborhood range to obtain the first non-local neighborhood sensing feature includes:
obtaining, through a first coordinate determination network and based on the first non-local neighborhood range, the pixel point coordinates of the non-local neighborhood pixel points corresponding to each pixel point in the first image;
obtaining, through a first weight determination network, the influence weights of the non-local neighborhood pixel points corresponding to each pixel point in the first image;
and the performing non-local neighborhood matching on the second sampled image based on the second non-local neighborhood range to obtain the second non-local neighborhood sensing feature includes:
obtaining, through the first coordinate determination network and based on the second non-local neighborhood range, the pixel point coordinates of the non-local neighborhood pixel points corresponding to each pixel point in the second image;
and obtaining, through the first weight determination network, the influence weights of the non-local neighborhood pixel points corresponding to each pixel point in the second image.
4. The method of claim 1, wherein the performing feature optimization on the first original depth image feature using the first non-local neighborhood sensing feature, and performing feature optimization on the second original depth image feature using the second non-local neighborhood sensing feature, to obtain a first optimized depth image feature and a second optimized depth image feature comprises:
performing image dimension matching on the first non-local neighborhood sensing feature based on the image dimension of the first original depth image feature to obtain a dimension-matched first non-local neighborhood sensing feature, wherein the image dimension of the dimension-matched first non-local neighborhood sensing feature is the same as the image dimension of the first original depth image feature;
performing image dimension matching on the second non-local neighborhood sensing feature based on the image dimension of the second original depth image feature to obtain a dimension-matched second non-local neighborhood sensing feature, wherein the image dimension of the dimension-matched second non-local neighborhood sensing feature is the same as the image dimension of the second original depth image feature;
performing feature optimization on the first original depth image feature by using the dimension-matched first non-local neighborhood sensing feature to obtain the first optimized depth image feature;
and performing feature optimization on the second original depth image feature by using the dimension-matched second non-local neighborhood sensing feature to obtain the second optimized depth image feature.
5. The method of claim 1, wherein the generating the depth map corresponding to the first image based on the first optimized depth image feature and the second optimized depth image feature comprises:
Acquiring a first camera parameter and a first camera viewpoint corresponding to the first image and a second camera parameter corresponding to the second image;
based on the first camera parameter and the second camera parameter, performing differentiable projection transformation on the first optimized depth image feature and the second optimized depth image feature along the first camera viewpoint direction to obtain a first feature volume corresponding to the first image and a second feature volume corresponding to the second image, wherein a feature volume is a frustum-shaped three-dimensional space along the first camera viewpoint direction and represents the image features of each pixel point at different depth hypothesis layers;
performing cost measurement on the first feature volume and the second feature volume to obtain a cost volume corresponding to the first image, wherein the cost volume is a three-dimensional structure formed by concatenating cost maps along the depth direction, and each pixel point on a cost map represents the matching cost, at that depth value, between the corresponding pixel point on the first image and the corresponding pixel point on the second image;
regularizing the cost volume, and regressing along the depth direction to obtain a probability volume, wherein the probability volume represents the depth probability of each pixel point at each depth hypothesis layer;
and determining the depth value corresponding to each pixel point based on the probability volume to obtain the depth map corresponding to the first image.
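For orientation, the sketch below strings together the steps of claim 5 in simplified form: feature volumes are built by warping the source features into the reference frustum at each depth hypothesis, a matching cost is measured, and the regularized cost volume is regressed along the depth direction into a probability volume. The warping function `warp_to_ref` and the `regularize` callable are placeholders standing in for the differentiable projection transformation and the regularization network; the squared-difference cost metric is one common choice, not necessarily the one used in the application.

```python
# A simplified plane sweep (assumptions: PyTorch; two views).
import torch
import torch.nn.functional as F

def build_feature_volumes(feat_ref, feat_src, warp_to_ref, depths):
    # First feature volume: reference features repeated per depth hypothesis.
    vol_ref = feat_ref.unsqueeze(2).expand(-1, -1, len(depths), -1, -1)
    # Second feature volume: source features warped into the reference
    # view's frustum at every depth hypothesis layer.
    vol_src = torch.stack([warp_to_ref(feat_src, d) for d in depths], dim=2)
    return vol_ref, vol_src  # each (B, C, D, H, W)

def cost_and_probability(vol_ref, vol_src, regularize):
    # Cost measurement: squared feature difference per hypothesis layer.
    cost = (vol_ref - vol_src).pow(2).mean(dim=1)   # cost volume (B, D, H, W)
    # `regularize` is any callable (B, D, H, W) -> (B, D, H, W), e.g. a
    # small 3D CNN; regression along depth yields the probability volume.
    return F.softmax(-regularize(cost), dim=1)
```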
6. The method of claim 5, wherein the performing cost measurement on the first feature volume and the second feature volume to obtain the cost volume corresponding to the first image includes:
performing non-local neighborhood matching on the first feature volume to obtain a third non-local neighborhood sensing feature corresponding to each voxel in the first feature volume, wherein the third non-local neighborhood sensing feature comprises voxel coordinates and influence weights of at least two non-local neighborhood voxels matched with the current voxel;
performing feature optimization on the first feature volume by using the third non-local neighborhood sensing feature to obtain a first optimized feature volume;
and performing cost measurement on the first optimized feature volume and the second feature volume to obtain the cost volume corresponding to the first image.
7. The method of claim 6, wherein the performing non-local neighborhood matching on the first feature volume to obtain the third non-local neighborhood sensing feature corresponding to each voxel in the first feature volume comprises:
determining a third non-local neighborhood range corresponding to the first feature volume;
obtaining, through a second coordinate determination network and within the third non-local neighborhood range, the voxel coordinates of the non-local neighborhood voxels corresponding to each voxel in the first feature volume;
and obtaining, through a second weight determination network, the influence weights of the non-local neighborhood voxels corresponding to each voxel in the first feature volume.
8. The method of claim 5, wherein the determining the depth value corresponding to each pixel point based on the probability volume to obtain the depth map corresponding to the first image comprises:
based on the probability volume, determining, among the depth hypothesis layers corresponding to each pixel point, the depth hypothesis layer with the maximum depth probability as a target depth hypothesis layer, and determining the depth value corresponding to the target depth hypothesis layer as the target depth value, to obtain the depth map corresponding to the first image; or,
based on the probability volume, weighting the depth probabilities and depth values of the depth hypothesis layers corresponding to each pixel point to determine the target depth value, to obtain the depth map corresponding to the first image.
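The two read-outs of claim 8 correspond to a winner-take-all selection and a probability-weighted expectation over the depth hypothesis layers, e.g.:

```python
# A minimal sketch (assumptions: PyTorch; `prob` is the probability
# volume (B, D, H, W); `depths` holds the D hypothesis depth values).
import torch

def depth_winner_take_all(prob, depths):
    idx = prob.argmax(dim=1)       # hypothesis layer with maximum probability
    return depths[idx]             # target depth value per pixel point, (B, H, W)

def depth_expectation(prob, depths):
    # Weight every hypothesis depth by its probability (soft argmax).
    return (prob * depths.view(1, -1, 1, 1)).sum(dim=1)
```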
9. The method according to any one of claims 1 to 8, further comprising:
Extracting features of a first sample image and a second sample image corresponding to the first sample image through a feature pyramid network in a multi-view depth estimation network to obtain a first sample original depth image feature corresponding to the first sample image and a second sample original depth image feature corresponding to the second sample image, wherein the first sample image and the second sample image are images obtained by shooting the same sample three-dimensional reconstruction object under different view angles;
respectively performing non-local neighborhood matching on the first sample image and the second sample image through a non-local neighborhood sensing network in the multi-view depth estimation network to obtain first sample non-local neighborhood sensing features corresponding to each sample pixel point in the first sample image and second sample non-local neighborhood sensing features corresponding to each sample pixel point in the second sample image;
performing feature optimization on the first sample original depth image features by using the first sample non-local neighborhood sensing features and performing feature optimization on the second sample original depth image features by using the second sample non-local neighborhood sensing features through a feature optimization network in the multi-view depth estimation network to obtain first sample optimized depth image features and second sample optimized depth image features;
Generating a sample depth map corresponding to the first sample image based on the first sample optimized depth image feature and the second sample optimized depth image feature through a depth map generation network in the multi-view depth estimation network;
determining a target depth loss based on a real depth map corresponding to the first sample image and the sample depth map;
and optimizing the multi-view depth estimation network with the target depth loss.
10. The method of claim 9, wherein the determining the target depth loss based on the true depth map corresponding to the first sample image and the sample depth map comprises:
determining a first depth loss through cross entropy based on the depth distribution corresponding to the real depth map and the depth distribution corresponding to the sample depth map;
determining a second depth loss based on the depth distribution corresponding to the real depth map and the depth distribution corresponding to the sample depth map subjected to discrete cosine transform;
the optimizing the multi-view depth estimation network with the target depth penalty includes:
the multi-view depth estimation network is optimized based on the first depth penalty and the second depth penalty.
11. The method according to any one of claims 1 to 8, wherein the feature extraction of the first image and the second image corresponding to the first image to obtain a first original depth image feature corresponding to the first image and a second original depth image feature corresponding to the second image includes:
extracting features of the first image and the second image through a feature pyramid network to obtain m layers of first original depth image features corresponding to the first image and m layers of second original depth image features corresponding to the second image;
the respectively performing non-local neighborhood matching on the first image and the second image to obtain a first non-local neighborhood sensing feature corresponding to each pixel point in the first image and a second non-local neighborhood sensing feature corresponding to each pixel point in the second image includes:
respectively performing non-local neighborhood matching on the first image and the second image to obtain m layers of first non-local neighborhood sensing features corresponding to each pixel point in the first image and m layers of second non-local neighborhood sensing features corresponding to each pixel point in the second image;
The performing feature optimization on the first original depth image feature by using the first non-local neighborhood sensing feature, and performing feature optimization on the second original depth image feature by using the second non-local neighborhood sensing feature to obtain a first optimized depth image feature and a second optimized depth image feature, including:
performing feature optimization on the first original depth image feature of the ith layer by using the first non-local neighborhood sensing feature of the ith layer, and performing feature optimization on the second original depth image feature of the ith layer by using the second non-local neighborhood sensing feature of the ith layer, to obtain the first optimized depth image feature of the ith layer and the second optimized depth image feature of the ith layer, wherein 1 ≤ i ≤ m;
the generating a depth map corresponding to the first image based on the first optimized depth image feature and the second optimized depth image feature includes:
and generating an mth depth map corresponding to the first image based on the mth layer first optimized depth image feature and the mth layer second optimized depth image feature.
12. The method of claim 11, wherein the method further comprises:
performing depth hypothesis guidance using the ith depth map in the process of performing differentiable projection transformation on the (i+1)th layer first optimized depth image feature and the (i+1)th layer second optimized depth image feature along the first camera viewpoint direction based on the first camera parameter and the second camera parameter, to obtain an (i+1)th layer first feature volume corresponding to the first image and an (i+1)th layer second feature volume corresponding to the second image;
and generating an (i+1)th depth map corresponding to the first image based on the (i+1)th layer first feature volume and the (i+1)th layer second feature volume.
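One way to realize the depth hypothesis guidance of claim 12 is to centre a narrower band of depth hypotheses on the upsampled ith depth map at the next pyramid level, as sketched below; the relative band width `radius`, the hypothesis count, and the 2x upsampling are assumptions of this example.

```python
# A minimal sketch (assumptions: PyTorch; `depth_i` is (B, 1, H, W)).
import torch
import torch.nn.functional as F

def guided_hypotheses(depth_i, num_hyp=8, radius=0.1):
    # Upsample the ith depth map to the (i+1)th level resolution and
    # centre a narrow band of depth hypotheses on it, per pixel point.
    up = F.interpolate(depth_i, scale_factor=2, mode='bilinear',
                       align_corners=False)
    offsets = torch.linspace(-radius, radius, num_hyp,
                             device=depth_i.device).view(1, -1, 1, 1)
    return up * (1.0 + offsets)   # (B, num_hyp, 2H, 2W) guided hypotheses
```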
13. The method of claim 12, wherein the performing depth hypothesis guidance using the ith depth map to obtain the (i+1)th layer first feature volume corresponding to the first image and the (i+1)th layer second feature volume corresponding to the second image includes:
performing depth optimization on the ith depth map by using the first image to obtain an optimized ith depth map;
and performing depth hypothesis guidance using the optimized ith depth map to obtain the (i+1)th layer first feature volume and the (i+1)th layer second feature volume.
14. The method of claim 13, wherein performing depth optimization on the i-th depth map using the first image to obtain an optimized i-th depth map comprises:
upsampling the ith depth map, and extracting features of the upsampled ith depth map to obtain a first image feature of the ith depth map;
extracting features of the first image to obtain second image features of the first image and image edge information;
And performing depth optimization on the ith depth map based on the first image feature, the second image feature and the image edge information to obtain the optimized ith depth map.
15. A depth map optimizing apparatus, the apparatus comprising:
a feature extraction module, configured to perform feature extraction on a first image and a second image corresponding to the first image to obtain a first original depth image feature corresponding to the first image and a second original depth image feature corresponding to the second image, wherein the first image and the second image are images obtained by shooting the same three-dimensional reconstruction object under different view angles;
a neighborhood matching module, configured to respectively perform non-local neighborhood matching on the first image and the second image to obtain a first non-local neighborhood sensing feature corresponding to each pixel point in the first image and a second non-local neighborhood sensing feature corresponding to each pixel point in the second image, wherein the non-local neighborhood sensing features comprise pixel point coordinates and influence weights of at least two non-local neighborhood pixel points matched with the current pixel point in a non-local neighborhood range, the influence weights represent the degree of feature influence of the non-local neighborhood pixel points on the current pixel point, and the non-local neighborhood range is larger than the adjacent pixel point range of the current pixel point;
a feature optimization module, configured to perform feature optimization on the first original depth image feature by using the first non-local neighborhood sensing feature, and perform feature optimization on the second original depth image feature by using the second non-local neighborhood sensing feature, to obtain a first optimized depth image feature and a second optimized depth image feature;
and a first image generation module, configured to generate a depth map corresponding to the first image based on the first optimized depth image feature and the second optimized depth image feature.
16. An electronic device comprising a processor and a memory; the memory stores at least one instruction for execution by the processor to implement the depth map optimization method of any one of claims 1 to 14.
17. A computer readable storage medium storing at least one instruction for execution by a processor to implement the depth map optimization method of any one of claims 1 to 14.
18. A computer program product, the computer program product comprising computer instructions stored in a computer readable storage medium; a processor of an electronic device reads the computer instructions from the computer readable storage medium, the processor executing the computer instructions, causing the electronic device to implement the depth map optimization method of any one of claims 1 to 14.