
CN118521601A - Indoor scene 3D layout estimation method and device based on corner depth prediction - Google Patents

Indoor scene 3D layout estimation method and device based on corner depth prediction

Info

Publication number
CN118521601A
CN118521601A
Authority
CN
China
Prior art keywords
corner
layout
target
depth
plane
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410971204.3A
Other languages
Chinese (zh)
Inventor
张伟东
李丽
刘颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications filed Critical Xian University of Posts and Telecommunications
Priority to CN202410971204.3A
Publication of CN118521601A
Legal status: Pending (Current)


Classifications

    • G06T 7/11: Image analysis; Segmentation, edge detection; Region-based segmentation
    • G06N 3/0455: Neural networks; Combinations of networks; Auto-encoder networks; Encoder-decoder networks
    • G06N 3/09: Learning methods; Supervised learning
    • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 5/30: Image enhancement or restoration using local operators; Erosion or dilatation, e.g. thinning
    • G06T 2207/20081: Indexing scheme for image analysis or image enhancement; Training; Learning
    • G06T 2207/20084: Indexing scheme for image analysis or image enhancement; Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the disclosure provide an indoor scene 3D layout estimation method and device based on corner depth prediction, applied to the technical field of three-dimensional layout of indoor scenes. The method comprises: acquiring a target indoor scene image; inputting the target indoor scene image into a deep learning model and outputting a predicted corner heatmap, corner depth values and embedding vectors; generating corner coordinates from the corner heatmap; clustering the embedding vectors to obtain a rough segmentation map, and applying dilation and erosion to each planar region of the rough segmentation map to obtain target areas; determining target corners from the corner coordinates and the target areas; performing plane fitting on the depth values of the target corners to obtain a target depth map for each plane; and performing depth map intersection on the target depth maps to obtain a layout depth map. A more accurate indoor layout depth map is thus obtained, unaffected by the position changes introduced by cropping.

Description

Indoor scene 3D layout estimation method and device based on corner depth prediction
Technical Field
The disclosure relates to the technical field of three-dimensional layout of indoor scenes, and in particular to an indoor scene 3D layout estimation method and device based on corner depth prediction.
Background
Three-dimensional layout estimation of indoor scenes is an important research topic in computer vision, with significance for applications such as robot navigation, scene reconstruction and virtual reality. The indoor layout estimation task is to locate and identify the layout components (floor, ceiling and walls) from one or more views and to infer their three-dimensional geometry.
In the indoor layout estimation task, traditional methods rely mainly on low-level features such as local colour and image edge information, supplemented by vanishing-point detection; outlier features are handled in a post-processing step and candidate layouts are finally generated. For example, candidate layouts can be generated by detecting vanishing points and sampling rays from the vanishing points to the image edges; an evaluation function trained on low-level image features then scores the candidate layouts, and the highest-scoring layout is selected as the best one. Many subsequent works followed this pipeline, improving the quality of the candidate layouts, optimizing the efficiency of ray sampling, and ensuring the soundness of the chosen evaluation function. However, because the image features used are low-level local features lacking semantic information and global expressiveness, the accuracy of these methods is unsatisfactory and their computational complexity is high.
Recent deep-learning-based indoor layout estimation methods fall mainly into two categories: semantic segmentation of ceiling, floor and walls, and keypoint detection. For example, a fully convolutional network (FCN) can be used to predict the edge information of an indoor image together with a new vanishing-line sampling method: the FCN predicts the segmentation masks of the ceiling, floor and walls, logistic regression gives the initial vanishing-line positions, and greedy iteration then keeps updating the vanishing-point positions and vanishing-line angles until the layout that best fits the predicted edge map is obtained. Another example is a coarse-to-fine indoor layout estimation method with two stages: the first stage predicts an edge map with an FCN, and the second stage refines the layout through geometric constraints including layout-contour straightness, surface smoothness and layout-detail refinement. A further example treats indoor layout estimation as a keypoint localization problem and proposes an encoder-decoder structure in which the network predicts the room type and the keypoint locations separately; the predicted room type is then used to screen the key corner heatmaps and generate the final indoor layout. Yet another example estimates the room layout from three segmentation hypotheses: a segmentation is computed for each hypothesis, the layout defined by 2D keypoints is predicted, and the layout generated from the 2D keypoints is compared with the corresponding image segmentation to select the best match. These methods can only estimate the layout of cuboid (box-shaped) rooms. Recently, 3D layout estimation of indoor scenes with deep neural networks has become increasingly active. For example, one method learns to infer the depth maps of the dominant planes in a scene by predicting pixel-level surface parameters and generates the layout by intersecting the depth maps; another proposes a multi-branch network that simultaneously detects planes, the vertical lines between adjacent walls, and the 3D parameters of each plane, and then recovers the room layout by geometric reasoning.
In summary, benefiting from deep learning, indoor layout estimation has made great progress, and most methods recover the layout mainly by having the network learn the depth parameters or plane equations of the principal planes. Although existing methods improve performance considerably, the depth parameters or plane equation of a plane in an image are actually determined by its appearance and its position in the image, and are therefore not translation invariant. An ordinary convolutional network struggles to fully understand the specific position of each element in an image, which makes the indoor layout estimation task harder. For example, if an original image is cropped twice, a point P in the original occupies clearly different positions in the two differently cropped images, yet its depth value is the same; this leads to completely different output parameters. Translation invariance, however, is an important property in image processing tasks, since the positions of objects in an image may change.
Therefore, it is necessary to design a way for the neural network to identify what an object is without being affected by where it is, so as to obtain a more accurate indoor layout depth map.
Disclosure of Invention
The disclosure provides an indoor scene 3D layout estimation method and device based on corner depth prediction.
According to a first aspect of the present disclosure, an indoor scene 3D layout estimation method based on corner depth prediction is provided. The method comprises the following steps:
acquiring a target indoor scene image;
inputting the target indoor scene image into a deep learning model, and outputting a predicted corner heatmap, corner depth values and embedding vectors;
generating corner coordinates according to the corner heatmap;
clustering the embedding vectors to obtain a rough segmentation map, and performing dilation and erosion operations on each planar region of the rough segmentation map to obtain a target area;
determining target corners according to the corner coordinates and the target area, and performing plane fitting according to the corner depth values of the target corners to obtain a target depth map corresponding to each of a plurality of fitted planes;
and performing depth map intersection calculation on the target depth maps by using a depth map intersection algorithm to obtain a layout depth map of the target indoor scene image.
In combination with the above aspect and any possible implementation thereof, a further implementation is provided in which:
the corner points include layout corner points and image corner points;
the layout corner points comprise layout boundary points and layout interior points; a layout boundary point is the intersection of two planes with the image boundary; a layout interior point is the intersection of three planes.
In combination with the above aspect and any possible implementation thereof, a further implementation is provided in which:
the deep learning model is obtained through the following steps:
acquiring a manually annotated 3D layout image set, in which each 3D layout image is annotated with corner pixel coordinates and the corresponding depth values;
inputting the 3D layout image set into a pre-constructed network model for training;
stopping training when the accuracy of the network model meets a preset threshold, to obtain the deep learning model;
wherein the network model is constructed with three network branches.
In combination with the above aspect and any possible implementation thereof, a further implementation is provided in which:
training of the network model is supervised using a binary cross-entropy loss function.
In combination with the above aspect and any possible implementation thereof, a further implementation is provided in which:
generating the corner coordinates according to the corner heatmap comprises:
extracting the pixel coordinates of the corners from the predicted Gaussian corner heatmap using a non-maximum suppression algorithm, and taking the pixel coordinates as the corner coordinates.
In combination with the above aspect and any possible implementation thereof, a further implementation is provided in which:
clustering the embedding vectors to obtain a rough segmentation map, and performing dilation and erosion operations on each planar region of the rough segmentation map to obtain a target area, comprises:
applying a mean-shift clustering algorithm to the embedding vectors to generate the rough segmentation map;
performing dilation and erosion operations on each planar region of the rough segmentation map;
performing a bitwise XOR operation on the dilated image and the eroded image to obtain, as the target area, the intermediate region between the dilated boundary and the eroded boundary.
In combination with the above aspect and any possible implementation thereof, a further implementation is provided in which:
determining target corners according to the corner coordinates and the target area, and performing plane fitting according to the corner depth values of the target corners to obtain a target depth map corresponding to each of a plurality of fitted planes, comprises:
taking the corners that fall within the intermediate region as target corners, and performing a least-squares calculation on the corresponding corner depth values to obtain plane depth parameters;
fitting the target depth maps of the plurality of planes according to the plane depth parameters.
According to a second aspect of the present disclosure, there is provided an indoor scene 3D layout estimation apparatus based on corner depth prediction. The device comprises:
The data acquisition module is used for acquiring a target indoor scene image;
The model output module is used for inputting the target indoor scene image into a deep learning model and outputting a predicted corner heatmap, corner depth values and embedding vectors;
The coordinate determining module is used for generating corner coordinates according to the corner heatmap;
The rough segmentation module is used for clustering the embedding vectors to obtain a rough segmentation map, and performing dilation and erosion operations on each planar region of the rough segmentation map to obtain a target area;
The depth map generation module is used for determining target corners according to the corner coordinates and the target area, and performing plane fitting according to the corner depth values of the target corners to obtain a target depth map corresponding to each of a plurality of fitted planes;
and the depth map generation module is also used for carrying out depth map intersection calculation on the target depth map by using a depth map intersection algorithm to obtain a layout depth map of the target indoor scene image.
According to a third aspect of the present disclosure, an electronic device is provided. The electronic device includes: a memory and a processor, the memory having stored thereon a computer program, the processor implementing the method as described above when executing the program.
According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor implements a method according to the first aspect of the present disclosure.
According to the indoor scene 3D layout estimation method and device based on corner depth prediction, coplanar corner points are obtained by dilating and eroding the semantic labels, from which the plane depth parameters of each plane are obtained and the indoor layout is recovered.
It should be understood that what is described in this summary is not intended to limit the critical or essential features of the embodiments of the disclosure nor to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. For a better understanding of the present disclosure, and without limiting the disclosure thereto, the same or similar reference numerals denote the same or similar elements, wherein:
fig. 1 illustrates a flowchart of an indoor scene 3D layout estimation method based on corner depth prediction according to an embodiment of the present disclosure;
Fig. 2 shows a block diagram of an indoor scene 3D layout estimation apparatus based on corner depth prediction according to an embodiment of the present disclosure;
FIG. 3 shows a schematic block diagram of an exemplary electronic device capable of implementing embodiments of the present disclosure;
FIG. 4 shows an example plot of shortest distance of a real corner point to a fitting plane according to an embodiment of the present disclosure;
FIG. 5 illustrates an overall flow diagram for predicting corner points with depth to recover indoor layout according to an embodiment of the present disclosure;
Fig. 6 shows an exemplary schematic diagram where different types of corner points have the shortest distance to adjacent planes according to an embodiment of the present disclosure;
FIG. 7 shows a qualitative result comparison on the Matterport3D-Layout dataset according to an embodiment of the present disclosure; the input image is shown in (a), the predicted corners with depth in (b), the segmentation mask obtained by clustering the two predicted embedding vectors in (c), and the estimated layout depth map in (d); (e), (f), (g) and (h) show the 2D layout of the present disclosure and the 2D layouts of other methods, respectively;
FIG. 8 shows a qualitative result comparison on the LSUN dataset according to an embodiment of the present disclosure; the input image is shown in (a), the predicted corners with depth in (b), the segmentation mask obtained by clustering the two predicted embedding vectors in (c), and the estimated layout depth map in (d); (e) and (f) show the 2D layout of the present disclosure and the 2D layouts of other methods, respectively;
FIG. 9 shows test results after different crops of the same image according to an embodiment of the present disclosure; (a) shows the original image after five different crops, (b) the test results of GeoLayout, and (c) the test results of the method of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments in this disclosure without inventive faculty, are intended to be within the scope of this disclosure.
In addition, the term "and/or" herein is merely an association relationship describing an association object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
In the disclosure, as shown in fig. 5 (the overall flow of predicting corners with depth to recover the indoor layout), it is proposed to predict the position information and depth information of the layout corners and image corners in the image. Meanwhile, two embedding vectors are additionally predicted, a rough layout segmentation map is obtained by clustering, and the predicted points are classified through dilation and erosion operations, so that the plane depth parameters of each plane are obtained and the indoor layout is then recovered.
First, a method is proposed for recovering the indoor layout by learning corners with depth, which does not depend on positional context information; a method based on dilation, erosion and XOR of planar regions is proposed for finding coplanar corners.
Second, a geometric constraint on the point-to-plane distance is provided, which effectively strengthens the learning of corner depth.
Third, experiments on the Matterport3D-Layout dataset and two 2D benchmark datasets verify the validity and robustness of the method.
Fig. 1 shows a flowchart of an indoor scene 3D layout estimation method 100 based on corner depth prediction according to an embodiment of the present disclosure. The method 100 comprises the following steps:
Step 110, a target indoor scene image is acquired.
In some embodiments, an indoor scene image is acquired that requires an indoor scene 3D layout estimation.
In some embodiments, in an indoor scene, layout corners (intersections between walls) are salient feature points of the indoor layout and provide important information for the layout estimation task. As shown in fig. 6, which illustrates the shortest distances from different types of corners to their adjacent planes, the corners are divided into three classes: layout interior points (the intersection of three planes), layout boundary points (the intersection of two planes with the image boundary), and the four image corners. For example, a layout boundary point, being the junction of two walls, has the shortest distance to the wall surfaces labelled 1 and 2; a layout interior point, being the junction of three walls, has the shortest distance to the wall surfaces labelled 1, 3 and 4; an image corner is the junction of a wall with the image boundary, so its distance to the wall surface labelled 4 is the shortest. By this definition it is guaranteed that at least three points with depth can be found on any plane, so that its plane depth parameters can be fitted.
Step 120, inputting the target indoor scene image into the deep learning model, and outputting a predicted corner heatmap, corner depth values and embedding vectors.
In some embodiments, the deep learning model is obtained as follows: a manually annotated 3D layout image set is acquired, in which each image is annotated with corner pixel coordinates and the corresponding depth values; the 3D layout image set is fed into a pre-constructed network model for training; and training is stopped when the accuracy of the network model meets a preset threshold, yielding the deep learning model. The network model uses a Swin Transformer as the encoder and neural window fully-connected CRFs (FC-CRFs) modules as the decoder, and uses a pyramid pooling module (PPM) to aggregate information over the whole image. Finally, a predicted depth map is obtained through convolution and upsampling operations. The output part of the network model is divided into three output branches: the first branch predicts the corner positions, the second branch predicts the corner depths, and the third branch predicts the embedding vectors. Since the positions of the four image corners are fixed, only the positions of the layout corners need to be learned.
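As an illustration of the three-branch output structure just described, the following is a minimal sketch assuming a shared decoder feature map; the class name, argument names and channel width of the shared feature are assumptions, and only the output channel counts (2, 1, 2) follow the text. This is not the patented network itself.

```python
import torch
import torch.nn as nn

class LayoutHeads(nn.Module):
    """Three prediction heads on top of a shared decoder feature map."""

    def __init__(self, in_channels: int = 64):
        super().__init__()
        # branch 1: Gaussian heatmaps for layout boundary points / interior points
        self.corner_head = nn.Conv2d(in_channels, 2, kernel_size=1)
        # branch 2: per-pixel depth, supervised only at (and around) corner positions
        self.depth_head = nn.Conv2d(in_channels, 1, kernel_size=1)
        # branch 3: two-channel pixel embedding used later for plane clustering
        self.embed_head = nn.Conv2d(in_channels, 2, kernel_size=1)

    def forward(self, feat: torch.Tensor):
        heatmaps = torch.sigmoid(self.corner_head(feat))
        depth = self.depth_head(feat)
        embedding = self.embed_head(feat)
        return heatmaps, depth, embedding
```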
The number of output channels of the first branch is set to 2, outputting Gaussian heatmaps for the layout boundary points and the layout interior points respectively. Gaussian blur is added to the maps containing only the ground-truth layout boundary points and interior points, and a binary cross-entropy loss function $L_{cor}$ is used for supervised training. Since the numbers of positive and negative samples differ greatly, a larger loss weight is given to the positive samples:
$L_{cor}=L_b+L_{in}$, with
$L_b=-\frac{1}{N}\sum_{i=1}^{N}\left[w_b\,y_i^{b}\log p_i^{b}+(1-y_i^{b})\log(1-p_i^{b})\right]$ and
$L_{in}=-\frac{1}{N}\sum_{i=1}^{N}\left[w_{in}\,y_i^{in}\log p_i^{in}+(1-y_i^{in})\log(1-p_i^{in})\right]$,
where $p_i^{b}$ and $p_i^{in}$ respectively denote the probability that the $i$-th pixel of the predicted position heatmap is a layout boundary point or a layout interior point; $y_i^{b}$ and $y_i^{in}$ are the ground-truth labels of the $i$-th pixel for layout boundary points and layout interior points; $w_b$ and $w_{in}$ are the weights of the $i$-th pixel being a layout boundary point or a layout interior point; $N$ is the total number of pixels in the predicted position heatmap; $L_b$ is the loss for layout boundary points and $L_{in}$ is the loss for layout interior points.
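A hedged sketch of this positively weighted binary cross-entropy supervision is given below; the function name and the weight values are illustrative assumptions, and only the use of per-channel positive-sample weights follows the description above.

```python
import torch
import torch.nn.functional as F

def corner_heatmap_loss(pred_logits, target, pos_weight=(50.0, 50.0)):
    """pred_logits, target: (B, 2, H, W) heatmaps (boundary / interior channels)."""
    w = torch.tensor(pos_weight, device=pred_logits.device).view(1, 2, 1, 1)
    # positive pixels (corners) are rare, so they receive a larger weight
    return F.binary_cross_entropy_with_logits(pred_logits, target, pos_weight=w)
```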
The number of output channels of the second branch is 1, used for predicting the corner depth values. Loss weights are given only at the corner positions to which Gaussian blur was added, and an L1 loss supervises the learning of the depth values at the corners:
$L_z=\frac{1}{N}\sum_{i}\left(p_i+E_i\right)\,\lvert \hat{d}_i-d_i\rvert$,
where $L_z$ is the corner depth loss, $N$ is the total number of corners, $p_i$ is the corner probability at pixel $i$, $E$ is a position map of the four image corners (the pixel value at the four image corner positions is 1 and 0 elsewhere), $\hat{d}_i$ is the depth value predicted at pixel $i$, and $d_i$ is the ground-truth depth value at pixel $i$.
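A minimal sketch of this masked L1 corner-depth term follows; the weighting by the blurred corner probability plus the fixed image-corner map comes from the text, while the normalization and the function name are assumptions.

```python
import torch

def corner_depth_l1(pred_depth, gt_depth, corner_prob, image_corner_mask):
    """All inputs are (H, W) tensors; the weight is non-zero only near corners."""
    weight = corner_prob + image_corner_mask
    diff = (pred_depth - gt_depth).abs() * weight
    return diff.sum() / weight.sum().clamp(min=1.0)
```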
In addition, the scale-invariant logarithmic (SILog) loss is used as a supplement to guide the model to learn the depth values at the corners. Note that the depth values of the whole image are not supervised directly; only the layout corner and image corner positions and their surrounding points are. First, the logarithmic difference between the predicted depth and the ground-truth depth at these points is computed:
$g_i=m_i\left(\log \hat{d}_i-\log d_i\right)$,
where $m$ is a 0-1 matrix whose elements indicate which pixels participate: when $m_i=1$ for the $i$-th pixel, the logarithmic difference is computed at that pixel. The scale-invariant loss over the $K$ pixels with valid depth values is then:
$L_s=\lambda\sqrt{\frac{1}{K}\sum_{i}g_i^{2}-\frac{\mu}{K^{2}}\left(\sum_{i}g_i\right)^{2}}$,
where $L_s$ is the scale-invariant loss, $\mu$ is the variance-minimizing factor and $\lambda$ is the scale constant; $\lambda$ is set to 10, $\mu$ is set to 0.85, and $K$ is the number of pixels with valid depth values.
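The following is a hedged sketch of this scale-invariant log term restricted to the corner neighbourhoods; the constants 10 and 0.85 follow the text, and the boolean mask argument is an assumption standing in for the 0-1 matrix mentioned above.

```python
import torch

def silog_loss(pred_depth, gt_depth, mask, variance_factor=0.85, scale=10.0, eps=1e-6):
    """mask: boolean tensor selecting the corner positions and their neighbours."""
    g = torch.log(pred_depth[mask] + eps) - torch.log(gt_depth[mask] + eps)
    return scale * torch.sqrt((g ** 2).mean() - variance_factor * g.mean() ** 2 + eps)
```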
In addition, in general, a layout boundary point has the shortest (zero) distance to its two adjacent surfaces, a layout interior point has the shortest (zero) distance to its three adjacent surfaces, and an image corner has the shortest (zero) distance to its one adjacent surface. A geometric constraint on the point-to-plane distance, $L_{P\text{-}P}$, is therefore imposed to supervise the learning of the corner depth values:
$L_{P\text{-}P}=\frac{1}{n}\sum_{o=1}^{n}\sum_{i=1}^{3}D_{o,i}+\frac{1}{m}\sum_{r=1}^{m}\sum_{j=1}^{2}D_{r,j}+\frac{1}{4}\sum_{e=1}^{4}D_{e}$,
where $D_{o,i}$ is the distance from the $o$-th three-plane intersection point to the $i$-th of its three nearest fitted planes, with $o$ the index of the intersection point; $D_{r,j}$ is the distance from the $r$-th intersection point of two planes with the image boundary to the $j$-th of its two nearest fitted planes, with $r$ the index of the intersection point; $D_{e}$ is the distance from the $e$-th image corner to its nearest fitted plane, with $e$ the index of the corner; $n$ is the number of ground-truth three-plane intersection points and $m$ is the number of ground-truth intersection points of two planes with the image boundary.
For example, as shown in fig. 4, which gives an example of the shortest distance from a real corner to a fitted plane, each type of corner finds the fitted plane closest to it in the (u, v, 1/z) coordinate system.
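A small sketch of the point-to-plane distance underlying this constraint, assuming the plane is parameterized as 1/z = p*u + q*v + r in the (u, v, 1/z) space mentioned above; the function name is illustrative.

```python
import numpy as np

def point_to_plane_distance(u0, v0, inv_z0, p, q, r):
    # plane: p*u + q*v - (1/z) + r = 0, so its normal is (p, q, -1)
    return abs(p * u0 + q * v0 - inv_z0 + r) / np.sqrt(p ** 2 + q ** 2 + 1.0)
```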
Therefore, the total depth constraint is $L_{depth}=L_z+L_s+L_{P\text{-}P}$.
The third branch is used to learn two embedding vectors, so its number of output channels is set to 2. All pixels are mapped into the embedding space; pixel embeddings belonging to the same plane should be as close as possible, and the embeddings of different planes should be as far apart as possible. A discriminative loss $L_{embed}$ is used to penalize pixel embeddings on the same plane that are far from each other and mean embeddings of different planes that are close to each other:
$L_{pull}=\frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_c}\sum_{i=1}^{n_c}\max\left(\lVert x_c-x_i\rVert-\delta_v,\,0\right)^{2}$,
$L_{push}=\frac{1}{C(C-1)}\sum_{c_A\neq c_B}\max\left(\delta_d-\lVert x_{c_A}-x_{c_B}\rVert,\,0\right)^{2}$,
$L_{embed}=\alpha L_{pull}+\beta L_{push}$,
where $C$ is the number of planes in the ground truth, $n_c$ is the number of pixels in plane $c$, $x_i$ is a pixel embedding, $x_c$ is the mean embedding of plane $c$, $\delta_v$ and $\delta_d$ are the thresholds of the pull loss and push loss respectively, $\alpha$ is the weight of the pull loss, $\beta$ is the weight of the push loss, and $x_{c_A}$ and $x_{c_B}$ are the mean embeddings of two different planes $c_A$ and $c_B$ in the corresponding output channels.
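A hedged sketch of this discriminative pull/push loss is given below, following the standard formulation; the threshold and weight values and the handling of plane labels are assumptions rather than values from the patent.

```python
import torch

def discriminative_loss(embed, plane_mask, delta_v=0.5, delta_d=1.5, alpha=1.0, beta=1.0):
    """embed: (2, H, W) pixel embeddings; plane_mask: (H, W) integer plane labels."""
    labels = plane_mask.unique()
    means, pull = [], 0.0
    for c in labels:
        pix = embed[:, plane_mask == c]                      # (2, n_c)
        mu = pix.mean(dim=1, keepdim=True)                   # (2, 1) mean embedding of plane c
        means.append(mu)
        pull = pull + ((pix - mu).norm(dim=0) - delta_v).clamp(min=0).pow(2).mean()
    pull = pull / len(labels)

    push = 0.0
    if len(means) > 1:
        mus = torch.cat(means, dim=1)                        # (2, C)
        dist = (mus.unsqueeze(2) - mus.unsqueeze(1)).norm(dim=0)   # (C, C) pairwise distances
        off_diag = ~torch.eye(len(labels), dtype=torch.bool, device=embed.device)
        push = (delta_d - dist[off_diag]).clamp(min=0).pow(2).mean()

    return alpha * pull + beta * push
```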
Finally, the total training loss is $L=L_{cor}+L_{depth}+L_{embed}$.
Step 130, generating corner coordinates according to the corner heatmap.
In some embodiments, step 130 includes: extracting the corner pixel coordinates from the predicted Gaussian corner heatmaps using a non-maximum suppression algorithm and taking them as the corner coordinates.
In some embodiments, a non-maximum suppression (NMS) algorithm is used to extract the specific pixel coordinates from the predicted Gaussian heatmaps of the layout boundary points and the layout interior points, and the depth value corresponding to each coordinate is then looked up in the predicted depth map. At this point the predicted corners with depth have been obtained.
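A minimal sketch of heatmap non-maximum suppression of this kind is shown below; the 3x3 window and the score threshold are assumptions, not values taken from the patent.

```python
import torch
import torch.nn.functional as F

def extract_corners(heatmap, thresh=0.3):
    """heatmap: (H, W) predicted Gaussian heatmap for one corner type."""
    h = heatmap.unsqueeze(0).unsqueeze(0)
    # keep pixels that are local maxima in a 3x3 window and above the threshold
    keep = (F.max_pool2d(h, 3, stride=1, padding=1) == h) & (h > thresh)
    ys, xs = torch.nonzero(keep[0, 0], as_tuple=True)
    return torch.stack([xs, ys], dim=1)   # (N, 2) pixel coordinates (u, v)
```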
Step 140, clustering the embedding vectors to obtain a rough segmentation map, and performing dilation and erosion operations on each planar region of the rough segmentation map to obtain a target area.
In some embodiments, step 140 comprises: applying a mean-shift clustering algorithm to the embedding vectors to generate a rough segmentation map; performing dilation and erosion operations on each planar region of the rough segmentation map; and performing a bitwise XOR operation on the dilated image and the eroded image to obtain, as the target area, the intermediate region between the dilated boundary and the eroded boundary. For example, a mean-shift clustering algorithm is applied to the two predicted embedding vectors to generate a rough layout segmentation result. Each plane is then taken out in turn according to the rough layout segmentation map, and is dilated and eroded using a square structuring element of size 10.
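The following is a hedged sketch of this post-processing, assuming OpenCV and scikit-learn as the tooling; the mean-shift bandwidth is illustrative, while the square structuring element of size 10 and the XOR of the dilated and eroded masks follow the description above.

```python
import numpy as np
import cv2
from sklearn.cluster import MeanShift

def plane_masks_and_bands(embedding, kernel_size=10, bandwidth=0.5):
    """embedding: (H, W, 2) array of per-pixel embedding vectors."""
    h, w, _ = embedding.shape
    # mean-shift clustering of the 2-D embeddings (slow at full resolution)
    labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(
        embedding.reshape(-1, 2)).reshape(h, w)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    bands = []
    for c in np.unique(labels):
        mask = (labels == c).astype(np.uint8) * 255
        dilated = cv2.dilate(mask, kernel)
        eroded = cv2.erode(mask, kernel)
        # band between the dilated and eroded boundaries ("target area")
        bands.append(cv2.bitwise_xor(dilated, eroded))
    return labels, bands
```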
Step 150, determining target corners according to the corner coordinates and the target area, and performing plane fitting according to the corner depth values of the target corners to obtain a target depth map corresponding to each of a plurality of fitted planes.
In some embodiments, step 150 comprises: taking the corners that fall within the intermediate region as points on the corresponding fitted plane, and fitting the corresponding depth values to obtain the target depth map of that plane. That is, when a predicted boundary point, interior point or one of the four image corners falls within the intermediate region, it is classified as a point on that plane, i.e. a target corner. Each plane is then obtained by fitting the pixel coordinates of at least three corners on the same plane together with their corresponding depth values; the pixel coordinates of the corners and the predicted depth values corresponding to them are fitted by least squares to obtain the fitted plane. A point on a 3D plane can be projected onto the image plane and satisfies:
$\frac{1}{Z_i}=p\,u_i+q\,v_i+r$,
where $(u, v, 1/Z)$ denotes the coordinate system. Using this equation, least-squares fitting is performed in turn on the points classified into the same plane to obtain the parameters $(p, q, r)$ of each plane.
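A small sketch of this least-squares plane fit in the (u, v, 1/Z) space is given below; the function name and the clipping of the inverse depth are assumptions.

```python
import numpy as np

def fit_plane_depth(us, vs, zs, height, width):
    """us, vs, zs: pixel coordinates and depths of the corners assigned to one plane."""
    A = np.stack([us, vs, np.ones_like(us)], axis=1)          # (N, 3)
    p, q, r = np.linalg.lstsq(A, 1.0 / zs, rcond=None)[0]     # solve 1/z = p*u + q*v + r
    uu, vv = np.meshgrid(np.arange(width), np.arange(height))
    inv_z = p * uu + q * vv + r
    return 1.0 / np.clip(inv_z, 1e-6, None)                   # (H, W) depth map of this plane
```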
Step 160, performing depth map intersection calculation on the target depth maps using a depth map intersection algorithm to obtain a layout depth map of the target indoor scene image.
In some embodiments, the depth map of each surface obtained in step 150 is used, and the layout is generated by a depth map intersection algorithm, i.e. the planar layout is obtained by intersecting the plane depth maps with one another.
In some embodiments, Hedau and LSUN, two benchmark datasets commonly used in current layout estimation tasks, are considered; both describe the scene layout by ground-truth two-dimensional segmentations and lack ground-truth depth information. To apply the above method to these datasets, pre-training is first performed on the Matterport3D-Layout dataset, and the pre-trained model is then fine-tuned on the benchmark dataset. The same network structure is used to learn the corners with depth and to learn the two embedding vectors; $L_{cor}$ constrains the predicted corner positions and $L_{embed}$ constrains the predicted embedding vectors.
Meanwhile, the layout types in the Hedau and LSUN datasets are cuboid, so the 3D layout of a cuboid scene can be obtained by taking the minimum depth value over all the principal-plane depth maps:
$Z_i=\min_{f} Z_i^{f}$,
where $Z_i$ is the layout depth value at pixel $i$, $f$ is the plane index, and $Z_i^{f}$ is the depth of plane $f$ at pixel $i$.
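A minimal sketch of this per-pixel minimum over the plane depth maps (the cuboid-scene case described above):

```python
import numpy as np

def layout_depth_from_planes(plane_depth_maps):
    """plane_depth_maps: list of (H, W) arrays, one per fitted principal plane."""
    return np.min(np.stack(plane_depth_maps, axis=0), axis=0)
```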
Since LSUN provides the ground-truth layout segmentation and the 2D coordinates of the corners, the layout corners and the four image corners are first assigned to the planes they belong to. Least-squares fitting is then performed on the true pixel coordinates of all points belonging to the same plane, together with their predicted depth values, to obtain the plane parameters. A depth map is then computed and the minimum is taken at each pixel; the generated depth map is used as a pseudo label to supervise the learning of the depth values at the layout corners and the four image corners, with the loss on the predicted depth penalizing the difference between the predicted depth and this pseudo label at those positions.
In addition, the point-to-plane distance geometric constraint $L_{P\text{-}P}$ described above is added.
In addition, since the true depth values of layout planes generally lie within a certain range, a further loss is imposed that penalizes predicted depth values falling below a lower bound or above an upper bound, preventing the depth values from becoming too large or too small.
However, with only the two losses above for supervision, the model may tend to make every plane produce similar depth values in order to minimize the loss. A cross-entropy loss is therefore used to keep a large gap between the depth values under different labels; it involves a scaling factor, the number of planes $F$, and the per-pixel labels of the depth map stitched together from the 2D ground-truth segmentation map. The total fine-tuning loss combines the above terms.
Thus, the three datasets Matterport3D-Layout, LSUN and Hedau can be used for training and testing in the manner described above.
The above procedure of the present application is described in detail below in connection with the three datasets Matterport3D-Layout, LSUN and Hedau:
First, the RGB images of the input indoor scenes are all resized to 224x224 using bilinear interpolation, and the three output branches of the network have output sizes of 112x112x2, 112x112x1 and 112x112x2, respectively. The training data are augmented with colour-illumination and colour-jitter augmentation. The network is trained for 200 epochs in a PyTorch environment on a server with an NVIDIA RTX 4090 GPU. Adam is used as the optimizer, the batch size is set to 16, the initial learning rate is 1e-4 and the weight decay is 1e-4. During training the learning rate is reduced by 30% every 50 epochs, and the remaining hyper-parameters follow this configuration.
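A hedged sketch of this training configuration follows; the stand-in module and the exact scheduler class are assumptions, while the optimizer, learning rate, weight decay and the 30%-every-50-epochs schedule follow the text.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 5, kernel_size=1)   # stand-in for the three-branch network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
# reduce the learning rate by 30% every 50 epochs, as described above
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.7)
```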
Regarding the datasets: the Matterport3D-Layout dataset is built by combining manual annotation with a plane-fitting algorithm and is dedicated to 3D layout estimation research, providing ground-truth 3D layout information. It also provides the pixel coordinates of the layout corners and the depth values corresponding to those corners, making it well suited to learning the depth information associated with corners. The dataset consists of indoor images from 90 different buildings and exhibits good layout diversity. The training set covers 64 buildings with a total of 4,939 images; the validation set covers 6 buildings with 456 images; the test set contains the remaining 20 buildings with 1,965 images.
The LSUN dataset contains 4,000 training images, 394 validation images and 1,000 test images, and provides the ground-truth 2D layout segmentation masks and the pixel coordinates of the layout corners.
The Hedau dataset consists of 209 training images and 105 test images.
Regarding the evaluation metrics: the evaluation for the Matterport3D-Layout dataset focuses on the layout depth map and uses the root mean square error (rms), mean relative error (rel), mean log10 error (log10), threshold accuracy, and the 3D corner error (the Euclidean distance between the 3D layout corners in the camera coordinate system and the ground truth). For the LSUN and Hedau datasets, the metrics used to evaluate the proposed model are the pixel segmentation error and the corner position error, where the pixel error is the per-pixel surface-label accuracy between the ground truth and the estimate, and the corner error is the Euclidean distance between the true corners and the estimated corners normalized by the image diagonal length.
1) Results on Matterport3D-Layout dataset:
The network is first trained on the Matterport3D-Layout training set using the ground-truth layout segmentation maps and the ground-truth corners with depth as supervision signals. The method is then evaluated on the test set and compared with other state-of-the-art methods.
Quantitative results: the following table lists quantitative comparison results, and it can be observed that the method exhibits the best performance both on the 2D layout estimation index and the 3D layout estimation index:
Meanwhile, to reveal the performance of the new point-to-face distance geometric constraint (L P-P) proposed by the present disclosure, ablation experiments were performed on the Matterport3D-Layout dataset, as shown in the following table:
That is, comparing the results before and after adding L P-P, it can be found that the model (GeoLayout +L P-P) after adding L P-P is significantly improved in most indexes, wherein the improvement on 2D indexes is larger. "Ours (w/o L P-P)" means that no model was added to L P-P, and "Ours (w/L P-P)" means that a model was added to L P-P. "Ours (w/L P-P)" was found to perform best.
Qualitative results: a qualitative result comparison of the Matterport D-Layout dataset is shown in fig. 7, where the selected room types include three cuboid rooms and three non-cuboid rooms. The input image is shown in (a) and the predicted corner points with depth are shown in (b). In (c) a segmentation mask obtained by clustering the predicted two embedded vectors is shown. The estimated layout depth map is shown in (d). (e) (f), (g) and (h) show the 2D layout of the present disclosure, and the 2D layout corresponding to other methods, respectively. It can be seen that the present disclosure yields good layout estimation results on both cuboid and non-cuboid rooms. Experimental results show that the public method can obtain more accurate three-dimensional layout results.
2) Results on LSUN dataset and Hedau dataset:
To evaluate the generalization ability of the disclosed model, the model trained on the Matterport3D-Layout dataset is evaluated directly on the LSUN validation set and the Hedau test set, without any additional training ("w/o fine-tune" denotes these direct test results).
From the results it can be seen that, even without additional training, the disclosed model still produces reliable results and generalizes well. Next, fine-tuning is performed on the LSUN dataset, divided into 2D layout estimation and 3D layout estimation. Although the LSUN dataset does not contain ground-truth layout depth maps, performance still improves significantly, and better performance than other methods is achieved on the LSUN dataset. Finally, the model fine-tuned on the LSUN dataset is tested on the Hedau test set, where it outperforms the other compared methods, further demonstrating the effectiveness of the method.
Qualitative results: a qualitative comparison on the LSUN dataset is shown in fig. 8, where the selected room types include three cuboid rooms and three non-cuboid rooms. The input images are shown in (a) and the predicted corners with depth in (b). (c) shows the segmentation masks obtained by clustering the two predicted embedding vectors, and (d) the estimated layout depth maps. (e) and (f) show the 2D layout of the present disclosure and the 2D layouts of the other methods, respectively. It can be seen that performance still improves markedly even though the LSUN dataset does not contain ground-truth layout depth maps.
3) Stability test:
To test the position independence of the disclosed method, the following experiment was performed. Each picture in the Matterport3D-Layout test set is cropped in five different ways and tested with the disclosed method and with GeoLayout respectively, and the standard deviation of the error on each picture is computed. Fig. 9 shows the test results after different crops of the same image, with two groups illustrated: (a) shows the five different crops of the original image, (b) the test results of the disclosed method, and (c) the test results of GeoLayout. Since the true corner coordinates of the cropped pictures are not known, the 2D metrics consider only the pixel error.
It can be found that the standard deviation of every error of the disclosed method is smaller than that of GeoLayout, indicating that the disclosed method is more stable. When testing on the original images, both the disclosed method and GeoLayout obtain good 2D layouts; after cropping, however, GeoLayout produces significant prediction errors, while the results produced by the disclosed method after the different crops remain closer to the ground truth. The experimental results show that, compared with methods that directly learn plane depth parameters, the disclosed model is more stable and more consistent across crops.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required by the present disclosure.
The foregoing is a description of embodiments of the method, and the following further describes embodiments of the present disclosure through examples of apparatus.
Fig. 2 shows a block diagram of an indoor scene 3D layout estimation apparatus 200 based on corner depth prediction according to an embodiment of the present disclosure. As shown in fig. 2, the apparatus 200 includes:
a data acquisition module 210, configured to acquire a target indoor scene image;
the model output module 220, configured to input the target indoor scene image into a deep learning model and output a predicted corner heatmap, corner depth values and embedding vectors;
The coordinate determining module 230 is configured to generate corner coordinates according to the corner heatmap;
The rough segmentation module 240 is configured to cluster the embedding vectors to obtain a rough segmentation map, and to perform dilation and erosion operations on each planar region of the rough segmentation map to obtain a target region;
The depth map generating module 250 is configured to determine target corners according to the corner coordinates and the target area, and to perform plane fitting according to the corner depth values of the target corners to obtain a target depth map corresponding to each of a plurality of fitted planes;
the depth map generating module 250 is further configured to perform depth map intersection calculation on the target depth map by using a depth map intersection algorithm, so as to obtain a layout depth map of the target indoor scene image.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the described modules may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
According to an embodiment of the disclosure, the disclosure further provides an electronic device, a readable storage medium.
Fig. 3 shows a schematic block diagram of an electronic device 300 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
The electronic device 300 includes a computing unit 301 that can perform various appropriate actions and processes according to a computer program stored in a ROM302 or a computer program loaded from a storage unit 308 into a RAM 303. In the RAM303, various programs and data required for the operation of the electronic device 300 may also be stored. The computing unit 301, the ROM302, and the RAM303 are connected to each other by a bus 304. I/O interface 305 is also connected to bus 304.
Various components in the electronic device 300 are connected to the I/O interface 305, including: an input unit 306 such as a keyboard, a mouse, etc.; an output unit 307 such as various types of displays, speakers, and the like; a storage unit 308 such as a magnetic disk, an optical disk, or the like; and a communication unit 309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 309 allows the electronic device 300 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 301 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 301 performs the various methods and processes described above, such as method 100. For example, in some embodiments, the method 100 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 308. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 300 via the ROM302 and/or the communication unit 309. One or more of the steps of the method 100 described above may be performed when the computer program is loaded into RAM303 and executed by the computing unit 301. Alternatively, in other embodiments, the computing unit 301 may be configured to perform the method 100 by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems On Chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. An indoor scene 3D layout estimation method based on corner depth prediction, characterized by comprising the following steps:
acquiring a target indoor scene image;
inputting the target indoor scene image into a deep learning model, and outputting a predicted corner heatmap, corner depth values and embedding vectors;
generating corner coordinates according to the corner heatmap;
clustering the embedding vectors to obtain a coarse segmentation map, and performing dilation and erosion operations on each plane region of the coarse segmentation map to obtain a target region;
determining target corners according to the corner coordinates and the target region, and performing plane fitting according to the corner depth values of the target corners to obtain a target depth map corresponding to each of a plurality of fitted planes;
and performing depth map intersection calculation on the target depth maps using a depth map intersection algorithm to obtain a layout depth map of the target indoor scene image.
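By way of illustration only (not part of the claim), a minimal Python sketch of the depth-map intersection step follows. It assumes the layout depth at each pixel is the per-pixel minimum over the per-plane target depth maps, i.e. the nearest fitted plane along each viewing ray is taken as the visible layout surface; the function name and array shapes are illustrative.

import numpy as np

def intersect_plane_depth_maps(plane_depth_maps):
    # plane_depth_maps: list of H x W dense depth maps, one per fitted plane
    # (pixels where a plane is not defined should be set to np.inf beforehand).
    stacked = np.stack(plane_depth_maps, axis=0)   # P x H x W
    layout_depth = stacked.min(axis=0)             # nearest plane along each viewing ray
    plane_labels = stacked.argmin(axis=0)          # index of the visible plane per pixel
    return layout_depth, plane_labels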
2. The method according to claim 1, wherein
the corners include: layout corners and image corners;
the layout corners comprise layout boundary points and layout interior points; a layout boundary point is an intersection point of two planes with the image boundary; a layout interior point is an intersection point of three planes.
3. The method according to claim 1, wherein
the deep learning model is obtained through the following steps:
acquiring a manually annotated 3D layout image set, wherein each 3D layout image is annotated with corner pixel coordinates and corresponding depth values;
inputting the 3D layout image set into a pre-constructed network model for training;
stopping training when the accuracy of the network model reaches a preset threshold, to obtain the deep learning model;
wherein the network model is constructed with three network branches.
4. The method according to claim 3, wherein
training of the network model is supervised using a binary cross-entropy loss function.
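As an illustrative sketch only (not part of the claims), the three network branches of claim 3 and the binary cross-entropy supervision of claim 4 could look roughly as follows in PyTorch; the shared feature map, channel sizes and embedding dimension are assumptions made for the sketch, not taken from the disclosure.

import torch
import torch.nn as nn

class ThreeBranchHead(nn.Module):
    # Three 1x1 convolution branches on a shared feature map:
    # corner heatmap logits, corner depth values, and plane embedding vectors.
    def __init__(self, in_ch=64, embed_dim=8):
        super().__init__()
        self.heatmap = nn.Conv2d(in_ch, 1, kernel_size=1)
        self.depth = nn.Conv2d(in_ch, 1, kernel_size=1)
        self.embedding = nn.Conv2d(in_ch, embed_dim, kernel_size=1)

    def forward(self, feats):
        return self.heatmap(feats), self.depth(feats), self.embedding(feats)

# Binary cross-entropy supervision of the heatmap branch (claim 4):
# heatmap_logits and gt_heatmap are N x 1 x H x W tensors, ground-truth values in [0, 1].
bce = nn.BCEWithLogitsLoss()
# loss_heatmap = bce(heatmap_logits, gt_heatmap)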
5. The method according to claim 1, wherein
generating the corner coordinates according to the corner heatmap comprises:
extracting pixel coordinates of the corners from the predicted corner Gaussian heatmap using a non-maximum suppression algorithm, and taking the pixel coordinates as the corner coordinates.
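A minimal non-maximum suppression sketch (illustration only): local maxima of the predicted Gaussian heatmap that exceed a score threshold are kept as corner pixel coordinates. The window size and threshold are illustrative choices.

import numpy as np
from scipy.ndimage import maximum_filter

def extract_corners(heatmap, threshold=0.3, window=5):
    # Keep a pixel if it equals the maximum of its window x window neighbourhood
    # (non-maximum suppression) and its score exceeds the threshold.
    local_max = maximum_filter(heatmap, size=window)
    keep = (heatmap == local_max) & (heatmap > threshold)
    ys, xs = np.nonzero(keep)
    return np.stack([xs, ys], axis=1)   # corner coordinates as (x, y) pixels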
6. The method according to claim 1, wherein
clustering the embedding vectors to obtain a coarse segmentation map, and performing dilation and erosion operations on each plane region of the coarse segmentation map to obtain a target region, comprises:
applying a mean-shift clustering algorithm to the embedding vectors to generate a coarse segmentation map;
performing dilation and erosion operations on each plane region of the coarse segmentation map;
performing a bitwise exclusive-OR operation on the dilated image and the eroded image to obtain, as the target region, an intermediate region bounded by the dilated boundary and the eroded boundary.
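For illustration only (not part of the claim), a sketch of the coarse segmentation and target-region extraction: mean-shift clustering of the embedding vectors gives the coarse segmentation map, and for each plane region the bitwise XOR of the dilated and eroded masks leaves a band around the plane boundary, used as the target region. Bandwidth and kernel size are illustrative; clustering every pixel as shown is slow, and a subsample of the embeddings may be clustered in practice.

import numpy as np
import cv2
from sklearn.cluster import MeanShift

def coarse_segmentation_and_target_region(embeddings, bandwidth=0.5, ksize=5):
    # embeddings: H x W x D embedding vectors predicted by the network.
    h, w, d = embeddings.shape
    labels = MeanShift(bandwidth=bandwidth).fit_predict(embeddings.reshape(-1, d))
    seg = labels.reshape(h, w)                       # coarse segmentation map

    kernel = np.ones((ksize, ksize), np.uint8)
    target_regions = {}
    for plane_id in np.unique(seg):
        mask = (seg == plane_id).astype(np.uint8) * 255
        dilated = cv2.dilate(mask, kernel)
        eroded = cv2.erode(mask, kernel)
        target_regions[plane_id] = cv2.bitwise_xor(dilated, eroded)  # boundary band
    return seg, target_regions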
7. The method according to claim 6, wherein
determining target corners according to the corner coordinates and the target region, and performing plane fitting according to the corner depth values of the target corners to obtain a target depth map corresponding to each of a plurality of fitted planes, comprises:
taking the corners falling within the intermediate region as target corners, and performing a least-squares calculation on the corresponding corner depth values to obtain plane depth parameters;
fitting target depth maps of a plurality of planes according to the plane depth parameters.
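An illustrative least-squares plane-fitting sketch (not part of the claim): it assumes the common parameterization in which the inverse depth of a 3D plane is an affine function of pixel coordinates, 1/z(u, v) = a*u + b*v + c, so the three plane depth parameters can be fitted from at least three non-collinear target corners and then evaluated densely to obtain that plane's target depth map. This parameterization is an assumption of the sketch, not stated in the claims.

import numpy as np

def fit_plane_depth(corner_uv, corner_depth, height, width):
    # corner_uv: K x 2 array of (u, v) pixel coordinates of the target corners (K >= 3)
    # corner_depth: K predicted corner depth values
    A = np.column_stack([corner_uv[:, 0], corner_uv[:, 1], np.ones(len(corner_uv))])
    b = 1.0 / np.asarray(corner_depth, dtype=float)
    params, *_ = np.linalg.lstsq(A, b, rcond=None)   # plane depth parameters (a, b, c)
    a_p, b_p, c_p = params

    u, v = np.meshgrid(np.arange(width), np.arange(height))
    inv_depth = a_p * u + b_p * v + c_p
    depth_map = np.where(inv_depth > 1e-6, 1.0 / np.maximum(inv_depth, 1e-6), np.inf)
    return params, depth_map                         # parameters and dense plane depth map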
8. An indoor scene 3D layout estimation device based on corner depth prediction, which is characterized by comprising:
The data acquisition module is used for acquiring a target indoor scene image;
The model output module is used for inputting the target indoor scene image into a deep learning model, and outputting a predicted corner heatmap, corner depth values and embedding vectors;
The coordinate determining module is used for generating corner coordinates according to the corner heatmap;
The coarse segmentation module is used for clustering the embedding vectors to obtain a coarse segmentation map, and performing dilation and erosion operations on each plane region of the coarse segmentation map to obtain a target region;
The depth map generation module is used for determining target corners according to the corner coordinates and the target region, and performing plane fitting according to the corner depth values of the target corners to obtain a target depth map corresponding to each of a plurality of fitted planes;
and the depth map generation module is further used for performing depth map intersection calculation on the target depth maps using a depth map intersection algorithm to obtain a layout depth map of the target indoor scene image.
9. An electronic device, comprising:
At least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-7.
CN202410971204.3A 2024-07-19 2024-07-19 Indoor scene 3D layout estimation method and device based on angular point depth prediction Pending CN118521601A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410971204.3A CN118521601A (en) 2024-07-19 2024-07-19 Indoor scene 3D layout estimation method and device based on angular point depth prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410971204.3A CN118521601A (en) 2024-07-19 2024-07-19 Indoor scene 3D layout estimation method and device based on angular point depth prediction

Publications (1)

Publication Number Publication Date
CN118521601A true CN118521601A (en) 2024-08-20

Family

ID=92274227

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410971204.3A Pending CN118521601A (en) 2024-07-19 2024-07-19 Indoor scene 3D layout estimation method and device based on angular point depth prediction

Country Status (1)

Country Link
CN (1) CN118521601A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160350930A1 (en) * 2015-05-28 2016-12-01 Adobe Systems Incorporated Joint Depth Estimation and Semantic Segmentation from a Single Image
CN109360232A (en) * 2018-09-10 2019-02-19 南京邮电大学 The indoor scene layout estimation method and device of confrontation network are generated based on condition
CN110458939A (en) * 2019-07-24 2019-11-15 大连理工大学 The indoor scene modeling method generated based on visual angle
CN113377888A (en) * 2021-06-25 2021-09-10 北京百度网讯科技有限公司 Training target detection model and method for detecting target
CN113920270A (en) * 2021-12-15 2022-01-11 深圳市其域创新科技有限公司 Layout reconstruction method and system based on multi-view panorama
US20240135559A1 (en) * 2022-10-24 2024-04-25 Qualcomm Incorporated Depth estimation using image and sparse depth inputs
CN116229247A (en) * 2023-03-02 2023-06-06 深圳市金地数字科技有限公司 Indoor scene semantic segmentation method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LUO, Hui; LU, Chunyu; ZHENG, Xiangwen: "A Semantic Segmentation Network Based on Multi-Scale Corner Detection", Computer Knowledge and Technology, no. 33, 25 November 2019 (2019-11-25) *

Similar Documents

Publication Publication Date Title
US10346996B2 (en) Image depth inference from semantic labels
CN109974743B (en) Visual odometer based on GMS feature matching and sliding window pose graph optimization
CN112597837B (en) Image detection method, apparatus, device, storage medium, and computer program product
AU2016201908A1 (en) Joint depth estimation and semantic labeling of a single image
CN112528858A (en) Training method, device, equipment, medium and product of human body posture estimation model
CN116563493A (en) Model training method based on three-dimensional reconstruction, three-dimensional reconstruction method and device
CN113298910A (en) Method, apparatus and storage medium for generating traffic sign line map
CN113570713B (en) Semantic map construction method and device for dynamic environment
CN114299242A (en) Method, device and equipment for processing images in high-precision map and storage medium
US12136252B2 (en) Label estimation device, label estimation method, and label estimation program
CN112085842B (en) Depth value determining method and device, electronic equipment and storage medium
CN115375847B (en) Material recovery method, three-dimensional model generation method and model training method
CN114723809A (en) Method and device for estimating object posture and electronic equipment
CN118521601A (en) Indoor scene 3D layout estimation method and device based on angular point depth prediction
Zhang et al. Building façade element extraction based on multidimensional virtual semantic feature map ensemble learning and hierarchical clustering
CN115239899B (en) Pose map generation method, high-precision map generation method and device
CN114723894B (en) Three-dimensional coordinate acquisition method and device and electronic equipment
CN114511862B (en) Form identification method and device and electronic equipment
CN116596750A (en) Point cloud processing method and device, electronic equipment and storage medium
CN116091709A (en) Three-dimensional reconstruction method and device for building, electronic equipment and storage medium
CN111488882B (en) High-precision image semantic segmentation method for industrial part measurement
CN115375740A (en) Pose determination method, three-dimensional model generation method, device, equipment and medium
Guo et al. Full-automatic high-precision scene 3D reconstruction method with water-area intelligent complementation and mesh optimization for UAV images
CN114529801A (en) Target detection method, device, equipment and storage medium
CN114022630A (en) Method, device and equipment for reconstructing three-dimensional scene and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination