RU2825722C1

RU2825722C1 - Visualization of a 3D scene reconstruction using semantic regularization of TSDF normals when training a neural network

Info

Publication number: RU2825722C1
Application number: RU2023122047A
Authority: RU
Inventors: Анна Ильинична СОКОЛОВА; Анна Борисовна ВОРОНЦОВА; Александр Георгиевич ЛИМОНОВ
Original assignee: Самсунг Электроникс Ко., Лтд.
Filing date: 2023-08-24
Publication date: 2024-08-28

Abstract

FIELD: image processing.

SUBSTANCE: method for reconstructing a 3D scene and visualizing it using a neural network consisting of a base neural network and a segmentation head. The base neural network includes a skeleton (backbone), which is its main part and computes features, and a TSDF head, which predicts a TSDF value in each voxel. The segmentation head is a 3D sparse convolutional segmentation module that predicts a segmentation label in each voxel. The method comprises the steps of: training the base neural network to obtain a TSDF volume for scene voxels by inputting training data, calculating the TSDF loss between the TSDF prediction and the reference scan, calculating the segmentation loss between the segmentation prediction and the segmentation reference, calculating the coordinates of the normals, calculating the total loss function as the sum of the TSDF loss, the segmentation loss and the normal loss, and minimizing the total loss function; using the trained base neural network to obtain a TSDF volume for an input sequence of RGB frames of a real scene; applying to the TSDF volume an algorithm that computes the reconstruction of the 3D scene; and rendering the 3D scene reconstruction to obtain a visualization of the 3D scene.

EFFECT: high accuracy of reconstructing 3D scene.

11 cl, 2 dwg, 3 tbl

Description

Field of technology to which the invention relates

The present invention can be applied in computer vision to create a 3D scene reconstruction and visualize its results. These results can be used for visual localization, for visual analysis of scene contents (in particular, detection and segmentation of a scene in 3D space), for estimating a room layout, for visualizing the interior of rooms from the created 3D model, and for solving other tasks.

Description of the prior art

3D reconstruction is a core task in computer vision, with applications in areas such as robotics and AR/VR. These applications require finely detailed, plausible, and dense reconstructions.

However, obtaining such reconstructions remains a challenge for modern methods because the input data are imperfect: observations are incomplete (occluded and unseen regions), frames are mutually inconsistent, and measurement errors are inevitable. As a result, reconstruction artifacts may occur. Global artifacts include a corrupted scene structure and incorrect partitioning; such serious flaws make the reconstructed scene practically useless.

In addition, many local artifacts can occur: for example, duplicated surfaces that failed to align after loop closure, holes in occluded and unseen regions, and non-flat walls and floors covered with dents and bumps. Although such artifacts are less severe, they still impose limitations, for example on navigation scenarios that require an accurate estimate of room boundaries.

Dense 3D reconstruction from RGB images traditionally involves estimating depth maps for the RGB images and fusing them, which is another potential source of error. Recent 3D reconstruction methods attempt to minimize this undesirable effect by directly predicting a TSDF (truncated signed distance function), a convenient way to represent a scene in 3D space; applying the marching cubes algorithm to this function recovers a mesh or point cloud corresponding to the scene (see, e.g., https://en.wikipedia.org/wiki/Signed_distance_function). Such methods extract image features with a 2D CNN, project them back into 3D space, aggregate them, and predict the final TSDF volume with a 3D CNN. Although this is an effective way to consider all input images jointly, using a TSDF by itself does not solve the problems with global scene structure and flat surfaces.
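The back-projection step mentioned above can be illustrated with a minimal sketch. It assumes a pinhole camera with intrinsics K, a world-to-camera transform, and nearest-neighbour sampling; the function name and all argument names are hypothetical and are not taken from any of the cited methods.

```python
import numpy as np

def backproject_features(feat, K, world_to_cam, voxel_centers):
    """Sample a 2D feature map at the projections of 3D voxel centers.

    feat:          (H, W, C) feature map from a 2D CNN (hypothetical input)
    K:             (3, 3) pinhole intrinsics (assumed camera model)
    world_to_cam:  (4, 4) rigid transform from world to camera coordinates
    voxel_centers: (N, 3) voxel centers in world coordinates
    Returns (N, C) sampled features and an (N,) boolean visibility mask.
    """
    H, W, C = feat.shape
    # Homogeneous world coordinates -> camera coordinates.
    ones = np.ones((len(voxel_centers), 1))
    cam = (world_to_cam @ np.concatenate([voxel_centers, ones], axis=1).T).T[:, :3]
    z = cam[:, 2]
    in_front = z > 1e-6
    # Perspective projection; points behind the camera get a dummy depth of 1.
    safe_z = np.where(in_front, z, 1.0)
    uv = (K @ cam.T).T
    u = np.round(uv[:, 0] / safe_z).astype(int)
    v = np.round(uv[:, 1] / safe_z).astype(int)
    visible = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = np.zeros((len(voxel_centers), C), dtype=feat.dtype)
    out[visible] = feat[v[visible], u[visible]]  # nearest-neighbour sampling
    return out, visible
```

Voxels that fall outside the image or behind the camera keep zero features and are flagged as not visible, so that a later aggregation step can ignore them.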

In the vast majority of rooms, the walls are flat and vertical, and the floor is flat and horizontal. This basic, non-restrictive knowledge about the geometry of an indoor scene can be used in the reconstruction process.

Recently, methods have been presented that reconstruct a 3D scene by estimating a TSDF volume directly from images. A TSDF volume is an effective way to consider all input images jointly, but fusing features from multiple views remains a challenging task. Accordingly, progress in TSDF reconstruction methods is mainly driven by the development of feature aggregation strategies.

Atlas [2] was the first end-to-end 3D reconstruction model: image features are back-projected and accumulated in a voxel volume, which is then passed to a 3D CNN that predicts a TSDF (truncated signed distance function) volume. A signed distance function (SDF) gives the orthogonal distance from a given point x to the boundary of a set Ω in a metric space, with the sign determined by whether x is inside Ω or not. The SDF is a standard way of encoding 3D space (the distance between points in space and the object in question). The TSDF is a modification of the SDF in which large distance values are truncated to speed up the 3D reconstruction algorithm. Signed distance functions are commonly used for 3D reconstruction and have appeared repeatedly in the prior art, for example in "A volumetric method for building complex models from range images", SIGGRAPH, 1996.
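As an illustration of the SDF and TSDF definitions above, the following sketch samples the signed distance to a sphere on a voxel grid and truncates it. The sign convention (negative inside), the normalization to [-1, 1], and all parameter values are assumptions chosen for the example.

```python
import numpy as np

def sphere_tsdf(grid_size=32, voxel_size=0.1, radius=1.0, trunc=0.3):
    """Return the SDF and TSDF of a sphere sampled at voxel centers."""
    # Voxel centers of a cube centred at the origin.
    axis = (np.arange(grid_size) - (grid_size - 1) / 2) * voxel_size
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    # SDF of a sphere: negative inside, positive outside (assumed convention).
    sdf = np.sqrt(x**2 + y**2 + z**2) - radius
    # Truncate distances beyond `trunc` and normalise to [-1, 1].
    tsdf = np.clip(sdf / trunc, -1.0, 1.0)
    return sdf, tsdf
```

Only voxels within the truncation distance of the surface carry distinct values; all others saturate at -1 or 1.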

Thanks to its simplicity, the TSDF volume is a widely used representation for 3D reconstruction, including in modern reconstruction approaches. Atlas fuses image features by plain averaging, a clearly suboptimal strategy that is revisited in subsequent approaches. NeuralRecon [3] uses a hierarchical fusion strategy: features from neighboring views are averaged and then fused across view clusters using an RNN.
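The plain averaging strategy can be sketched as a masked mean over views. This is an illustrative reading of "averaging", not the actual Atlas code; the array layout and the function name are assumptions.

```python
import numpy as np

def average_views(feats, visible):
    """Average back-projected features over the views that see each voxel.

    feats:   (V, N, C) per-view features for N voxels
    visible: (V, N) boolean mask, True where a view observes a voxel
    Returns (N, C); voxels seen by no view get zeros.
    """
    w = visible[..., None].astype(feats.dtype)
    count = w.sum(axis=0)
    # Guard against division by zero for voxels that no view observes.
    return (feats * w).sum(axis=0) / np.maximum(count, 1)
```

Every visible view contributes equally here, which is what makes the strategy simple and also why attention-based fusion was later explored.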

NeuralRecon achieves real-time performance on sequential inputs. VoRTX [1] overcomes the limitations of sequential processing by fusing all views together.

VoRTX introduces a transformer model for TSDF reconstruction and includes an attention mechanism at the level of 2D features to fuse multiple views. Another transformer-based model, TransformerFusion [4], uses voxel-level attention to attend to the most informative features in images taken from different views.

Surface normals represent the scene geometry in a way that is, in a sense, complementary to depth maps. They can therefore be used to constrain depth estimates or treated as a separate source of spatial information, and the use of normals for 3D reconstruction has been actively studied in recent years. VolSDF [6] and NeuS [7] minimize a photometric loss and additionally constrain the SDF with an eikonal loss, which encourages the normals, estimated as gradients of the SDF, to have a norm equal to 1. NeuRIS [12] regularizes the predicted normals using prior normals predicted by a learned method, and uses multi-view photometric consistency between normals and depths to filter out unreliable constraints. NeuralRoom [18] also uses a normal estimation network and applies a normal loss to textureless regions, which cannot be effectively constrained by the photometric consistency loss because of the ambiguity between shape and brightness. Unlike NeuRIS and NeuralRoom, the proposed invention obtains normals directly from the predicted TSDF itself.
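The eikonal constraint mentioned above can be sketched on a regular voxel grid. Note the assumption: VolSDF and NeuS apply it to a neural implicit function, whereas this example uses finite differences on a sampled SDF purely for illustration; the function name is hypothetical.

```python
import numpy as np

def eikonal_loss(sdf, voxel_size):
    """Mean squared deviation of the SDF gradient norm from 1.

    sdf: (X, Y, Z) signed distances sampled on a regular grid.
    """
    # Finite-difference gradient along each axis with the given spacing.
    gx, gy, gz = np.gradient(sdf, voxel_size)
    grad_norm = np.sqrt(gx**2 + gy**2 + gz**2)
    return float(np.mean((grad_norm - 1.0) ** 2))
```

A true distance field has a unit-norm gradient almost everywhere, so the loss is close to zero for it and grows when the field is scaled or distorted.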

In addition, various ways of applying spatial segmentation to 3D reconstruction have been considered.

SceneCode [19] obtains a segmentation representation with a VAE conditioned on an RGB image, and solves the problem of fusing segmentation labels by jointly optimizing spatially aware low-dimensional codes of overlapping images.

A number of methods [13-17] recognize objects and replace them with CAD models. Although the results are visually plausible, they can hardly recover the real scene; rather, they create a 3D model that more or less resembles it. The proposed invention, by contrast, aims to reconstruct the real scene.

Manhattan-SDF [5], the closest analogue of the proposed normal-segmentation regularization (NSR), focuses on improving the reconstruction of low-texture regions. Manhattan-SDF identifies floor and wall regions with a pre-trained segmentation network and encourages the surface normals of floors and walls to be collinear with three dominant directions, so that the resulting reconstruction satisfies the Manhattan-world assumption. However, indoor benchmarks such as ScanNet contain non-Manhattan scenes with more than three dominant directions, to which Manhattan-SDF is not applicable.

Recent volumetric 3D reconstruction methods face an undesirable trade-off between relying on the input data and recovering the correct scene geometry. Even minor errors or inconsistencies in the input data can corrupt the reconstructed scene. Global artifacts can appear as distorted room shapes, and reconstructed surfaces can contain local artifacts such as dents, holes, and bumps. Fortunately, some of these problems can be addressed using prior knowledge about the scene, since most rooms are bounded by flat vertical walls and a flat horizontal floor.

Summary of the invention

A method is proposed for reconstructing a 3D scene and visualizing it using a neural network consisting of a base neural network, which includes a skeleton and a TSDF head, and a segmentation head, the method comprising the steps of:

training the base neural network to obtain a TSDF volume for scene voxels, for which the following steps are performed:

- inputting training data, including a training sequence of RGB frames with the corresponding camera position data, into the skeleton;

- calculating the TSDF loss between the TSDF prediction, obtained by the TSDF head from the skeleton output, and the reference scan;

- obtaining, by means of the segmentation head, from the skeleton output a segmentation prediction that assigns each voxel to one of the regions "floor", "wall", or "other";

- calculating the segmentation loss between the segmentation prediction and the segmentation reference;

- calculating the coordinates of the normals as gradients of the TSDF prediction for the TSDF values in each voxel, over all voxels of the scene;

- calculating the normal loss for the normals in wall regions and floor regions based on the TSDF reference results and the gradients of the TSDF prediction;

- calculating the total loss function as the sum of the TSDF loss, the segmentation loss, and the normal loss;

- minimizing the total loss function;

using the trained base neural network to obtain a TSDF volume for an input sequence of RGB frames of a real scene;

applying to the TSDF volume an algorithm that computes the reconstruction of the 3D scene;

rendering the reconstruction of the 3D scene to obtain a visualization of the 3D scene.
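The step of calculating normals as gradients of the predicted TSDF can be sketched as follows. The use of central finite differences and the normalization to unit length are assumptions made for this example; the text only states that the normals are the gradients of the TSDF prediction.

```python
import numpy as np

def tsdf_normals(tsdf, eps=1e-8):
    """Estimate per-voxel normals as the normalised gradient of a TSDF volume.

    tsdf: (X, Y, Z) array of predicted TSDF values.
    Returns (X, Y, Z, 3) unit normals; zero where the gradient vanishes.
    """
    # np.gradient returns one array per axis; stack them into a vector field.
    grad = np.stack(np.gradient(tsdf), axis=-1)
    norm = np.linalg.norm(grad, axis=-1, keepdims=True)
    # Normalising to unit length is an assumption of this sketch.
    return np.where(norm > eps, grad / np.maximum(norm, eps), 0.0)
```

For a horizontal floor whose TSDF grows with height, every normal points along the z-axis, which is the configuration the regularization is meant to encourage.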

The algorithm that computes the reconstruction of the 3D scene is the marching cubes algorithm. The segmentation head and the TSDF head are used in parallel during training. The method comprises additional steps, performed after the step of calculating the coordinates of the normals, in which:

normals in wall regions are selected and their vertical components are calculated;

normals in floor regions are selected and their horizontal components are calculated.

The step of calculating the normal loss may be as follows:

at each point in the regions where the segmentation head predicts "wall", the vertical component of the normal vector is considered, and the length of the z-component of the normal vector is added to the normal loss,

and at each point in the regions where the segmentation head predicts "floor", the horizontal components of the normal vector, i.e. its x- and y-components, are considered, and the norm of the two-dimensional vector formed by these two components is added to the normal loss.
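The normal loss described above can be sketched directly from its definition. The label ids, the choice of z as the vertical axis, and summing (rather than averaging) over voxels are assumptions of this example.

```python
import numpy as np

FLOOR, WALL, OTHER = 0, 1, 2  # hypothetical label ids

def normal_loss(normals, labels):
    """Penalise non-vertical walls and non-horizontal floors.

    normals: (..., 3) normal vectors, z assumed to be the vertical axis
    labels:  (...) integer segmentation prediction per voxel
    """
    wall = labels == WALL
    floor = labels == FLOOR
    # Wall normals should be horizontal: penalise the length of the z-component.
    wall_term = np.abs(normals[wall][:, 2]).sum()
    # Floor normals should be vertical: penalise the norm of the (x, y) part.
    floor_term = np.linalg.norm(normals[floor][:, :2], axis=-1).sum()
    return float(wall_term + floor_term)
```

Voxels labelled "other" contribute nothing, so the regularization leaves furniture and other objects unconstrained.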

The total loss function can be calculated as the sum of the TSDF loss, the segmentation loss, and the normal loss. The total loss function can be minimized by computing its gradient with respect to all parameters of the base neural network. The method may additionally include backpropagating the error of the total loss function and updating the parameters of the base neural network accordingly. The training steps are repeated until the total loss function stops decreasing. In another embodiment, the training steps are repeated until the total loss function reaches a specified threshold value.
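The two stopping rules above can be sketched with a toy training loop. Everything here is hypothetical: the single scalar parameter stands in for the network weights, the three quadratic terms stand in for the TSDF, segmentation, and normal losses, and the patience and tolerance values are arbitrary.

```python
def train(step_fn, max_steps=1000, patience=5, min_delta=1e-6, threshold=None):
    """Call step_fn() until the loss stops decreasing or reaches a threshold.

    step_fn performs one optimisation step (forward pass, backpropagation,
    parameter update) and returns the total loss as a float.
    """
    best, stale, history = float("inf"), 0, []
    for _ in range(max_steps):
        loss = step_fn()
        history.append(loss)
        if threshold is not None and loss <= threshold:
            break  # second embodiment: loss reached the given threshold
        if loss < best - min_delta:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break  # first embodiment: loss stopped decreasing
    return history

def make_toy_step(lr=0.1):
    w = [0.0]  # single trainable parameter standing in for network weights
    def step():
        d = w[0] - 3.0
        # Toy stand-ins for the TSDF, segmentation and normal loss terms.
        total = d**2 + 0.5 * d**2 + 0.25 * d**2
        w[0] -= lr * 3.5 * d  # gradient step on the summed loss
        return total
    return step
```

Calling train(make_toy_step(), threshold=1e-3) exercises the threshold rule, while train(make_toy_step()) relies on the plateau rule.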

A computing device is also proposed, comprising a processor and a memory that stores instructions for performing the steps of the proposed method.

At least one of the plurality of modules may be implemented as a neural network. A function associated with the neural network may be performed by means of a non-volatile memory, a volatile memory, and a processor.

The processor may include one or more processors. The one or more processors may be a general-purpose processor, such as a central processing unit (CPU), an application processor (AP) or the like, a graphics-only processing unit, such as a graphics processing unit (GPU) or a visual processing unit (VPU), and/or a dedicated artificial-intelligence processor, such as a neural processing unit (NPU).

The one or more processors control the processing of input data in accordance with a predetermined operating rule or an artificial intelligence (AI) network stored in the non-volatile memory or the volatile memory and implemented by hardware. The hardware is a machine-readable medium on which software is stored.

In this context, being provided through training means that a predefined operating rule or a neural network with the desired characteristics is created by applying a training algorithm to a set of training data. The training may be performed in the device itself in which the neural network processing according to an embodiment is carried out, and/or may be implemented on a separate server/system.

A neural network (base neural network) may consist of a plurality of neural network layers. Each layer has a plurality of weight values and performs its layer operation by computing on the output of the previous layer and the plurality of weights. Examples of base neural networks include, without limitation, a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.

A training algorithm is a method of training a predetermined target device (e.g., a robot) using a set of training data so as to cause, allow, or control the target device to make a determination or prediction. Examples of training algorithms include, without limitation, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

In addition, the proposed method, performed by an electronic device, can be implemented using an artificial intelligence model.

Visual understanding is a technique for recognizing and processing things in a manner similar to human vision, and it includes, for example, object recognition, object tracking, image retrieval, human recognition, scene recognition, 3D scene visualization, and image enhancement.

Brief description of the drawings

The above and/or other aspects will become more apparent from the description of exemplary embodiments with reference to the accompanying drawings, in which:

Fig. 1 illustrates the proposed training procedure for the modified 3D reconstruction methods.

Fig. 2 illustrates visualizations of 3D scenes reconstructed using the original VoRTX (a) and the proposed VoRTX+NSR (b).

Подробное описание изобретенияDetailed description of the invention

Настоящее изобретение может найти применение в системах визуальной аналитики внутри помещений, мобильных и наземных системах 3D-сканирования для различных сцен внутри помещений, например, квартир, офисов, торговых помещений и т.п. Предлагаемое изобретение направлено на обнаружение и выравнивание структурных элементов, таких как вертикальные и горизонтальные поверхности (например, пол и стены) в помещениях и может использоваться при 3D-реконструкции и визуализации 3D-сцены.The present invention can find application in indoor visual analytics systems, mobile and terrestrial 3D scanning systems for various indoor scenes, such as apartments, offices, retail spaces, etc. The proposed invention is aimed at detecting and aligning structural elements, such as vertical and horizontal surfaces (e.g., floors and walls) in rooms and can be used in 3D reconstruction and visualization of a 3D scene.

The following terms are used in this description.

A 3D scene reconstruction is a point cloud or a mesh, i.e. a mathematical representation of a 3D object, obtained by applying known algorithms (e.g. marching cubes) to the TSDF volume.

A 3D scene visualization is a rendering, for example on a display device, of said 3D scene reconstruction, i.e. a two-dimensional image showing what the object represented by the scene point cloud or scene mesh looks like.

Rendering is a method of obtaining said 3D scene visualization from the 3D scene reconstruction.

The base neural network is any known neural network that obtains a TSDF volume for scene voxels from a sequence of RGB frames with camera positions. The base neural network includes a skeleton (backbone) and a TSDF head.

NSR (Normal Segmentation Regularization) is proposed: an automatic modification of methods that perform scene reconstruction by predicting a Truncated Signed Distance Function (TSDF). The proposed regularization can be integrated into any known reconstruction method (base neural network) that predicts a TSDF from a sequence of frames with camera positions. According to the present invention, NSR involves several components. One of them is a 3D sparse convolutional segmentation module (segmentation head). In addition, NSR includes computing the normals, identifying the normals in wall and floor regions, and computing their deviations from the horizontal/vertical, as will be described below.

During training, the proposed modification takes the scene structure into account by detecting walls and floors in the scene and penalizing the corresponding surface normals for deviating from the horizontal and vertical directions, respectively (the penalty reflects the deviation of the normals from the required directions), as will be described below.

The proposed invention eliminates global artifacts in the reconstruction of a 3D scene, namely distortions of the room shape. In addition, it removes some local artifacts, such as depressions, holes and elevations, from the 3D scene reconstruction and its visualization.

At present, the proposed invention can be applied to data captured by smartphones with tracking systems (e.g., ARCore for Android smartphones) or to any video data with a known camera trajectory (e.g., captured with a RealSense T265). A user can shoot a video of a room, for example with a smartphone, and use the proposed invention to obtain a 3D scene reconstruction in the form of a scene mesh or a scene point cloud, which after rendering becomes a 3D scene visualization that the user can view on a screen.

The mesh or point cloud can be rendered so that the visualization (image) of the scene is presented in the form the user requires. The result can be used, for example, in VR/AR applications (games, interior design, real estate or navigation applications). The minimum set of components is: a camera, a tracking module that estimates the position and angle of the camera, and a storage device with a processor capable of running neural networks.

The main technical result provided by the present invention is as follows:

- NSR is proposed, which is a modification of 3D scene reconstruction methods. It includes a new trainable module and an associated procedure for training the base neural network with normal and segmentation losses on a dataset that provides video with camera positions and angles, reference reconstructions, and segmentation annotations of floors and walls. Training yields a base neural network that takes a video with camera positions and angles as input and predicts a TSDF and a segmentation into 3 classes (floor, walls, other). Once the base neural network has been trained, the segmentation prediction is no longer needed for its main purpose (identifying walls, floors, etc.), and only the TSDF prediction is used, from which the mesh can be computed. NSR can be embedded in an arbitrary trainable model (neural network) that outputs a TSDF and is trained end-to-end on semantically annotated 3D data. Publicly available datasets of this kind exist (training sequences of RGB frames with corresponding camera position data), for example ScanNet [ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes, CVPR, 2017], which was used to train the base neural network in the invention. The concept of "supervised learning" is implemented, i.e., for the base neural network to predict the TSDF and segmentation classes correctly, it must receive examples of such data during training.

- NSR is applied to a number of state-of-the-art reconstruction approaches and demonstrates performance gains and improved reconstruction quality over state-of-the-art reconstruction methods on several datasets. The proposed development is a regularization that can be applied to an arbitrary indoor scene reconstruction method that predicts a TSDF from video with camera positions.

The main contribution of the present invention is a new geometric segmentation regularization, which consists in introducing auxiliary normal losses that are optimized together with an additional segmentation head. The training blocks proposed in the invention for the base neural network improve the quality of scene reconstruction (coverage, precision, recall, F1-score) and yield smoother and more even reconstructions with fewer "depressions" and "elevations".

The 3D reconstruction itself is not computed during training. During training, the TSDF prediction, the segmentation prediction and the normal prediction are computed, the loss function values are computed, and they are minimized. However, once a TSDF has been predicted for a scene, the 3D reconstruction can be computed from it (using the marching cubes algorithm). This is not necessary during training, but it is done once the network has been trained in order to obtain the reconstruction.

The trained base neural network considered in the invention is characterized in that its output (when the marching cubes algorithm is also applied) yields a point cloud or a triangle mesh corresponding to the given scene. That is, for a point cloud the result is a set of scene points (their coordinates and colors), and for a mesh the result additionally contains triangular faces connecting some triplets of these points (i.e., a surface made of triangles).
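The two output representations described above can be sketched as simple data structures. This is only an illustrative sketch: the class names `PointCloud` and `TriangleMesh`, the field names, and the helper `mesh_to_point_cloud` are hypothetical and are not taken from the invention.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]   # (x, y, z) coordinates of a scene point
Color = Tuple[int, int, int]         # (r, g, b) color of a scene point


@dataclass
class PointCloud:
    """A set of scene points with their coordinates and colors."""
    points: List[Point]
    colors: List[Color]


@dataclass
class TriangleMesh:
    """Scene points plus triangular faces connecting triplets of them."""
    vertices: List[Point]
    colors: List[Color]
    faces: List[Tuple[int, int, int]]  # each face holds three vertex indices


def mesh_to_point_cloud(mesh: TriangleMesh) -> PointCloud:
    """Drop the faces and keep only the points (hypothetical helper)."""
    return PointCloud(points=list(mesh.vertices), colors=list(mesh.colors))


# A unit square floor patch built from two triangles.
floor_patch = TriangleMesh(
    vertices=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)],
    colors=[(200, 200, 200)] * 4,
    faces=[(0, 1, 2), (0, 2, 3)],
)
floor_cloud = mesh_to_point_cloud(floor_patch)
```

The mesh carries strictly more information than the point cloud, which is why a mesh can always be reduced to a point cloud but not the other way round without a surface extraction step.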

In brief, the procedure for obtaining a 3D scene visualization is as follows:

- a base neural network, trained as described below, is used to obtain a TSDF volume from an input sequence of RGB frames of a real scene;

- an algorithm that computes the 3D scene reconstruction (a triangle mesh representation) is applied to the TSDF volume;

- the 3D scene reconstruction is rendered to obtain the 3D scene visualization.

Implemented as a sparse 3D convolutional module, NSR can be embedded in an arbitrary trainable model that outputs a TSDF, and it can be trained end-to-end on point clouds with 3D segmentation annotations.

During inference, when the already trained base neural network is applied to new (previously unseen) data, the modules used only for training the base neural network are not required, so the use of NSR during training imposes no restrictions on application scenarios.

Also proposed is a hardware means in the form of a machine-readable medium storing a software product that implements the method for reconstructing a 3D scene using TSDF normal segmentation regularization when training the base neural network, wherein the proposed training steps of the method are suitable for any existing base neural network for 3D reconstruction. More specifically, the proposed training steps can train any base neural network that reconstructs a 3D scene by predicting a TSDF from RGB video with camera positions. A color video of the scene and the camera positions for each frame of this video are fed to the input of the trained base neural network, and the result, a fixed 3D reconstruction of this scene, is obtained by applying the algorithm to the TSDF volume output by the trained base neural network.

The present invention can be used with any device capable of shooting video and estimating the camera trajectory (camera positions and angles for each video frame). This can be a tracking camera (e.g., RealSense T265), a smartphone (e.g., an Android smartphone with Google ARCore), a robot vacuum cleaner equipped with a video camera and a tracking system, or even just a video camera to whose data trajectory estimation methods can be applied.

Fig. 1 shows the proposed training procedure for modified 3D reconstruction methods. In brief, the key concept of the proposed invention is a method for training an artificial intelligence (AI) network (the base neural network) to perform 3D scene reconstruction using NSR (normal segmentation regularization), comprising the following steps:

Step 1: image feature values are extracted and a representation of the 3D space is predicted using the neural network. It should be noted that image feature values are the data elements of greatest interest that are passed through and analyzed by the neural network, i.e. they are internal (hidden, intermediate) representations of the neural network. This step is performed by blocks 2, 3, 5, 6, 9 shown in Fig. 1. As will be described in detail below, blocks 2, 3, 5, 6, 9 input the training data to the skeleton and apply the TSDF head to calculate the TSDF loss.

Step 2 is used only when training the base neural network: the space is segmented using segmentation information (e.g. floor/wall). This step is performed by blocks 4, 7, 8, 11, 12 shown in Fig. 1. As will be described in detail below, blocks 4, 7, 8, 11, 12 apply the segmentation head, calculate the segmentation loss, and determine the floor and wall regions.

Step 3 is used only when training the base neural network: the base neural network is trained so that the detected floor and walls are flat. This step is performed by blocks 10, 11, 13, 14, 15, 16 shown in Fig. 1. As will be described in detail below, blocks 10, 11, 13, 14, 15, 16 calculate the coordinates of the normals as TSDF gradients; select the normals in the wall regions and calculate their vertical components; select the normals in the floor regions and calculate their horizontal components; calculate the ordinary normal loss; and perform backpropagation of the error, updating the parameters of the neural network according to the computed gradient so as to reduce the total loss function.

The above steps are performed repeatedly while the base neural network is being trained. However, when the already trained base neural network operates, only the first step is performed. The training data are used only for training; when the trained base neural network operates, its input consists of RGB frames and the camera positions in space. Thus, blocks 4, 7, 10, 11, 12, 13, 14, 15, 16 are not used in the operation of the trained base neural network.

Consequently, the base neural network is trained in a new way. Once the base neural network has been trained in this way, its operation yields a 3D scene reconstruction without artifacts in the floor and wall regions. When the base neural network operates after the proposed training, the 3D scene reconstruction is output in the usual way, exactly as in the unmodified version of the base neural network.

Thus, NSR is a normal segmentation regularization that is applied only during training. When a 3D scene reconstruction is obtained from the already trained base neural network, the network used is the same base neural network, only better trained thanks to NSR. NSR is a modification of 3D scene reconstruction methods that relies on the assumption of a conventional scene structure.

In operation, the trained base neural network simply determines the TSDF value in every voxel; however, thanks to training with NSR, the resulting 3D reconstruction has even floors and walls.

Walls and a floor are present in the scene being reconstructed, and during training NSR imposes constraints on their normals in steps 2 and 3. These constraints are as follows:

- normals (i.e. perpendiculars) to the wall surfaces should be as horizontal as possible, so that the walls are vertical, and

- normals to the floor surface should be as vertical as possible, so that the floor is horizontal.

The requirement "as horizontal/vertical as possible" means that deviation of the normals from the required direction is penalized, i.e. a term corresponding to the deviation of the normals from the required directions is added to the total loss function being minimized. The deviation is the difference between the current value and the ideal value. This difference can be zero, which corresponds to no deviation, but even for a high-quality 3D model it will be a non-zero number, so mathematically it acts as an additional term. The total loss function can be minimized by any known method; the present invention uses gradient methods, for example adaptive gradient descent, namely the Adam method [Adam: A Method for Stochastic Optimization, https://arxiv.org/abs/1412.6980].
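For readers unfamiliar with the Adam method cited above, its update rule can be sketched on a toy problem. This is a minimal sketch, not the training code of the invention: the scalar loss (w - 3)^2, the function name `adam_minimize`, and the hyperparameter values (the defaults suggested in the Adam paper, except for a larger learning rate) are illustrative assumptions.

```python
import math


def adam_minimize(grad, w, steps=1000, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a scalar function with Adam, given its gradient `grad`."""
    m = 0.0  # running average of the gradient (first moment)
    v = 0.0  # running average of the squared gradient (second moment)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction compensates for m and v being initialized at zero.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy loss (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
```

In an actual training run the scalar `w` is replaced by all weights of the base neural network, and the gradient comes from backpropagating the total loss function.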

The individual loss functions consist of values corresponding to the degree of difference between the predicted TSDF and the reference (TSDF loss), the degree of difference between the predicted semantic labels and the reference semantics (semantic loss), the degree of difference between the surface normals and the reference normals (ordinary normal loss), and the degree of non-verticality of the walls and non-horizontality of the floor (the added normal deviations). These individual loss functions (penalties), which correspond to different quality criteria of the 3D model, are combined into a total loss function that is minimized while training the base neural network.

The total loss function describes a mapping that takes a prediction as input and returns a value such that the better the prediction, the smaller the value. In other words, the total loss function is a mathematical measure of how much the current base neural network differs from the "ideal" one. The smaller the value of the total loss function, the better, so it is minimized: the parameters of the base neural network, such as its weights, are chosen so that the value of the total loss function is as small as possible (for example, the closer the projection of a normal is to 0, the closer the normal is to the ideal). The total loss function is used only for training the base neural network. The total loss function contains all the constraints imposed on the 3D model. Namely, as will be described in detail below, it contains the constraints on the TSDF (these are taken from the base neural network and may include the degree of difference between the predicted TSDF and the reference, the degree of difference between the prediction of whether a given voxel contains a scene point and the reference, etc.). In addition, the total loss function contains terms for the difference between the predicted semantic labels and the reference, the difference between the predicted normals and the reference, the difference of the length of the normals from one, and the deviation of the wall/floor normals from the horizontal/vertical directions.
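The way the individual penalties combine into one value to be minimized can be sketched as follows. This is a hedged sketch: the description states only that the total loss is formed from the individual losses, so the plain weighted sum, the function name `total_loss`, the term names, the optional weights, and the numeric values below are all illustrative assumptions.

```python
def total_loss(terms, weights=None):
    """Combine individual loss values into one scalar.

    `terms` maps a loss name to its current value; `weights` optionally maps
    a loss name to a multiplier (assumed to be 1.0 when not given).
    """
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in terms.items())


# Hypothetical loss values for one training batch.
batch_terms = {
    "tsdf": 0.40,             # predicted TSDF vs. reference
    "segmentation": 0.25,     # predicted semantic labels vs. reference
    "normal": 0.10,           # predicted normals vs. reference normals
    "normal_length": 0.05,    # deviation of normal length from one
    "wall_deviation": 0.02,   # vertical component of wall normals
    "floor_deviation": 0.03,  # horizontal component of floor normals
}
unweighted = total_loss(batch_terms)
reweighted = total_loss(batch_terms, weights={"wall_deviation": 2.0})
```

Because every term is non-negative and smaller is better for each of them, reducing the sum pushes all the quality criteria in the desired direction at once.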

The deviation of wall normals from the horizontal direction is equal to the length of their vertical component (i.e. the projection of the wall normal onto the z axis); that is, if the normal to a wall is perfectly horizontal, its vertical component is 0. The deviation of floor normals from the vertical direction is equal to the length of their horizontal component (i.e. the projection of the floor normal onto the x-y plane); that is, if the normal to the floor is perfectly vertical, its horizontal component is 0.
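The two deviations defined above can be sketched directly. This is a minimal sketch assuming unit-length normals given as (x, y, z) tuples with z as the vertical axis; the label strings, the function names, and the choice to average the deviations per class are illustrative assumptions rather than details stated in the description.

```python
import math

WALL, FLOOR, OTHER = "wall", "floor", "other"  # hypothetical class labels


def wall_deviation(normal):
    """Length of the vertical (z) component; 0 for a perfectly horizontal normal."""
    return abs(normal[2])


def floor_deviation(normal):
    """Length of the projection onto the x-y plane; 0 for a perfectly vertical normal."""
    return math.hypot(normal[0], normal[1])


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def deviation_penalty(normals, labels):
    """Average wall and floor deviations over the voxels labeled as such."""
    wall = [wall_deviation(n) for n, lab in zip(normals, labels) if lab == WALL]
    floor = [floor_deviation(n) for n, lab in zip(normals, labels) if lab == FLOOR]
    return _mean(wall) + _mean(floor)


normals = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.6, 0.0, 0.8), (0.0, 1.0, 0.0)]
labels = [WALL, FLOOR, FLOOR, OTHER]
penalty = deviation_penalty(normals, labels)
```

In this example the wall normal and the first floor normal are ideal, the second floor normal is tilted, and the voxel labeled "other" does not contribute to the penalty at all.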

The TSDF is computed in the voxel volume of the scene, and the distance from each voxel to the nearest surface is stored as the TSDF value of that voxel. The surface normals in the voxel volume of the scene are computed as the gradients of the TSDF (since the reconstructed surface is a level surface of the TSDF, i.e. the solution of the equation TSDF=0); they are likewise computed and stored for each voxel.
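As an illustration of a TSDF stored per voxel, the sketch below builds a truncated signed distance volume for a synthetic sphere. The grid size, voxel size, radius and truncation distance are assumed values chosen for the example, and the convention that the distance is negative inside the object follows the sign convention used in this description.

```python
import numpy as np

def sphere_tsdf(grid_size=32, voxel_size=0.1, radius=1.0, truncation=0.3):
    """Truncated signed distance to a sphere centred in the voxel volume.

    Values are negative inside the object and positive outside.
    """
    half = (grid_size - 1) / 2.0
    coords = (np.arange(grid_size) - half) * voxel_size
    x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
    signed_distance = np.sqrt(x**2 + y**2 + z**2) - radius
    # Truncation: distances far from the surface are clamped.
    return np.clip(signed_distance, -truncation, truncation)

tsdf = sphere_tsdf()
```

The reconstructed surface corresponds to the voxels where the stored value changes sign, i.e. to the level TSDF=0.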

Normals are perpendicular to the surface by definition, but the surface may be imperfect: for example, the floor may be inclined rather than horizontal or may have depressions or elevations, and the walls may not be vertical. Artifacts in scene reconstruction arise from many factors: inaccurate camera trajectory estimation, an insufficiently trained base neural network, or insufficient coverage of the scene by frames, which can leave part of the scene "invisible". The present invention relies on the natural assumption that in a real scene the walls of rooms are vertical and free of artifacts and the floors are horizontal and free of artifacts, and training is built on this assumption.

The 3D model of the scene is a point cloud or a mesh. The 3D reconstruction of the scene is performed from a sequence of frames representing images of the scene and the corresponding camera positions and viewing angles for each frame of the sequence, based on the 3D model of the scene. The image frames are captured by a camera, and the camera parameters are estimated by a tracking system or by one of the localization methods applied to video in the art.

The base neural network to which the proposed invention can be applied consists of a "skeleton", the main part that computes features, and a TSDF "head". The heads compute predictions from these features: the TSDF prediction is computed by the TSDF head, and the segmentation is computed by the segmentation head. In other words, an arbitrary base neural network taken for applying the proposed method initially has a skeleton and a TSDF head, and a segmentation head is added for training. For example, in Atlas [2] the skeleton comprises the 2D CNN that predicts features in the images, the projection into 3D and the encoding part of the 3D network, while the remaining decoding part performs the reconstruction. VoRTX [1] has a similar structure but additionally contains an intermediate transformer network for feature fusion.

The proposed invention is a method for visualizing a 3D scene using a neural network that consists of a base neural network, including a skeleton and a TSDF head, and a segmentation head. The method is based on training with segmentation-based regularization of TSDF normals (i.e. on identifying semantic structural elements and aligning them) and comprises the following steps:

Training the base neural network (as shown in Fig. 1):

- Training data, namely training RGB frames with the corresponding camera data in the form of input tensors (block 1 in Fig. 1), are fed into the skeleton (block 2 in Fig. 1). The skeleton processes the input tensors and computes intermediate features, which are subsequently used for TSDF prediction and segmentation. The skeleton outputs are intermediate representations of the input, which can be regarded as internal features of the input sequence.

- The TSDF head (block 3 in Fig. 1) is applied to the intermediate features produced by the skeleton. The TSDF head predicts the TSDF volume, i.e. a TSDF value in each voxel of the corresponding scene. For each voxel in space, the distance from that voxel to the surface of the reconstructed object must be determined. The predicted sign of the TSDF indicates on which side of the surface the voxel lies: if the voxel is inside the object, the distance is taken with a minus sign, otherwise with a plus sign. Block 6 represents the output of block 3, i.e. the TSDF prediction for the entire voxel volume.

- The TSDF loss is computed between the TSDF prediction and the reference scan (from block 5 in Fig. 1). The computation is performed after the skeleton in a separate "TSDF loss" module (block 9 in Fig. 1), which takes the TSDF prediction and the reference scan as input. The reference scan is a "correct" scan obtained with a laser scanner, a depth sensor or another known method; it is the ideal reconstruction that any base neural network being trained strives to reproduce. The loss measures how much the current reconstruction differs from the ideal (the reference). The base neural network is trained to reconstruct the scene as accurately as possible: the loss is computed and a gradient-based optimization method is applied to minimize the total loss function. The parameters of the base neural network are thus iteratively updated to reduce the loss and bring the predictions as close to the "ideal" as possible.

The segmentation head (block 4 in Fig. 1) is applied to the intermediate features output by the skeleton to obtain a segmentation prediction (block 7 in Fig. 1). The segmentation head consists of several layers that are added to the base neural network and predict a segmentation label in each voxel. A segmentation label is the label of the class to which the voxel belongs, in this case "floor", "wall" or "other".

- The segmentation loss (block 12 in Fig. 1) is computed between the segmentation prediction (block 7 in Fig. 1) and the segmentation reference (block 8 in Fig. 1). The segmentation reference is known for the training set.

The TSDF head and the segmentation head can operate in parallel. The TSDF loss is computed from the TSDF prediction (i.e. after it). Similarly, the segmentation loss is computed from the segmentation prediction (i.e. after it).

All computed losses (the TSDF loss, the segmentation loss and the normal-dependent losses described below) are added to the total loss function, which is optimized during training of the neural network. The computed losses are used for backpropagation, i.e. for updating the parameters of the base neural network according to the gradient of the total loss function with respect to these parameters.

The floor and wall regions are determined from the segmentation prediction (block 11 in Fig. 1).

That is, segmentation identifies the floor and wall regions, i.e. the voxels for which the label "floor" or "wall" is predicted. For each point in space, it is determined whether it belongs to a wall, the floor or something else. The wall and floor regions are thus determined in a single pass.
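A minimal sketch of this single-pass region selection is given below. It assumes that the segmentation prediction is available as per-voxel class scores of shape (classes, X, Y, Z) and that the classes are indexed as shown; both the tensor layout and the class indices are assumptions made for illustration, since the description only names the classes.

```python
import numpy as np

# Hypothetical class indices; the description only names the classes.
FLOOR, WALL, OTHER = 0, 1, 2

def floor_wall_masks(seg_scores):
    """Turn per-voxel class scores (classes, X, Y, Z) into boolean region masks."""
    labels = np.argmax(seg_scores, axis=0)  # predicted label in every voxel
    return labels == FLOOR, labels == WALL

# Random scores stand in for the output of the segmentation head.
rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 4, 4, 4))
floor_mask, wall_mask = floor_wall_masks(scores)
```

Because every voxel receives exactly one label, the two masks never overlap, and both are obtained from one pass over the prediction.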

The coordinates of the normals are computed as TSDF gradients (a normal is a 3D vector) (block 10 in Fig. 1). First, normals are computed for all voxels in the space (scene): the direction of the surface normal in each voxel is computed as the gradient of the predicted TSDF. If a surface is a level surface (isosurface) of some function, here the zero level surface of the TSDF, i.e. the solution of TSDF=0, then the gradients of that function are perpendicular to the surface, i.e. they are normals.
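The computation of normals as TSDF gradients can be sketched with finite differences. The example below uses `numpy.gradient` on a synthetic, untruncated volume that represents a horizontal floor; the helper name, the grid size and the voxel size are assumptions, and a trained network would more likely differentiate its predicted TSDF inside the training framework.

```python
import numpy as np

def tsdf_normals(tsdf, voxel_size):
    """Return an (X, Y, Z, 3) array holding the TSDF gradient in every voxel."""
    gx, gy, gz = np.gradient(tsdf, voxel_size)  # central differences per axis
    return np.stack([gx, gy, gz], axis=-1)

# Toy volume: a horizontal floor at height z = 0.5, so TSDF = z - 0.5.
voxel_size = 0.1
z = np.arange(16) * voxel_size
tsdf = np.broadcast_to(z - 0.5, (16, 16, 16)).copy()
normals = tsdf_normals(tsdf, voxel_size)
```

For this flat floor every gradient is (0, 0, 1): it is perpendicular to the level surface and has unit length, as expected for a distance function.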

Once all normals have been computed, those corresponding to wall voxels are selected (the segmentation head predicts which voxels are walls, which are floor and which are everything else). For the normals in wall voxels, the vertical component is computed. Similarly, for the normals in floor voxels, the horizontal component is computed. Namely:

The normals in the wall regions are selected (block 15 in Fig. 1), i.e. the normals corresponding to voxels segmented as "wall", and their vertical components, i.e. the z-components of the normals, which are the projections of the normals onto the vertical axis, are computed (block 14 in Fig. 1) from the data obtained from blocks 10 and 11. If a wall is perfectly flat, the normals at all points of the wall are the same; otherwise they may differ. The normals in the floor regions are selected (block 15 in Fig. 1), i.e. the normals corresponding to voxels segmented as "floor", and their horizontal components, i.e. the projections of the normals onto the horizontal plane Oxy, are computed (block 16 in Fig. 1).
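The selection of normals by region and the computation of their penalized components can be sketched as follows. The array layout, the function name and the use of plain sums over voxels are assumptions; the description states only that each wall or floor voxel contributes its deviation.

```python
import numpy as np

def wall_floor_penalties(normals, wall_mask, floor_mask):
    """normals: (X, Y, Z, 3) gradients; masks: boolean (X, Y, Z) regions."""
    # Wall voxels: length of the vertical (z) component.
    wall_term = np.abs(normals[wall_mask][:, 2]).sum()
    # Floor voxels: length of the horizontal (x, y) component.
    floor_term = np.linalg.norm(normals[floor_mask][:, :2], axis=1).sum()
    return float(wall_term), float(floor_term)

# Toy volume: four ideal wall voxels and four tilted floor voxels.
normals = np.zeros((2, 2, 2, 3))
normals[0] = (1.0, 0.0, 0.0)   # perfectly horizontal wall normals
normals[1] = (0.6, 0.0, 0.8)   # floor normals tilted away from vertical
wall_mask = np.zeros((2, 2, 2), dtype=bool)
floor_mask = np.zeros((2, 2, 2), dtype=bool)
wall_mask[0] = True
floor_mask[1] = True
wall_term, floor_term = wall_floor_penalties(normals, wall_mask, floor_mask)
```

In this toy case the wall term is zero, while each of the four tilted floor voxels contributes 0.6.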

The conventional normal losses (block 13 in Fig. 1), i.e. the loss functions for normals that are traditionally used in reconstruction, are computed from blocks 10 (prediction) and 5 (reference).

The conventional normal losses include:

1. The difference between the predicted normals (gradients of the predicted TSDF) and the reference normals (gradients of the TSDF of the reference scan). Computing this loss requires both predicted and real normals: the predicted normals are computed from the output of the base neural network, and the real normals from the scans available in the training data.

2. The eikonal loss, i.e. the difference of the lengths of the normal vectors from one (since the TSDF corresponds to the distance to the surface, the gradient of this function should ideally have unit length, and any deviation of the length from one is penalized). Computing this loss requires only the predicted normals.

Both of these conventional normal losses are common in the prior art, for example in [MonoSDF: Exploring Monocular Geometric Cues for Neural Implicit Surface Reconstruction, NeurIPS, 2022].
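The two conventional normal losses listed above can be sketched as follows. The exact form of each loss (an L1 difference averaged over voxels, and a squared deviation of the gradient length from one) is an assumption made for illustration; the description specifies only what each loss measures.

```python
import numpy as np

def normal_difference_loss(pred_normals, ref_normals):
    """Mean L1 difference between predicted and reference TSDF gradients."""
    return float(np.abs(pred_normals - ref_normals).sum(axis=-1).mean())

def eikonal_loss(pred_normals):
    """Mean squared deviation of the gradient length from one."""
    lengths = np.linalg.norm(pred_normals, axis=-1)
    return float(((lengths - 1.0) ** 2).mean())

# Reference normals point straight up; predictions are too short by half.
ref = np.tile(np.array([0.0, 0.0, 1.0]), (4, 1))
pred = 0.5 * ref
```

The first loss needs both predicted and reference normals, whereas the eikonal loss is evaluated on the predicted normals alone.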

- The total loss function is computed as the sum of the TSDF loss, the segmentation loss and the normal-dependent losses.

- The gradient of the total loss function with respect to the parameters of the base neural network is computed. The parameters are updated according to this gradient so as to reduce (minimize) the total loss function from one training epoch to the next; that is, backpropagation, known from the prior art, is performed. Minimization is performed to find the parameters of the base neural network for which the predictions are most accurate (the principles of gradient-based optimization used in neural networks are described above). Training consists in selecting the neural network parameters for which the TSDF prediction is as close to the ideal as possible. The parameters of the base neural network are selected (trained) so that there are no irregularities, and the fully trained base neural network directly reconstructs the scene with flat floors and walls. During training, all deviations of walls and floors from the "ideal" (vertical or horizontal) planes are taken into account and penalized (added to the total loss function), regardless of whether the deviation occurs over a large or a small area. Each point segmented as wall or floor contributes to the optimization of the base neural network, and this contribution is equal to the deviation of the normal at that point from the horizontal or vertical direction. Training finds the parameters of the base neural network for which the total loss function is minimal; once these parameters are found, the total loss function itself is no longer used, and only these "optimal" parameters are used.

- Backpropagation updates the parameters of the base neural network so as to minimize the total loss function. The total loss function is reduced after each training epoch. The training steps are repeated until the total loss function stops decreasing or reaches a threshold value preset by the developer. Alternatively, training can run for a fixed number of epochs or, as in the basic versions of the neural network, other stopping criteria can be applied, for example stopping once the total loss function in the next training epoch falls below a specified threshold.
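The stopping criteria mentioned in the training steps above can be sketched as a single check on the history of per-epoch loss values. The function name and all default values (maximum number of epochs, threshold, patience, minimum improvement) are hypothetical and would be chosen by the developer.

```python
def should_stop(loss_history, max_epochs=100, threshold=1e-3,
                patience=5, min_delta=1e-4):
    """Return True when any of the described stopping criteria holds."""
    epoch = len(loss_history)
    if epoch == 0:
        return False
    if epoch >= max_epochs:             # fixed number of epochs reached
        return True
    if loss_history[-1] <= threshold:   # loss reached the preset threshold
        return True
    if epoch > patience:                # loss stopped decreasing
        best_before = min(loss_history[:-patience])
        best_recent = min(loss_history[-patience:])
        if best_before - best_recent < min_delta:
            return True
    return False
```

For example, `should_stop([1.0, 0.5, 0.25])` returns False because the loss is still decreasing, while a history that ends with several identical values triggers the "stopped decreasing" criterion.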

At the operation stage of the trained base neural network, only blocks 1, 2, 3 and 6 are used. The operation of the trained base neural network comprises the following steps: a sequence of RGB frames of the scene is captured by a camera from different viewing angles and camera positions in space. The camera data comprise the position of the camera in space in a fixed coordinate system, given by a three-dimensional vector, and the camera orientation, given by a 3×3 rotation matrix (direction cosine matrix). The 3×3 rotation matrix can be obtained by any method, for example by SfM (e.g. COLMAP, Structure-from-Motion Revisited, CVPR, 2016) or by a SLAM method (e.g. ORB-SLAM: A Versatile and Accurate Monocular SLAM System, IEEE Transactions on Robotics, 2015). The input RGB frames with the corresponding camera data are taken from the camera that captured them or from the memory of the device on which the proposed method is implemented. If the camera is capable of computing its own position (for example, an Android smartphone with the ARCore tracker), these data can be passed directly to the device that performs the computations. Otherwise, the corresponding camera data must first be obtained (possibly on the same device that computes everything else).
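The camera data described above (a three-dimensional position vector and a 3×3 rotation matrix) are often packed into a single 4×4 pose matrix before being passed to a network. The sketch below shows one such packing together with a validity check of the rotation matrix; the 4×4 layout and the function name are assumptions and are not prescribed by the description.

```python
import numpy as np

def camera_pose(rotation, position):
    """Pack a 3x3 rotation matrix and a 3-vector position into a 4x4 pose."""
    rotation = np.asarray(rotation, dtype=float)
    position = np.asarray(position, dtype=float)
    if rotation.shape != (3, 3) or position.shape != (3,):
        raise ValueError("expected a 3x3 rotation and a 3-vector position")
    # A valid rotation matrix is orthonormal with determinant +1.
    orthonormal = np.allclose(rotation @ rotation.T, np.eye(3), atol=1e-6)
    proper = np.isclose(np.linalg.det(rotation), 1.0, atol=1e-6)
    if not (orthonormal and proper):
        raise ValueError("rotation is not a valid rotation matrix")
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = position
    return pose

# Rotation by 90 degrees about the vertical (z) axis, camera raised by 1.5.
R = np.array([[0.0, -1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
pose = camera_pose(R, [0.0, 0.0, 1.5])
```

The check is useful because poses estimated by SfM or SLAM may be stored in different conventions, and a matrix that is not a proper rotation usually indicates a conversion error.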

- The RGB frames and camera data are fed into the skeleton of the trained base neural network as input tensors. The trained base neural network obtains the TSDF volume for the input sequence of RGB frames of the real scene. The 3D surface is defined as the zero-level isosurface of the TSDF and is extracted with known algorithms, for example the marching cubes algorithm (presented in "Marching cubes: A high resolution 3D surface construction algorithm", SIGGRAPH, 1987). Thus, for an RGB video of a scene with a known camera trajectory, the 3D reconstruction of the scene is obtained as a point cloud or a triangle mesh, as described above. The floor and walls in the resulting 3D scene reconstruction are free of local artifacts such as depressions, holes and elevations. The floor and walls are segmented, i.e. each voxel is labeled with the class to which the corresponding point belongs. In the present invention, these classes are "floor", "wall" and "other".

- The 3D scene reconstruction is then rendered to obtain a visualization of the 3D scene, which is displayed to the user.
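The extraction of the zero-level isosurface mentioned in the steps above relies on locating sign changes of the TSDF between neighboring voxels. The sketch below is not the marching cubes algorithm; it is a simplified stand-in that illustrates only the edge-interpolation idea along the z axis and returns a point cloud. The function name and the toy volume are assumptions.

```python
import numpy as np

def zero_crossings_along_z(tsdf, voxel_size):
    """Return (N, 3) surface points where the TSDF changes sign between z-neighbors."""
    a, b = tsdf[:, :, :-1], tsdf[:, :, 1:]
    crossing = (a * b) < 0                          # strict sign change
    ix, iy, iz = np.nonzero(crossing)
    t = a[crossing] / (a[crossing] - b[crossing])   # linear interpolation weight
    return np.stack([ix, iy, iz + t], axis=1) * voxel_size

# Toy volume: a horizontal floor at height z = 0.55, so TSDF = z - 0.55.
voxel_size = 0.1
z = np.arange(8) * voxel_size
tsdf = np.broadcast_to(z - 0.55, (8, 8, 8)).copy()
points = zero_crossings_along_z(tsdf, voxel_size)
```

Marching cubes applies the same interpolation on all twelve edges of each voxel cell and additionally connects the resulting points into triangles, which yields the triangle mesh referred to in the description.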

The proposed base neural network is trained so that the regions corresponding to the floor are flat and horizontal and the regions corresponding to walls are flat and vertical.

Unlike traditional 3D reconstruction methods, which optimize the geometric correctness of scans, the proposed regularization method additionally segments the structural elements of the scene (floor and walls) and optimizes the base neural network so that these elements are flat, the floor surface is horizontal and the walls are vertical.

The auxiliary wall and floor segmentation module (the segmentation head), namely the additional layers added after the skeleton to predict segmentation labels, is trained together with the base neural network, which predicts the truncated signed distance function (TSDF) of the scene. In the regions corresponding to the floor and walls, the directions of the normals to the scene surface are estimated (the normals are estimated for all surfaces present in the scene). In wall regions the vertical component of the normal is penalized, and in floor regions the horizontal component is penalized. This means that:

at each point of the regions in which the segmentation head predicts the class "wall", the vertical component of the normal vector is considered, and the length of this vertical component (i.e. the absolute value of the z-component of the normal vector) is added to the loss function,

and in the regions in which the class "floor" is predicted, the horizontal component (the x- and y-components of the normal vector) is considered, and its length (the norm of the two-dimensional vector formed by these two components) is added to the loss function.
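The two penalty terms just described can be written compactly as follows. The symbols are introduced here for illustration and do not appear in the source: W and F denote the sets of voxels predicted as "wall" and "floor", and n(v) = (n_x(v), n_y(v), n_z(v)) is the gradient of the predicted TSDF in voxel v. The description does not specify any weighting or normalization, so plain sums are assumed.

```latex
\mathcal{L}_{\mathrm{wall}} = \sum_{v \in W} \lvert n_z(v) \rvert,
\qquad
\mathcal{L}_{\mathrm{floor}} = \sum_{v \in F} \sqrt{n_x(v)^2 + n_y(v)^2}
```

Both terms are non-negative and vanish exactly when every wall normal is horizontal and every floor normal is vertical.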

When the loss function contains these terms for the x-, y- and z-components, the parameters of the base neural network are updated so that the terms become smaller. Smaller values of these terms mean that the scene surface is better aligned in the floor and wall regions. As the floor normals become more vertical, the floors themselves become more horizontal, and the same holds for the walls.
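The two penalties described above can be illustrated with a minimal sketch. It assumes that normals are given as plain (n_x, n_y, n_z) tuples with z as the vertical axis and that labels are the strings "wall", "floor" and "other"; the function name `nsr_penalty` and the use of plain sums are illustrative assumptions, not details taken from the description.

```python
import math


def nsr_penalty(normals, labels):
    """Return (wall_term, floor_term) summed over all labeled points.

    normals: list of (nx, ny, nz) tuples, z is assumed to be the vertical axis.
    labels:  list of "wall", "floor" or "other", one per normal.
    """
    wall_term = 0.0
    floor_term = 0.0
    for (nx, ny, nz), label in zip(normals, labels):
        if label == "wall":
            # A wall normal should be horizontal: penalize its vertical part.
            wall_term += abs(nz)
        elif label == "floor":
            # A floor normal should be vertical: penalize the norm of (nx, ny).
            floor_term += math.hypot(nx, ny)
        # Points labeled "other" contribute nothing.
    return wall_term, floor_term


# A perfectly vertical wall, a perfectly horizontal floor, and a tilted floor.
example = nsr_penalty(
    [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.6, 0.0, 0.8)],
    ["wall", "floor", "floor"],
)
```

Only the tilted floor normal contributes here, which matches the intent that correctly oriented surfaces incur no penalty.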

The optimization is carried out by gradient descent methods that are standard for training neural networks, namely the Adam stochastic optimization method [Adam: A Method for Stochastic Optimization, https://arxiv.org/abs/1412.6980]. The optimization aligns the wall and floor regions because it reduces the value of the overall loss function, including the terms that correspond to the wall and floor normals. The smaller these terms, the less the normals deviate from the "correct" directions and, accordingly, the better aligned the floor and wall surfaces become. This prevents reconstruction errors such as rounded walls, depressions, holes and bumps on the surfaces.
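As a toy illustration of how such an optimizer drives a normal term towards zero, the sketch below applies the standard Adam update to a single scalar parameter. The setting is hypothetical: a wall normal tilted by an angle `a` out of the horizontal plane has vertical component sin(a), so the wall penalty is |sin(a)|. The function names, the starting angle and the number of steps are assumptions chosen for the example; a real network has many parameters and a far more complex loss.

```python
import math


def adam_minimize(grad_fn, theta, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Run the Adam update rule on one scalar parameter and return it."""
    m = 0.0  # running mean of gradients
    v = 0.0  # running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def wall_penalty_grad(a):
    """Derivative of |sin(a)| with respect to the tilt angle a (radians)."""
    return math.copysign(1.0, math.sin(a)) * math.cos(a)


# Start with a wall normal tilted 0.3 rad out of the horizontal plane.
final_tilt = adam_minimize(wall_penalty_grad, theta=0.3, lr=0.001, steps=200)
```

After these steps the tilt, and with it the vertical component of the normal, is smaller than at the start, which is the behaviour the paragraph above describes.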

A. New segmentation head (block 4 in Fig. 1)

The segmentation head is a sparse 3D convolutional network based on the sparse 3D U-Net. Since the TSDF prediction models also have a U-Net architecture, the segmentation head branches off from their encoder and consists of two sparse 3D convolutional modules. Each module comprises a sparse 3D convolution followed by batch normalization and two residual 3D convolutional blocks. The proposed segmentation head outputs the 3D segmentation as a spatial map with three channels, each corresponding to one segmentation category: wall, floor, and other.

B. Training

A base neural network that predicts the TSDF volume of a scene is trained. It is trained so that floor regions tend to become horizontal and wall regions vertical. To achieve this, the base neural network is trained to predict segmentation labels together with the TSDF. Once the TSDF and the segmentation labels have been predicted, the surface normals are computed as gradients of the TSDF. The floor and wall regions are then considered. For points segmented as floor, the deviation of the corresponding normals from the vertical direction is added to the loss function. Similarly, for points segmented as wall, the deviation of the corresponding normals from the horizontal direction is added to the loss function. These loss terms aim to make the floor surface more horizontal and the wall surfaces more vertical.

It is proposed to train the 3D segmentation head on the ScanNet dataset, which contains point clouds in which each point is annotated with the label of its segmentation class. The original ScanNet categories floor and carpet are merged into the "floor" category, and wall, window, door and painting are merged into the "wall" category. All points that belong to neither the wall nor the floor category fall into the "other" category. A separate ceiling category is not considered, since the ScanNet scans contain too few ceiling points because of the way the scenes were captured.
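The category merging described above amounts to a simple lookup. The sketch below uses the category names exactly as they appear in this description; the actual label strings or numeric identifiers in the ScanNet annotation files may differ, so the sets and the function name `to_nsr_category` should be read as illustrative assumptions.

```python
# Source categories as named in the description (assumed spellings).
FLOOR_SOURCES = {"floor", "carpet"}
WALL_SOURCES = {"wall", "window", "door", "painting"}


def to_nsr_category(scannet_label):
    """Map an original category name to "floor", "wall" or "other"."""
    name = scannet_label.strip().lower()
    if name in FLOOR_SOURCES:
        return "floor"
    if name in WALL_SOURCES:
        return "wall"
    # Everything else, including ceiling, is treated as "other".
    return "other"


merged = [to_nsr_category(name) for name in ("Carpet", "door", "chair", "ceiling")]
```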

Training is carried out in two stages. In the first stage, the base neural network that predicts the TSDF is trained together with the 3D segmentation head, which learns to divide points into three categories (floor, walls, and other) guided by the segmentation loss. The scene geometry is not penalized during the first epochs, so that early erroneous estimates of the segmentation classes do not disrupt training: in the initial stages, the still untrained base neural network will most likely predict membership in the classes "floor", "walls" and "other" incorrectly.

During the first epochs of training, the proposed penalty is not applied to the floor and wall normals. This means that in these epochs the terms corresponding to the vertical/horizontal components of the wall/floor normals are not added to the overall loss function. The neural network is mainly trained to reconstruct the full scene (using the TSDF head) and to identify walls, floor and other (using the segmentation head).

Once the base neural network is able to determine where the walls and floor are, the terms for the normals are added to the overall loss function, and the overall loss function is optimized. The TSDF and segmentation losses continue to take part in the optimization.
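The two-stage schedule can be sketched as a loss that switches on the normal terms after a warm-up period. The number of warm-up epochs and the weight of the normal terms are not stated in the description, so `warmup_epochs` and `nsr_weight` below are hypothetical parameters, and the individual loss values are assumed to be computed elsewhere.

```python
def total_loss(epoch, tsdf_loss, seg_loss, wall_term, floor_term,
               warmup_epochs=5, nsr_weight=1.0):
    """Combine the loss terms for a given epoch (epochs counted from 0).

    warmup_epochs and nsr_weight are illustrative, not values from the source.
    """
    # The TSDF and segmentation losses take part in every epoch.
    loss = tsdf_loss + seg_loss
    if epoch >= warmup_epochs:
        # Second stage: the wall and floor normal terms are added on top.
        loss += nsr_weight * (wall_term + floor_term)
    return loss


first_stage = total_loss(0, tsdf_loss=1.0, seg_loss=0.5, wall_term=0.2, floor_term=0.3)
second_stage = total_loss(5, tsdf_loss=1.0, seg_loss=0.5, wall_term=0.2, floor_term=0.3)
```

With these inputs the first-stage value contains only the TSDF and segmentation losses, while the second-stage value also includes the normal terms.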

In general, the term "penalization" means that a corresponding term is added to the overall loss function. Not adding the terms for the normals at the beginning of training is a matter of common sense: while the base neural network is not yet trained, it will most likely identify floors and walls incorrectly, so the wrong normals would be corrected. Nothing happens to the penalized data itself. The difference between the predictions and the "ideal" is "penalized", which means that this difference is added to the overall loss function being minimized. In subsequent epochs, the losses on the surface normals are added, and the process continues with joint prediction of the scene geometry and segmentation (blocks 6, 7 in Fig. 1). That is, the terms corresponding to the vertical/horizontal components of the wall/floor normals are added to the overall loss function, and the optimization continues. The surface normals are derived from the estimated TSDF representation: being first-order derivatives of the surface function, they are easily computed with a single convolution. Normal estimation is implemented as a dedicated non-trainable 3D convolutional layer, so this operation can easily be incorporated into a trainable model.
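One common way to obtain first-order derivatives with a fixed convolution is a central-difference kernel applied along each axis. The description does not state which kernel is used, so the sketch below is an assumption: it evaluates central differences on a dense nested-list grid rather than on a sparse volume, ignores truncation, and the names `tsdf_gradient` and `voxel_size` are illustrative.

```python
def tsdf_gradient(tsdf, i, j, k, voxel_size=1.0):
    """Central-difference gradient of a dense TSDF grid at interior voxel (i, j, k).

    tsdf is indexed as tsdf[x][y][z]. Each component is what a fixed 1D
    kernel [-1/2, 0, +1/2] / voxel_size would produce along that axis.
    """
    step = 2.0 * voxel_size
    gx = (tsdf[i + 1][j][k] - tsdf[i - 1][j][k]) / step
    gy = (tsdf[i][j + 1][k] - tsdf[i][j - 1][k]) / step
    gz = (tsdf[i][j][k + 1] - tsdf[i][j][k - 1]) / step
    return gx, gy, gz


# Synthetic example: signed distance to a horizontal floor at height 0.06 m,
# sampled on a 4x4x4 grid with a hypothetical 0.04 m voxel size.
voxel = 0.04
grid = [[[voxel * k - 0.06 for k in range(4)] for _ in range(4)] for _ in range(4)]
floor_normal = tsdf_gradient(grid, 1, 1, 1, voxel_size=voxel)
```

For this synthetic floor the gradient points straight up with unit length, so both the floor penalty and a unit-norm constraint on the normal would be zero.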

C. Losses

Two groups of losses that depend on the surface normals are used. The first group consists of conventional normal losses. Reference normals are extracted from the reference TSDF representation, and the discrepancy between the predicted and reference normals is penalized (that is, the difference between prediction and reference is added as a term to the loss function being minimized) using both the cosine distance (a distance inversely related to the normalized dot product of the vectors) and the Euclidean distance. These reference-based losses are complemented by a reference-free eikonal loss that regularizes the L2 norm of the normal, forcing it towards 1. Reference data is used only during training. After training, the base neural network can be applied to data for which no reference is available.
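For orientation, the conventional normal losses described above could be written as follows. This is one possible formalization, not the formulas of the description: the symbols (predicted normal n̂, reference normal n, point set P), the "one minus cosine" form of the cosine distance, the squared form of the eikonal term and the use of plain sums without weights are all assumptions.

```latex
% \hat{n}(p): predicted normal at point p;  n(p): reference normal;  P: set of points
L_{\cos} = \sum_{p \in P} \left( 1 - \frac{\hat{n}(p) \cdot n(p)}{\lVert \hat{n}(p) \rVert_2 \, \lVert n(p) \rVert_2} \right),
\qquad
L_{\mathrm{euc}} = \sum_{p \in P} \lVert \hat{n}(p) - n(p) \rVert_2,
\qquad
L_{\mathrm{eik}} = \sum_{p \in P} \left( \lVert \hat{n}(p) \rVert_2 - 1 \right)^2
```

The first two terms need reference normals, whereas the eikonal term depends only on the prediction, which is why it is called reference-free.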

In addition to the conventional losses, a normal-segmentation regularization is introduced, which takes into account both the surface normals and the 3D segmentation (the annotation of each 3D point with a meta-class label). The aim is to encourage walls to be vertical and the floor to be horizontal. For convenience, the set of 3D points classified as floor is denoted F, and the set of points classified as walls is denoted W.

For the floor, the non-vertical components n_x, n_y of the normal are regularized at each point of F:

For the walls, an L1 loss is applied to the vertical component n_z at each point of W:
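Based on the verbal definitions given earlier (the norm of the two-dimensional vector of horizontal components for the floor, and the magnitude of the vertical component for the walls), the two terms can be written in the following form. This is a reconstruction under assumptions: the original formulas are not reproduced in this text, and the normalization and weighting of the sums are not stated.

```latex
% F: points classified as floor;  W: points classified as wall
% n(p) = (n_x(p), n_y(p), n_z(p)): predicted surface normal at point p
L_{\mathrm{floor}} = \sum_{p \in F} \sqrt{ n_x(p)^2 + n_y(p)^2 },
\qquad
L_{\mathrm{wall}} = \sum_{p \in W} \lvert n_z(p) \rvert
```

Both terms vanish exactly when every floor normal is vertical and every wall normal is horizontal.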

An ablation study examined how each loss affects the reconstruction quality and showed that the losses complement each other, each addressing different aspects of the reconstruction.

D. Inference

During inference, 3D segmentation is not required. The 3D segmentation annotation therefore serves training purposes only: guided by the additional segmentation information, the method learns to predict the TSDF more accurately.

Experiments

A. Baseline methods

Several state-of-the-art TSDF reconstruction methods were considered as baselines. Methods based on different operating principles were selected to demonstrate that the proposed modification is applicable to a wide range of multi-view fusion approaches. Many different methods for reconstructing the TSDF of a scene captured from different viewpoints are known in the art, and the regularization proposed in the invention can be applied to any of them, regardless of the principles on which they are based. To confirm this, several existing methods based on different principles were considered (described below).

In particular, the transformer-based VoRTX [1] and the averaging-based Atlas [2] were used (blocks 2 and 3 in Fig. 1: the skeleton and the TSDF head).

Atlas [2] fuses features into a single feature volume. It skips depth estimation and performs 3D reconstruction by directly predicting the TSDF volume. This scheme allows the input images to be considered jointly and efficiently. Moreover, Atlas has a mechanism that allows unobserved regions of the scene to be reconstructed using 3D priors.

VoRTX [1] is an end-to-end volumetric 3D reconstruction method that fuses multiple views using transformers.

VoRTX preserves fine details by making the fusion depend on the camera pose, and it handles occlusions by estimating an initial scene geometry so that image features are not projected into occluded regions. Occluded regions arise when one object hides another from the camera viewpoint, in which case some regions may not be visible in the video frames. Together, these techniques yield the highest reconstruction quality.

B. Datasets

The proposed method was trained on ScanNet [8], which contains 1613 indoor scenes with reference camera poses, 3D reconstructions and segmentation labels. In total it contains 2.5 million RGB-D frames captured in 707 distinct spaces. The standard splits were adopted, and results are reported for the test subset. The transfer of base neural networks trained on ScanNet to other datasets was also evaluated: TUM RGB-D, a long-established RGB-D SLAM benchmark; ICL-NUIM, a small RGB-D reconstruction benchmark with eight synthetically rendered scenes; and 7-Scenes [9], a small but challenging indoor RGB-D dataset containing 7 real indoor spaces.

C. Evaluation protocol

For each frame, a reference depth map was rendered from the reference mesh for the corresponding camera pose. Similarly, an estimated depth map was rendered from the estimated mesh and masked in the areas where the reference depth is invalid. All masked estimated depth maps were then integrated into a single TSDF volume.

In all experiments, reconstruction quality was assessed using the standard reference-based reconstruction metrics: accuracy, completeness, precision, recall, and F-score with a threshold of 5 cm [2]. The mesh obtained from the masked estimated depths was compared with the reference mesh.

Table I: Reconstruction quality of the baseline and modified methods on ScanNet. The best values are shown in bold.

Acc (accuracy): the average distance from the points of the predicted 3D model to the nearest points of the reference model.

Comp (completeness): the average distance from the points of the reference 3D model to the nearest points of the predicted model.

Prec (precision): the percentage of points of the predicted 3D model whose distance to the nearest point of the reference model is less than 5 cm.

Recall: the percentage of points of the reference model whose distance to the nearest point of the predicted 3D model is less than 5 cm.

F-score: the harmonic mean of precision and recall.

The arrows indicate which direction is better (down arrow: lower is better; up arrow: higher is better).
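The metric definitions above translate directly into code. The following sketch assumes that both models are given as lists of 3D points in metres and uses a brute-force nearest-neighbour search, which is adequate only for small point sets; the function names are illustrative. Precision and recall are returned as fractions in [0, 1] rather than percentages.

```python
import math


def _nearest_distances(src, dst):
    """For every point in src, the distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def reconstruction_metrics(pred_points, ref_points, threshold=0.05):
    """Compute Acc, Comp, Prec, Recall and F-score for two point sets."""
    d_pred = _nearest_distances(pred_points, ref_points)  # prediction -> reference
    d_ref = _nearest_distances(ref_points, pred_points)   # reference -> prediction
    acc = sum(d_pred) / len(d_pred)
    comp = sum(d_ref) / len(d_ref)
    prec = sum(d < threshold for d in d_pred) / len(d_pred)
    recall = sum(d < threshold for d in d_ref) / len(d_ref)
    # Harmonic mean of precision and recall, defined as 0 when both are 0.
    f_score = 2 * prec * recall / (prec + recall) if prec + recall > 0 else 0.0
    return {"acc": acc, "comp": comp, "prec": prec, "recall": recall, "f_score": f_score}


metrics = reconstruction_metrics(
    pred_points=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    ref_points=[(0.0, 0.0, 0.01), (1.0, 0.0, 0.2)],
)
```

In this example one of the two predicted points lies within 5 cm of the reference and the other is 20 cm away, so precision, recall and F-score are all one half.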

Conclusions: for both methods (Atlas, VoRTX), adding the NSR regularization improves quality on all metrics.

Table II: Reconstruction quality of the baseline VoRTX and of the modified method, both trained on ScanNet, on the ICL-NUIM, TUM RGB-D and 7Scenes datasets. The best values for each dataset are shown in bold.

Conclusions: when a base neural network trained on one dataset is transferred to other data, adding the NSR regularization also improves quality on almost all metrics (tests were conducted on three datasets: ICL-NUIM, TUM RGB-D, 7Scenes).

D. Inference

During inference, K=60 key frames were selected arbitrarily, subject to the condition that for each pair of consecutive key frames the relative rotation is at least 15° and the translation is at least 10 cm. These key frames were used to estimate the TSDF of the entire scene; the resulting scene mesh was obtained from the TSDF using the marching cubes algorithm.
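The selection criterion can be sketched as a greedy pass over the frame sequence. The description says only that key frames are chosen arbitrarily subject to the rotation and translation thresholds, so the greedy strategy, the pose representation (a 3x3 rotation matrix as nested lists plus a translation in metres) and all function names below are assumptions made for illustration.

```python
import math


def rotation_angle_deg(rot_a, rot_b):
    """Angle in degrees of the relative rotation between two 3x3 rotation matrices."""
    # trace(rot_a^T rot_b) equals the sum of element-wise products.
    trace = sum(rot_a[r][c] * rot_b[r][c] for r in range(3) for c in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def select_keyframes(poses, max_keyframes=60, min_rot_deg=15.0, min_trans_m=0.10):
    """Greedily pick frame indices; poses is a list of (rotation, translation)."""
    if not poses:
        return []
    selected = [0]  # always keep the first frame
    for idx in range(1, len(poses)):
        if len(selected) >= max_keyframes:
            break
        rot_last, t_last = poses[selected[-1]]
        rot, t = poses[idx]
        # Both thresholds must be met relative to the last key frame.
        if (rotation_angle_deg(rot_last, rot) >= min_rot_deg
                and math.dist(t_last, t) >= min_trans_m):
            selected.append(idx)
    return selected


def rot_z(deg):
    """Rotation by deg degrees about the z axis (helper for the example)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


example_poses = [
    (rot_z(0), (0.0, 0.0, 0.0)),
    (rot_z(10), (0.2, 0.0, 0.0)),   # rotation too small
    (rot_z(20), (0.05, 0.0, 0.0)),  # translation too small
    (rot_z(20), (0.2, 0.0, 0.0)),
    (rot_z(30), (0.25, 0.0, 0.0)),  # too close to the previous key frame
    (rot_z(40), (0.5, 0.0, 0.0)),
]
keyframes = select_keyframes(example_poses)
```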

Key frames were selected at the very beginning, before any data was fed into the neural network. The purpose is to reduce the time and memory cost of the algorithm: only key frames are processed rather than every frame, because frames that are very close to each other (captured from nearly the same camera pose) contain duplicated, redundant information.

The marching cubes algorithm was applied at the final stage to the output of the base neural network: the base neural network returns a TSDF for the entire scene, and the marching cubes algorithm builds a point cloud or mesh from this TSDF.

E. Implementation details

The proposed segmentation head (block 4 in Fig. 1) was trained from scratch for the same number of epochs as the base neural network. The Adam optimizer was used with the standard initial learning rate of 0.001.

The proposed method was implemented in PyTorch, and the proposed network was trained on a single NVIDIA Tesla P40 GPU.

The intrinsic and extrinsic camera parameters provided with the datasets [8-11] were used and were adjusted together with the image scaling. Following [5], the camera poses were initialized to identity, and the initial SDF was represented as a unit sphere with surface normals pointing inward.

V. Results

A. Comparison with the state of the art

The reconstruction metrics on ScanNet are summarized in Table I, and the quality of transfer to the ICL-NUIM, TUM RGB-D and 7Scenes benchmarks is shown in Table II.

Table III shows the reconstruction quality obtained on ScanNet with different combinations of losses. "C" denotes the conventional losses, "N" the normal losses, and "NS" the normal-segmentation losses.

Conclusions: adding individual terms to the overall loss function improves the quality of the 3D model, and the best results are obtained when all the modifications proposed in the invention are added (the loss functions for both normals and segmentation, together with the conventional normal losses).

Fig. 2 visualizes the 3D reconstructions obtained with the baseline VoRTX (a) and with VoRTX+NSR (b). The table shows that adding NSR to the baseline method noticeably improves quality. Problem areas are enlarged to highlight the benefits of NSR. As can be seen, the proposed modification fills holes in flat surfaces, whereas the baseline produces an incomplete reconstruction with missing points in these areas. Moreover, unlike the baseline, NSR encourages walls and floors to be even and flat.

The effect of each NSR component on reconstruction quality was analyzed. Unless stated otherwise, all ablation studies used VoRTX [1] as the baseline and the same training and evaluation protocols as described above. First, the quality of the estimated segmentation was investigated by using reference segmentation annotations instead of predicted ones.

Then, base neural networks trained with different combinations of the conventional normal losses and the normal-segmentation losses were examined to verify that the proposed losses improve reconstruction quality.

Thus, NSR is proposed: a modification of scene reconstruction methods that takes the typical structure of a scene into account. In particular, the proposed modification identifies walls and floors in a point cloud and penalizes the corresponding surface normals for deviating from the horizontal and vertical directions, respectively. Implemented as a sparse 3D convolutional module, NSR can be included in an arbitrary trainable model that outputs a TSDF and can be trained end-to-end on point clouds with 3D segmentation annotations. No 3D segmentation is required during inference, so the use of NSR does not impose any restrictions on user scenarios. The proposed modification was applied to several state-of-the-art TSDF reconstruction methods and demonstrated a significant performance gain on standard datasets: ScanNet, ICL-NUIM and TUM RGB-D.

The illustrative embodiments described above are examples and should not be considered limiting. Furthermore, the description of these embodiments is intended to illustrate, not to limit, the scope of the claims, and many alternatives, modifications and variations will be apparent to those skilled in the art.

REFERENCES

[1] N. Stier, A. Rich, P. Sen, and T. Höllerer, "VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion," in 3DV, 2021, pp. 320-330.

[2] Z. Murez, T. van As, J. Bartolozzi, A. Sinha, V. Badrinarayanan, and A. Rabinovich, "Atlas: End-to-end 3D scene reconstruction from posed images," in ECCV, Glasgow, UK, 2020, pp. 414-431.

[3] J. Sun, Y. Xie, L. Chen, X. Zhou, and H. Bao, "NeuralRecon: Real-time coherent 3D reconstruction from monocular video," in CVPR, 2021, pp. 15598-15607.

[4] A. Božič, P. Palafox, J. Thies, A. Dai, and M. Nießner, "TransformerFusion: Monocular RGB scene reconstruction using transformers," in NeurIPS, 2021.

[5] H. Guo, S. Peng, H. Lin, Q. Wang, G. Zhang, H. Bao, and X. Zhou, "Neural 3D Scene Reconstruction with the Manhattan-world Assumption," in CVPR, 2022.

[6] L. Yariv, J. Gu, Y. Kasten, and Y. Lipman, "Volume rendering of neural implicit surfaces," in NeurIPS, 2021.

[7] P. Wang, L. Liu, Y. Liu, Ch. Theobalt, T. Komura, and W. Wang, "NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction," in NeurIPS, 2021.

[8] A. Dai, A. X. Chang, M. Savva, M. Halber, Th. Funkhouser, and M. Nießner, "ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes," in CVPR, 2017.

[9] J. Shotton, B. Glocker, Ch. Zach, Sh. Izadi, A. Criminisi, and A. Fitzgibbon, "Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images," in CVPR, 2013.

[10] A. Handa, T. Whelan, J. B. McDonald, and A. J. Davison, "A Benchmark for RGB-D Visual Odometry, 3D Reconstruction and SLAM," in ICRA, Hong Kong, China, 2014.

[11] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A Benchmark for the Evaluation of RGB-D SLAM Systems," in IROS, 2012.

[12] J. Wang, P. Wang, X. Long, C. Theobalt, T. Komura, L. Liu, and W. Wang, "NeuRIS: Neural Reconstruction of Indoor Scenes Using Normal Priors," in ECCV, 2022.

[13] A. Avetisyan, M. Dahnert, A. Dai, M. Savva, A. X. Chang, and M. Nießner, "Scan2CAD: Learning CAD Model Alignment in RGB-D Scans," in CVPR, 2019.

[14] A. Avetisyan, A. Dai, and M. Nießner, "End-to-End CAD Model Retrieval and 9DoF Alignment in 3D Scans," in ICCV, 2019.

[15] M. Dahnert, A. Dai, L. Guibas, and M. Nießner, "Joint Embedding of 3D Scan and CAD Objects," in ICCV, 2019.

[16] Sh. Hampali, S. Stekovic, S. Deb Sarkar, Ch. S. Kumar, F. Fraundorfer, and V. Lepetit, "Monte Carlo Scene Search for 3D Scene Understanding," in CVPR, 2021.

[17] S. Ainetter, S. Stekovic, F. Fraundorfer, and V. Lepetit, "Automatically Annotating Indoor Images With CAD Models via RGB-D Scans," in WACV, 2023, pp. 3156-3164.

[18] Y. Wang, Z. Li, Y. Jiang, K. Zhou, T. Cao, Y. Fu, and C. Xiao, "NeuralRoom: Geometry-Constrained Neural Implicit Surfaces for Indoor Scene Reconstruction," in ACM Transactions on Graphics (TOG), 2022.

[19] Sh. Zhi, M. Bloesch, S. Leutenegger, and A. J. Davison, "SceneCode: Monocular Dense Semantic Reconstruction Using Learned Encoded Scene Representations," in CVPR, 2019, pp. 11768-11777.

Claims

1. A method for reconstructing a 3D scene and visualizing it using a neural network consisting of a base neural network that includes a backbone, which is the main part of the base neural network and computes features, a TSDF (truncated signed distance function) head, which predicts a TSDF value in each voxel, and a segmentation head, which is a sparse 3D convolutional segmentation module that predicts a segmentation label in each voxel, the method comprising the steps of:

training the base neural network to obtain a TSDF volume for scene voxels as follows:

- inputting training data, including a training sequence of RGB frames with corresponding camera position data, into the backbone;

- calculating a TSDF loss between the TSDF prediction obtained by the TSDF head from the backbone output and the reference scan;

- obtaining, using the segmentation head, from the backbone output a segmentation prediction that assigns each voxel to one of the regions "floor", "wall" or "other";

- calculating a segmentation loss between the segmentation prediction and the segmentation reference;

- calculating coordinates of normals as gradients of the TSDF prediction in each voxel, over all voxels of the scene;

- calculating a normal loss for the normals in the wall regions and the floor regions based on the TSDF reference and the gradients of the TSDF prediction;

- calculating an overall loss function as the sum of the TSDF loss, the segmentation loss and the normal loss;

- minimizing the overall loss function;

using the trained base neural network to obtain a TSDF volume from an input sequence of RGB frames of a real scene;

applying to the TSDF volume an algorithm that computes a reconstruction of the 3D scene;

rendering the reconstruction of the 3D scene to obtain a visualization of the 3D scene.

2. The method according to claim 1, wherein the algorithm that calculates the reconstruction of the 3D scene is the marching cubes algorithm.

3. The method according to claim 1 or 2, wherein the segmentation head and the TSDF head are used in parallel during training.

4. The method according to any one of claims 1-3, further comprising, after the step of calculating the coordinates of the normals, the steps of:

selecting the normals in the wall regions and calculating their vertical components;

selecting the normals in the floor regions and calculating their horizontal components.

5. The method according to any one of claims 1-4, wherein the step of calculating the regular normal loss comprises:

at each point in the regions where the segmentation head predicts "wall", considering the vertical component of the normal vector and adding the magnitude of the z-component of the normal vector to the regular normal loss,

and at each point in the regions where the segmentation head predicts "floor", considering the horizontal components of the normal vector, namely its x- and y-components, and adding the norm of the two-dimensional vector formed by these two components to the regular normal loss.

6. The method according to any one of claims 1-5, wherein the overall loss function is calculated as the sum of the TSDF loss, the segmentation loss and the regular normal loss.

7. The method according to any one of claims 1-6, wherein the overall loss function is minimized by calculating the gradient of the overall loss function with respect to all parameters of the base neural network.

8. The method according to any one of claims 1-7, further comprising backpropagating the error of the overall loss function and updating the parameters of the base neural network in accordance with the minimized loss function.

9. The method according to any one of claims 1-8, in which the training steps are repeated until the overall loss function no longer decreases.

10. The method according to any one of claims 1-8, in which the training steps are repeated until the overall loss function reaches a given threshold value.

11. A computing device comprising a processor and a memory storing instructions that, when executed by the processor, implement the method according to claim 1.