KR102297103B1

KR102297103B1 - Method and apparatus for generating 3d scene graph

Info

Publication number: KR102297103B1
Application number: KR1020200020163A
Authority: KR
Inventors: 김종환; 김의환; 박진만; 송택진
Original assignee: 한국과학기술원
Priority date: 2020-02-19
Filing date: 2020-02-19
Publication date: 2021-09-03
Also published as: KR102297103B9; KR20210105550A

Abstract

The present invention relates to a method and apparatus for generating a three-dimensional (3D) scene graph. A method for generating a 3D scene graph according to an embodiment of the present invention may include: (a) performing pre-processing on a series of input image frames; (b) removing unnecessary image frames from among the pre-processed image frames; (c) classifying the class of each pre-processed image frame; (d) extracting a keyframe group based on the classification result; (e) detecting and removing erroneous relationships in the keyframe group; and (f) generating a 3D scene graph based on the frames from which the erroneous relationships have been removed.

Description

Method and apparatus for generating a three-dimensional scene graph {METHOD AND APPARATUS FOR GENERATING 3D SCENE GRAPH}

The present invention relates to three-dimensional (3D) scene graph generation and, more particularly, to a technology for generating a 3D scene graph that allows a target environment to be recognized quickly and accurately by minimizing the memory usage, memory occupancy, and amount of computation required to build a 3D environment model.

A comprehensive and semantic understanding of a scene is important in many applications.

In particular, accurate environment recognition plays a key role in the development of intelligent services.

An intelligent agent, such as a robot, gathers information from its surrounding environment and recognizes its meaning before performing a given task.

Based on the collected information, the intelligent agent converts the surrounding environment into an environment model with a simple structure and stores it.

A 3D environment model can be built by extracting various kinds of semantic information about a specific space (for example, objects, scene categories, material types, and 3D shapes) and analyzing its structure.

Existing environment models, such as the 3D point cloud and the surfel model, require a large amount of computation and memory to build, and often contain only one of semantic information and physical information.

Therefore, in building a 3D environment model, there is a need for an environment recognition model that requires little computation and memory, contains rich and varied environment information, and offers high usability and scalability.

An embodiment of the present invention aims to provide a method and apparatus for generating a 3D scene graph.

Another embodiment of the present invention aims to provide a method and apparatus for generating a 3D scene graph capable of providing a multi-purpose 3D environment model that requires little computation and memory usage/occupancy, contains rich and varied environment information, and has wide applicability, intuitive usability, and high scalability.

Yet another embodiment of the present invention aims to provide a method and apparatus for generating a 3D scene graph capable of providing a 3D environment model that supplies physical information and semantic information accurately and richly through nodes (vertices) and lines (edges).

Yet another embodiment of the present invention aims to provide a 3D scene graph that allows a robot to recognize its surrounding environment quickly, accurately, and concisely.

Yet another embodiment of the present invention aims to provide a 3D scene graph that allows visually impaired people to recognize their surrounding environment more accurately.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the following description.

A method for generating a 3D scene graph according to an embodiment of the present invention may include: (a) performing pre-processing on a series of input image frames; (b) removing unnecessary image frames from among the pre-processed image frames; (c) classifying the class of each pre-processed image frame; (d) extracting a keyframe group based on the classification result; (e) detecting and removing erroneous relationships in the keyframe group; and (f) generating a 3D scene graph based on the frames from which the erroneous relationships have been removed.

In an embodiment, step (b) may include: calculating a blurriness V by computing the variance of the Laplacian; calculating an exponential moving average St of V; performing bias correction on St; determining a threshold value based on the bias-corrected exponential moving average; and determining an image frame to be removed by comparing V with the threshold value.
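The five sub-steps of step (b) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the 4-neighbour Laplacian kernel, the smoothing factor `beta`, the rule `threshold = ratio * corrected EMA`, and the choice to update the average before comparing are all assumptions, and the names `laplacian_variance` and `BlurFilter` are hypothetical.

```python
from typing import Sequence


def laplacian_variance(gray: Sequence[Sequence[float]]) -> float:
    """Blurriness V: variance of the 4-neighbour discrete Laplacian (interior pixels only)."""
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


class BlurFilter:
    """Tracks a bias-corrected exponential moving average of V and flags blurry frames."""

    def __init__(self, beta: float = 0.9, ratio: float = 0.5) -> None:
        self.beta = beta    # EMA smoothing factor (assumed value)
        self.ratio = ratio  # threshold = ratio * corrected EMA (assumed rule)
        self.s = 0.0        # running average St, initialised to zero
        self.t = 0          # number of frames seen so far

    def is_blurry(self, v: float) -> bool:
        self.t += 1
        # Exponential moving average of the blurriness values.
        self.s = self.beta * self.s + (1.0 - self.beta) * v
        # Bias correction compensates for the zero initialisation of St.
        s_hat = self.s / (1.0 - self.beta ** self.t)
        # A low Laplacian variance means few sharp edges, i.e. a blurry frame.
        return v < self.ratio * s_hat
```

Because the threshold follows the running average, the filter adapts to scenes that are uniformly low in texture instead of rejecting every frame against a fixed cut-off.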

In an embodiment, the classes include a keyframe, an anchor frame, and a garbage frame. The keyframe is used as a reference frame to determine the first anchor frame and the coverage of the keyframe group. The anchor frame may be activated or deactivated, and the activated anchor frame is used to determine the next anchor frame. The garbage frame is a frame to be discarded.

In an embodiment, step (c) may include: calculating a first value, which is the overlap ratio between the current keyframe and a newly input image frame; and, if the first value is less than or equal to a first reference value, determining the newly input image frame to be the next keyframe.

In an embodiment, step (c) may further include: if the first value exceeds the first reference value, calculating a second value, which is the overlap ratio between the currently activated anchor frame and the newly input image frame; if the second value is less than or equal to a second reference value, determining the newly input image frame to be a new anchor frame; and, if the second value exceeds the second reference value, classifying the newly input image frame as a garbage frame and discarding it.
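The two-stage comparison in step (c) reduces to a small decision function, sketched below. The overlap ratios are taken as inputs because this passage does not specify how they are computed, and the default reference values (0.3 and 0.7) are placeholders, not values from the patent. The names `FrameClass` and `classify_frame` are hypothetical.

```python
from enum import Enum


class FrameClass(Enum):
    KEYFRAME = "keyframe"
    ANCHOR = "anchor frame"
    GARBAGE = "garbage frame"


def classify_frame(
    overlap_with_keyframe: float,      # first value: overlap with the current keyframe
    overlap_with_anchor: float,        # second value: overlap with the active anchor frame
    first_reference: float = 0.3,      # placeholder reference value
    second_reference: float = 0.7,     # placeholder reference value
) -> FrameClass:
    # Little overlap with the current keyframe: the view has changed enough
    # that the new frame becomes the next keyframe.
    if overlap_with_keyframe <= first_reference:
        return FrameClass.KEYFRAME
    # Still inside the keyframe's coverage, but sufficiently different from
    # the active anchor frame: keep it as a new anchor frame.
    if overlap_with_anchor <= second_reference:
        return FrameClass.ANCHOR
    # Nearly identical to the active anchor frame: redundant, so discard it.
    return FrameClass.GARBAGE
```

The second overlap is only consulted when the first test fails, which matches the order of the conditions in the text.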

In an embodiment, the method for generating a 3D scene graph may further include extracting poses between step (b) and step (c), wherein the relative poses between neighboring image frames may be extracted using visual odometry or simultaneous localization and mapping (SLAM).

In an embodiment, the extracted relative poses may be used to extract the physical attributes of objects recognized in the keyframe group.

In an embodiment, step (e) may include: extracting object regions corresponding to the keyframes included in the extracted keyframe group; identifying valid object regions by removing duplicate object regions from among the extracted object regions; identifying valid objects by removing predefined irrelevant objects from among the objects recognized in the identified object regions; and removing unnecessary object relationships by filtering the object relationships between the identified objects.
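One way to read the three filtering stages of step (e) is sketched below. The passage does not say how duplicates are detected or how relationships are filtered, so the greedy intersection-over-union (IoU) suppression, the score cut-off, and every name and default value here are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass(frozen=True)
class Detection:
    box: Box
    label: str
    score: float


@dataclass(frozen=True)
class Relation:
    subject: Detection
    predicate: str
    obj: Detection
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def remove_duplicate_regions(dets: Iterable[Detection], iou_threshold: float = 0.5) -> List[Detection]:
    """Greedy suppression: keep the highest-scoring region among heavily overlapping ones."""
    kept: List[Detection] = []
    for d in sorted(dets, key=lambda d: d.score, reverse=True):
        if all(iou(d.box, k.box) <= iou_threshold for k in kept):
            kept.append(d)
    return kept


def remove_irrelevant_objects(dets: Iterable[Detection], irrelevant: Set[str]) -> List[Detection]:
    """Drop objects whose label is on a predefined list of irrelevant categories."""
    return [d for d in dets if d.label not in irrelevant]


def filter_relations(rels: Iterable[Relation], valid: Iterable[Detection], min_score: float = 0.5) -> List[Relation]:
    """Keep relations between two distinct valid objects with a sufficient score."""
    valid_set = set(valid)
    return [
        r for r in rels
        if r.subject in valid_set and r.obj in valid_set
        and r.subject != r.obj and r.score >= min_score
    ]
```

Running the stages in this order means a relationship is only evaluated once both of its endpoints are known to be valid objects.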

In an embodiment, step (f) may include: generating a local 3D scene graph by extracting partial object attributes and relationships between object pairs from the input image frames to form a two-dimensional version of the scene graph and then converting the two-dimensional attributes into three-dimensional attributes; and updating the global 3D scene graph by merging the local 3D scene graph into the current global 3D scene graph.

In an embodiment, updating the global 3D scene graph may include detecting identical nodes based on the label similarity, position similarity, and color similarity between an original node of the global 3D scene graph and a candidate node from a newly input image frame.
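The same-node test can be illustrated with a small scoring function. The passage names the three similarities but not their definitions or how they are combined, so the exact-match label test, the exponential distance decay, the histogram intersection, the equal weighting, and the threshold below are all assumptions, and the names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class NodeCandidate:
    label: str
    position: Tuple[float, float, float]  # 3D position, e.g. object centre in metres
    color_hist: Sequence[float]           # normalised colour histogram (sums to 1)


def label_similarity(a: NodeCandidate, b: NodeCandidate) -> float:
    return 1.0 if a.label == b.label else 0.0


def position_similarity(a: NodeCandidate, b: NodeCandidate, scale: float = 0.5) -> float:
    # Decays from 1 towards 0 as the Euclidean distance grows.
    return math.exp(-math.dist(a.position, b.position) / scale)


def color_similarity(a: NodeCandidate, b: NodeCandidate) -> float:
    # Histogram intersection: 1 for identical histograms, 0 for disjoint ones.
    return sum(min(p, q) for p, q in zip(a.color_hist, b.color_hist))


def find_same_node(
    candidate: NodeCandidate,
    originals: Sequence[NodeCandidate],
    threshold: float = 0.6,
) -> Optional[int]:
    """Return the index of the best-matching original node, or None if none matches."""
    best_index, best_score = None, threshold
    for i, node in enumerate(originals):
        score = label_similarity(candidate, node) * (
            0.5 * position_similarity(candidate, node) + 0.5 * color_similarity(candidate, node)
        )
        if score > best_score:
            best_index, best_score = i, score
    return best_index
```

A candidate that matches an original node would be merged into it, while a candidate with no match would be added to the global graph as a new node.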

In an embodiment, the global 3D scene graph may include nodes for representing recognized objects and edges for representing relationships between pairs of objects.

In an embodiment, the node includes an identification number, which is a number uniquely assigned to each object recognized in the environment, and a semantic label indicating the category of the object. The edge may include at least one of: an action relationship, showing the behavior of one object toward another; a spatial relationship, showing spatial relations such as distance and relative position; a descriptive relationship, showing the state of one object in relation to another; a prepositional relationship, showing the semantic relationship between two objects as expressed by a preposition; and a comparative relationship, showing a relative attribute of one object compared with another.

In an embodiment, the master-slave (subject-object) relationship between a pair of objects may be indicated by the direction of an arrow on the global 3D scene graph.

In an embodiment, the node may further include physical attributes and visual attributes of the corresponding object.
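The node and edge contents described above map naturally onto a small directed-graph data structure. The sketch below is an assumed layout for illustration only: the class and field names are hypothetical, the attribute dictionaries are left free-form, and the edge direction (subject to object) stands in for the arrow that indicates the master-slave relationship.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class RelationType(Enum):
    ACTION = "action"                # behaviour of one object toward another
    SPATIAL = "spatial"              # distance, relative position
    DESCRIPTIVE = "descriptive"      # state of one object in relation to another
    PREPOSITIONAL = "prepositional"  # semantic relation expressed by a preposition
    COMPARATIVE = "comparative"      # relative attribute compared with another object


@dataclass
class Node:
    node_id: int                     # unique identification number
    semantic_label: str              # object category
    physical: Dict[str, Any] = field(default_factory=dict)  # e.g. size, 3D position
    visual: Dict[str, Any] = field(default_factory=dict)    # e.g. colour, thumbnail


@dataclass
class Edge:
    subject_id: int                  # arrow tail
    object_id: int                   # arrow head
    relation_type: RelationType
    predicate: str


@dataclass
class SceneGraph3D:
    nodes: Dict[int, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)
    _next_id: int = 0

    def add_node(
        self,
        semantic_label: str,
        physical: Optional[Dict[str, Any]] = None,
        visual: Optional[Dict[str, Any]] = None,
    ) -> Node:
        node = Node(self._next_id, semantic_label, dict(physical or {}), dict(visual or {}))
        self.nodes[node.node_id] = node
        self._next_id += 1           # identifiers are never reused
        return node

    def add_edge(self, subject_id: int, object_id: int,
                 relation_type: RelationType, predicate: str) -> Edge:
        if subject_id not in self.nodes or object_id not in self.nodes:
            raise KeyError("both endpoints must already be nodes of the graph")
        edge = Edge(subject_id, object_id, relation_type, predicate)
        self.edges.append(edge)
        return edge
```

For example, a cup standing on a table would be stored as two nodes and one `PREPOSITIONAL` edge with predicate `"on"` pointing from the cup to the table.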

An apparatus for generating a 3D scene graph according to another embodiment of the present invention may include: an image pre-processing module that performs pre-processing on a series of input image frames; an image removal module that removes unnecessary image frames from among the pre-processed image frames; a keyframe group extraction module that classifies the class of each pre-processed image frame and extracts a keyframe group; an erroneous relationship removal module that detects and removes erroneous relationships in the keyframe group; and a graph generation module that generates a 3D scene graph based on the frames from which the erroneous relationships have been removed.

In an embodiment, the image removal module may include: means for calculating a blurriness V by computing the variance of the Laplacian; means for calculating an exponential moving average St of V; means for performing bias correction on St; means for determining a threshold value based on the bias-corrected exponential moving average; and means for determining an image frame to be removed by comparing V with the threshold value.

In an embodiment, the keyframe group extraction module may include: means for calculating a first value, which is the overlap ratio between the current keyframe and a newly input image frame; and means for determining the newly input image frame to be the next keyframe if the first value is less than or equal to a first reference value.

In an embodiment, the keyframe group extraction module may further include: means for calculating, if the first value exceeds the first reference value, a second value, which is the overlap ratio between the currently activated anchor frame and the newly input image frame; means for determining the newly input image frame to be a new anchor frame if the second value is less than or equal to a second reference value; and means for classifying the newly input image frame as a garbage frame and discarding it if the second value exceeds the second reference value.

In an embodiment, the apparatus for generating a 3D scene graph may further include a pose estimation module that extracts poses based on the pre-processed image frames, and the pose estimation module may extract the relative poses between neighboring image frames using visual odometry or simultaneous localization and mapping (SLAM).

In an embodiment, the apparatus for generating a 3D scene graph may further include a neural network comprising: a region proposal unit that extracts and proposes object regions for a given image frame; an object recognition unit that recognizes the objects contained in the object regions; and a relationship extraction unit that extracts the relationships between pairs of recognized objects.

In an embodiment, the erroneous relationship removal module may include: a duplicate object region removal unit that identifies valid object regions by removing duplicate object regions from among the extracted object regions; an irrelevant object removal unit that identifies valid objects by removing predefined irrelevant objects from among the objects recognized in the identified object regions; and an object relationship filtering unit that removes unnecessary object relationships by filtering the object relationships between the identified objects.

In an embodiment, the graph generation module may include: a local graph generation unit that generates a local 3D scene graph by extracting partial object attributes and relationships between object pairs from the input image frames to form a two-dimensional version of the scene graph and then converting the two-dimensional attributes into three-dimensional attributes; and a global graph generation unit that updates the global 3D scene graph by merging the local 3D scene graph into the current global 3D scene graph.

In an embodiment, the global graph generation unit may detect identical nodes based on the label similarity, position similarity, and color similarity between an original node of the global 3D scene graph and a candidate node from a newly input image frame.

In an embodiment, the global 3D scene graph includes nodes for representing recognized objects and edges for representing relationships between pairs of objects. The node includes at least one of: an identification number, which is a number uniquely assigned to each object recognized in the environment; a semantic label indicating the category of the object; physical attributes of the object; and visual attributes of the object. The edge includes at least one of: an action relationship, showing the behavior of one object toward another; a spatial relationship, showing spatial relations such as distance and relative position; a descriptive relationship, showing the state of one object in relation to another; a prepositional relationship, showing the semantic relationship between two objects as expressed by a preposition; and a comparative relationship, showing a relative attribute of one object compared with another. The master-slave relationship between a pair of objects may be indicated by the direction of an arrow on the global 3D scene graph.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention belongs.

The present invention has the advantage of providing a method and apparatus for generating a 3D scene graph.

In addition, the present invention has the advantage of providing a method and apparatus for generating a 3D scene graph capable of providing a multi-purpose 3D environment model that requires little computation and memory usage/occupancy, contains rich and varied environment information, and has wide applicability, intuitive usability, and high scalability.

In addition, the present invention has the advantage of providing a method and apparatus for generating a 3D scene graph capable of providing a 3D environment model that supplies physical information and semantic information accurately and richly through nodes (vertices) and lines (edges).

In addition, the present invention has the advantage of providing a 3D scene graph that allows an intelligent agent such as a robot to recognize its surrounding environment quickly, accurately, and concisely.

In addition, the present invention has the advantage of providing a 3D scene graph that allows visually impaired people to recognize their surrounding environment more accurately.

In addition, various effects that can be identified directly or indirectly through this document may be provided.

FIG. 1 is a flowchart illustrating a method of generating a 3D scene graph in an apparatus for generating a 3D scene graph according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an image removal procedure according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a pose extraction process according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a method of classifying image frames according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a keyframe group extraction procedure according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a method of removing erroneous relationships according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating the structure of a 3D scene graph according to an embodiment of the present invention.
FIG. 8 is a block diagram illustrating the structure of an apparatus for generating a 3D scene graph according to an embodiment of the present invention.
FIG. 9 shows a pre-processed image frame according to an exemplary embodiment.
FIG. 10 shows an image frame in which object regions are marked, according to an exemplary embodiment.
FIG. 11 shows an image frame in which recognized objects are marked, according to an exemplary embodiment.
FIG. 12 shows a 3D scene graph according to an exemplary embodiment.
FIG. 13 shows a keyframe group extraction algorithm according to an exemplary embodiment.
FIG. 14 is a diagram illustrating an exemplary application of a 3D scene graph according to the present invention.

Hereinafter, some embodiments of the present invention will be described in detail with reference to exemplary drawings. In assigning reference numerals to the components of each drawing, it should be noted that the same components are given the same reference numerals wherever possible, even when they appear in different drawings. In addition, in describing the embodiments of the present invention, detailed descriptions of related known configurations or functions are omitted where they are judged to hinder understanding of the embodiments.

In describing the components of the embodiments of the present invention, terms such as first, second, A, B, (a), and (b) may be used. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. In addition, unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and, unless explicitly defined in the present application, are not to be interpreted in an idealized or excessively formal sense.

Hereinafter, embodiments of the present invention will be described in detail with reference to FIGS. 1 to 13.

FIG. 1 is a flowchart illustrating a method of generating a 3D scene graph in an apparatus for generating a 3D scene graph according to an embodiment of the present invention.

Hereinafter, for convenience of description, the apparatus for generating a 3D scene graph is referred to simply as the "device". The configuration of the device will be described in more detail with reference to a later drawing.

Referring to FIG. 1, when a series of images is input, the device may pre-process the input images (S110).

For example, in the image pre-processing step, noise contained in the image may be removed, and appropriate scaling and cropping may be applied.

The device may remove blurry image frames from among the pre-processed image frames (S120). By removing blurry image frames, the device can ensure stable performance in the subsequent steps. The method of removing blurry image frames is described in detail below with reference to FIG. 2.

The device may estimate poses based on the pre-processed image frames (S130). For example, the device may extract the relative poses between neighboring frames using visual odometry or simultaneous localization and mapping (SLAM). The estimated poses may be used to extract the keyframe group and the physical attributes of objects. In an embodiment, ElasticFusion and BundleFusion may be used to estimate the poses.

In general, SLAM is a concept applied to intelligent agents such as robot vacuum cleaners. It is a technique in which an agent moves through an arbitrary space, explores its surroundings, and estimates both a map of that space and its own current position.

The device may extract a keyframe group based on the extracted poses (S140). In the keyframe group extraction step, the device may classify each image frame as a keyframe, an anchor frame, or a garbage frame. The specific method of extracting a keyframe group is described in detail below with reference to FIG. 5.

The device may detect and remove erroneous relationships in the extracted keyframe group (S150). The device may remove erroneous relationships by removing duplicate object regions from the frames included in the keyframe group, removing predefined irrelevant objects from among the objects recognized and classified in the object regions, and then filtering the object relationships. The specific method of removing erroneous relationships is described in detail below with reference to FIG. 6.

The device according to an embodiment may detect object regions, and recognize and classify the objects within them, using a Faster R-CNN (Faster Region-based Convolutional Neural Network) that includes a region proposal network and a detection network.

또한, 장치는 인식된 객체 쌍간의 관계를 추출하고, 객체의 물리적 속성 및 시각적 특징을 추출할 수 있다. Also, the device may extract a relationship between a pair of recognized objects, and extract physical properties and visual characteristics of the object.

After removing the erroneous relationships, the device may generate a local 3D scene graph (S160).

The device may generate and update a global 3D scene graph by merging the local 3D scene graphs (S170).

A method of generating the 3D scene graph is described in detail below with reference to FIG. 7.

FIG. 2 is a diagram for explaining an image removal procedure according to an embodiment of the present invention.

In detail, FIG. 2 illustrates a procedure for removing unclear image frames from among the pre-processed image frames.

Referring to FIG. 2, the device may calculate the blurriness V of an image frame by computing the variance of the Laplacian (S210). Here, V may be calculated according to the equation indicated by reference numeral 211.

The device may calculate an exponential moving average St of the calculated V (S220). Here, St may be calculated according to the equation indicated by reference numeral 221.

The device may perform bias correction on the exponential moving average St (S230). Here, the bias-corrected exponential moving average St' may be calculated according to the equation indicated by reference numeral 231.

The device may determine a threshold t_blurry based on the bias-corrected exponential moving average St' (S240). Here, the threshold t_blurry may be determined according to the equation indicated by reference numeral 241.

The device may determine which of the pre-processed image frames to remove by comparing the blurriness V calculated in step S210 with the threshold t_blurry determined in step S240 (S250).
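The steps S210 to S250 can be sketched as follows. This is a minimal illustration, not the equations of reference numerals 211 to 241, which are not reproduced in this text. The smoothing factor `beta`, the rule `t_blurry = ratio * St'`, and the decision that a frame is blurry when V falls below the threshold are all assumptions made for the example.

```python
import numpy as np


def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 4-neighbour discrete Laplacian of a grayscale image."""
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())


class BlurFilter:
    """Tracks a bias-corrected exponential moving average of V."""

    def __init__(self, beta: float = 0.9, ratio: float = 0.5):
        self.beta = beta    # EMA smoothing factor (assumed value)
        self.ratio = ratio  # threshold = ratio * corrected EMA (assumed rule)
        self.s = 0.0        # St, initialised to zero
        self.t = 0          # number of frames seen so far

    def is_blurry(self, gray: np.ndarray) -> bool:
        v = laplacian_variance(gray)                        # S210
        self.t += 1
        self.s = self.beta * self.s + (1 - self.beta) * v   # S220
        s_hat = self.s / (1 - self.beta ** self.t)          # S230: bias correction
        t_blurry = self.ratio * s_hat                       # S240 (assumed form)
        return v < t_blurry                                 # S250 (assumed direction)
```

Because the moving average starts at zero, the bias correction in S230 keeps the threshold meaningful for the first few frames instead of letting it start near zero.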

FIG. 3 is a diagram for explaining a pose extraction process according to an embodiment of the present invention.

Referring to FIG. 3, the device may extract the relative pose between neighboring image frames using visual odometry or simultaneous localization and mapping (SLAM).

The extracted poses may be used to extract the physical attributes of the objects recognized in the keyframe group. In an embodiment, ElasticFusion and/or BundleFusion may be used for pose estimation.

FIG. 4 is a diagram for explaining a method of classifying image frames according to an embodiment of the present invention.

The device may classify the frames of the pre-processed image stream into the following three classes in order to extract keyframe groups and remove unnecessary image frames. The three classes may be the keyframe, the anchor frame, and the garbage frame.

A keyframe serves as a reference frame; it may be used to determine the first anchor frame and to determine the coverage of the keyframe group.

The device may select the first non-blurry incoming frame as the first keyframe.

An anchor frame may be active or inactive. Only the most recent anchor frame is active, and the active anchor frame may be used to determine the next anchor frame.

A garbage frame is an image frame that is neither a keyframe nor an anchor frame; it is regarded as an overlapping or redundant frame and may be discarded.

FIG. 5 is a flowchart illustrating a keyframe group extraction procedure according to an embodiment of the present invention.

Referring to FIG. 5, upon receiving a series of image frames, the device may calculate a first value, which is the overlap ratio between the current keyframe and the input image frame (S510 to S520).

The device may compare the calculated first value with a predefined first reference value (S530).

If the comparison shows that the first value is less than or equal to the first reference value, the device may determine the newly input image frame to be the next keyframe (S540). The image frame determined to be the next keyframe may be added to the keyframe group.

If the comparison in step S530 shows that the first value exceeds the first reference value, the device may calculate a second value, which is the overlap ratio between the currently active anchor frame and the input image frame (S550).

The device may compare the calculated second value with a predefined second reference value (S560). Here, the second reference value may be defined to be greater than the first reference value.

If the comparison shows that the second value is less than or equal to the second reference value, the device may determine the newly input image frame to be a new anchor frame (S570). In this case, the currently active anchor frame may be deactivated and the newly determined anchor frame may be activated.

If the comparison in step S560 shows that the second value exceeds the second reference value, the device may keep the currently active anchor frame as the anchor frame, classify the newly input image frame as a garbage frame, and discard it (S580).
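The decision logic of FIG. 5 can be sketched as a small function. The final branch is taken to be the case in which the second value exceeds the second reference value. The function name and the threshold values used in the usage note are hypothetical, and the function receives both overlap ratios up front for simplicity, whereas the flowchart computes the second value only when it is needed.

```python
def classify_frame(first_value: float, second_value: float,
                   t1: float, t2: float) -> str:
    """Classify an incoming frame from its two overlap ratios.

    first_value:  overlap ratio with the current keyframe (S520)
    second_value: overlap ratio with the active anchor frame (S550)
    t1, t2:       first and second reference values, with t2 > t1
    """
    if not t1 < t2:
        raise ValueError("the second reference value must exceed the first")
    if first_value <= t1:       # S530 -> S540: little overlap with the keyframe
        return "keyframe"
    if second_value <= t2:      # S560 -> S570: moderate overlap with the anchor
        return "anchor"
    return "garbage"            # S580: mostly redundant frame, discard
```

For example, with assumed reference values of 0.3 and 0.7, a frame with overlap ratios (0.5, 0.6) becomes the new active anchor frame, while a frame with overlap ratios (0.5, 0.8) is discarded.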

Hereinafter, the overlap calculation method according to an exemplary embodiment will be described in detail.

FIG. 6 is a flowchart illustrating a method of removing erroneous relationships according to an embodiment of the present invention.

Referring to FIG. 6, the device may extract object regions for each image frame (S610). Here, the device may extract and propose object regions for the image frames included in the keyframe group.

The device may remove duplicate object regions from among the extracted object regions (S620). By removing the duplicate object regions, the device can identify the valid object regions.

The device may recognize the objects in the valid object regions and remove predefined irrelevant objects from among the recognized objects (S630). For example, the predefined irrelevant objects may include, but are not limited to, roads, sky, buildings, and moving objects. By removing the irrelevant objects, the device can identify the valid objects.
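Steps S620 and S630 can be sketched as follows. The text does not say how duplicate regions are detected, so this example assumes a common approach: two boxes are duplicates when their intersection-over-union (IoU) exceeds a threshold, and the higher-scoring box is kept. The function names, the detection tuple layout, and the threshold are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Detection = Tuple[Box, str, float]        # (box, label, score)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def remove_duplicate_regions(dets: List[Detection],
                             iou_thresh: float = 0.5) -> List[Detection]:
    """S620 (assumed rule): keep the best-scoring box of each overlapping set."""
    kept: List[Detection] = []
    for det in sorted(dets, key=lambda d: d[2], reverse=True):
        if all(iou(det[0], k[0]) <= iou_thresh for k in kept):
            kept.append(det)
    return kept


def remove_irrelevant(dets: List[Detection],
                      irrelevant: FrozenSet[str] = frozenset(
                          {"road", "sky", "building"})) -> List[Detection]:
    """S630: drop objects whose label is in the predefined irrelevant set."""
    return [d for d in dets if d[1] not in irrelevant]
```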

The device may filter the object relationships between the identified objects to remove unnecessary relationships between them (S640).

For example, the device may examine the relationships detected between each object pair across the several frames included in one keyframe group, and include the most frequently occurring relationship (that is, the top relationship) in the scene graph. If there is a tie, the device may add all of the tied top relationships to the scene graph.
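The majority-vote filtering described above, including the tie rule, can be sketched as follows. The function name and the triple layout `(subject_id, relation, object_id)` are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[int, str, int]   # (subject_id, relation, object_id)


def top_relations(observations: Iterable[Triple]) -> Dict[Tuple[int, int], List[str]]:
    """Keep, for each object pair, the relation(s) observed most often
    across all frames of one keyframe group. Ties keep every top relation."""
    counts: Dict[Tuple[int, int], Counter] = {}
    for subj, rel, obj in observations:
        counts.setdefault((subj, obj), Counter())[rel] += 1

    result: Dict[Tuple[int, int], List[str]] = {}
    for pair, counter in counts.items():
        best = max(counter.values())
        result[pair] = sorted(r for r, n in counter.items() if n == best)
    return result
```

A relation that is detected in only one frame of the group is therefore outvoted by a relation that is detected consistently for the same object pair.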

The device may calculate the probability of a relation using the pre-collected Visual Genome dataset and pixel distance information (d_pixel) modeled with a Gaussian distribution, and may store the relationship statistics for object pairs in a memory (not shown).

FIG. 7 is a diagram for explaining the structure of a 3D scene graph according to an embodiment of the present invention.

Referring to FIG. 7, a 3D scene graph represents an image as a graph structure. It may broadly comprise nodes (710, 720), which represent object instances, and edges (730), which represent the mutual relationship between two adjacent objects (that is, nodes).

Compared with earlier text-based representations of visual scenes, a 3D scene graph-based representation can provide much more contextual information about relative geometry and semantics.

Each node (710, 720) may have an identification number, which is a unique number assigned to a specific object in a given environment, and a semantic label indicating the category of the object.

In addition, each node (710, 720) may include physical attributes (711), which express the size of the object, its dominant color, its position relative to the first keyframe, and the like, and visual features (712), which express extracted visual characteristics such as a thumbnail, a color histogram, or descriptor-based features such as SIFT (Scale-Invariant Feature Transform) and ORB (Oriented FAST and Rotated BRIEF).

An edge (730) may express various kinds of relationships between two objects.

For example, an edge (730) may express an action relationship, a spatial relationship, a descriptive relationship, a prepositional relationship, a comparative relationship, and the like.

An action relationship shows the behavior of one object toward another. Examples include "feeding".

A spatial relationship shows a spatial relation such as distance or relative position. Examples include "in front of".

A descriptive relationship shows the state of one object in relation to another. Examples include "wear".

A prepositional relationship shows a semantic relationship between two objects that is expressed by a preposition. Examples include "with", "on", and "above".

A comparative relationship shows a relative attribute of one object compared with another. Examples include "smaller".
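The node and edge structure of FIG. 7 can be sketched with plain data classes. The class and field names below are hypothetical, and the attribute dictionaries are left untyped because the text lists their contents only by example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    node_id: int                      # unique identification number
    label: str                        # semantic label (object category)
    physical: Dict[str, object] = field(default_factory=dict)  # size, color, position
    visual: Dict[str, object] = field(default_factory=dict)    # thumbnail, histogram, descriptors


@dataclass(frozen=True)
class Edge:
    subject: int    # node_id of the source object
    relation: str   # e.g. "have", "on", "in front of"
    obj: int        # node_id of the target object
    kind: str       # "action", "spatial", "descriptive", "prepositional", "comparative"


@dataclass
class SceneGraph:
    nodes: Dict[int, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, edge: Edge) -> None:
        if edge.subject not in self.nodes or edge.obj not in self.nodes:
            raise KeyError("both endpoints must already be nodes of the graph")
        self.edges.append(edge)

    def triples(self) -> List[Tuple[str, str, str]]:
        """Return each edge as a (subject label, relation, object label) triple."""
        return [(self.nodes[e.subject].label, e.relation, self.nodes[e.obj].label)
                for e in self.edges]
```

Adding a "wall" node, a "picture" node, and an edge with the relation "have" from the wall to the picture yields the single triple ("wall", "have", "picture").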

FIG. 8 is a block diagram illustrating the structure of a 3D scene graph generating device according to an embodiment of the present invention.

Hereinafter, for convenience of description, the 3D scene graph generating device (800) is referred to simply as the "device (800)".

Referring to FIG. 8, the device (800) may include an image receiving module (810), an image pre-processing module (820), an image removal module (830), a pose estimation module (840), a keyframe group extraction module (850), an erroneous relationship removal module (860), a neural network (870), and a graph generation module (880).

The image receiving module (810) may receive a series of image frames from an image sensor (for example, a camera).

The image pre-processing module (820) may remove noise from the image and perform appropriate scaling and cropping.

The image removal module (830) may remove unclear image frames from among the pre-processed image frames. Removing unclear image frames helps ensure stable performance in the subsequent stages.

The pose estimation module (840) may estimate poses based on the pre-processed image frames. For example, the pose estimation module may extract the relative pose between neighboring frames using visual odometry or simultaneous localization and mapping (SLAM). The estimated poses may be used to extract the keyframe group and the physical attributes of objects. In an embodiment, ElasticFusion and BundleFusion may be used to estimate the poses.

The keyframe group extraction module (850) may include an overlap calculation unit (851), a class classification unit (852), and a group update unit (853).

The overlap calculation unit (851) may calculate a first value, which is the overlap ratio between a newly input image frame and the keyframe.

The overlap calculation unit (851) may also calculate a second value, which is the overlap ratio between the newly input image frame and the anchor frame.

The class classification unit (852) may classify keyframes by comparing the first value with the first reference value, and classify anchor frames by comparing the second value with the second reference value.

The class classification unit (852) may also classify any frame that is classified as neither a keyframe nor an anchor frame as a garbage frame and discard it.

The group update unit (853) may update the keyframe group by adding the keyframes classified by the class classification unit (852) to the keyframe group.

For the detailed operation of the keyframe group extraction module (850), refer to the description of FIG. 5 above.

The erroneous relationship removal module (860) may include an overlapping object region removal unit (861), an irrelevant object removal unit (862), and an object relationship filtering unit (863).

The neural network (870) may include a region proposal unit (871), an object recognition unit (872), and a relationship extraction unit (873).

The erroneous relationship removal module (860) may detect and remove errors in the object regions extracted by the neural network (870), in the objects recognized within those regions, and in the relationships between the recognized objects.

The overlapping object region removal unit (861) may remove overlapping object regions from among the object regions extracted and proposed by the region proposal unit (871).

The irrelevant object removal unit (862) may remove unnecessary objects from among the objects recognized by the object recognition unit (872). Here, the unnecessary objects may be predefined and may include, but are not limited to, roads, sky, buildings, and moving objects.

The object relationship filtering unit (863) may perform object relationship filtering on the relationships extracted by the relationship extraction unit (873) to remove unnecessary relationships between objects.

For example, the object relationship filtering unit (863) may examine the relationships detected between each object pair across the several frames included in one keyframe group, and decide to include the most frequently occurring relationship (that is, the top relationship) in the scene graph. If there is a tie, the object relationship filtering unit (863) may decide to add all of the tied top relationships to the scene graph.

The object relationship filtering unit (863) according to an embodiment of the present invention may calculate the probability of a relation using the pre-collected Visual Genome dataset and pixel distance information (d_pixel) modeled with a Gaussian distribution, and may store the relationship statistics for object pairs in a memory (not shown).

Here, the relation probability p(r|x,y) may be calculated by the following equation.

Here, p_dict(r|x,y) is the statistic obtained from the Visual Genome dataset.

The graph generation module (880) may include a local graph generation unit (881) and a global graph generation unit (882).

The local graph generation unit (881) may extract partial object attributes and relationships between object pairs from an input image frame to form a two-dimensional version of the scene graph, and then convert the two-dimensional attributes into three-dimensional attributes to generate a local 3D scene graph.

Because representing the position of an object by a single center point can introduce measurement error, the 3D position of an object may be represented as a Gaussian distribution. The local graph generation unit (881) may divide the object region into a plurality of sub-regions (for example, a 5×5 grid) and then crop the central rectangle of the object region. The local graph generation unit (881) may then calculate the 3D position p' of each point p of the central rectangle, expressed with respect to the first keyframe, according to the following equation.

Here, i and o denote the indices of the current frame and the first keyframe, respectively.
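The equation itself is not reproduced in this text, so the following is only a sketch of one common formulation and not necessarily the one used in the source. It assumes that each pose is a 4×4 camera-to-world matrix, that the points of the central rectangle have already been back-projected into the camera coordinates of frame i, and that the Gaussian is summarized by the sample mean and covariance. The function and parameter names are hypothetical.

```python
import numpy as np


def position_gaussian(points_cam_i, T_world_i, T_world_o):
    """Express points of frame i in the first keyframe o and fit a Gaussian.

    points_cam_i: (N, 3) points in the camera coordinates of frame i, N >= 2
    T_world_i:    4x4 camera-to-world pose of the current frame i (assumed convention)
    T_world_o:    4x4 camera-to-world pose of the first keyframe o (assumed convention)
    Returns the mean (3,) and covariance (3, 3) of the transformed points.
    """
    pts = np.asarray(points_cam_i, dtype=np.float64)
    homog = np.hstack([pts, np.ones((len(pts), 1))])    # homogeneous coordinates
    T_o_i = np.linalg.inv(T_world_o) @ T_world_i        # frame i -> keyframe o
    pts_o = (T_o_i @ homog.T).T[:, :3]
    mean = pts_o.mean(axis=0)
    cov = np.cov(pts_o, rowvar=False)                   # sample covariance
    return mean, cov
```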

Next, the local graph generation unit (881) may calculate the color histogram h_H,S,V of the object according to the following equation.

Here, (H, S, V) denotes the three axes of the color space and N denotes the number of pixels. The (H, S, V) space generally performs better than the (R, G, B) space. Each axis of the color space is quantized into c bins, so that the histogram has size c^3. The area inside the bounding box becomes the thumbnail of the object.
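A quantized HSV histogram of this kind can be sketched as follows. The equation is not reproduced in this text, so the normalization by the pixel count N is an assumption, as is the convention that each channel is already scaled to the range [0, 1]. The function name is hypothetical.

```python
import numpy as np


def hsv_histogram(hsv_pixels, c: int = 4) -> np.ndarray:
    """Quantize each HSV axis into c bins and return a histogram of size c**3.

    hsv_pixels: array of shape (N, 3) or (H, W, 3), channels scaled to [0, 1]
    """
    px = np.asarray(hsv_pixels, dtype=np.float64).reshape(-1, 3)
    idx = np.minimum((px * c).astype(int), c - 1)        # per-channel bin index
    flat = (idx[:, 0] * c + idx[:, 1]) * c + idx[:, 2]   # joint bin index
    hist = np.bincount(flat, minlength=c ** 3).astype(np.float64)
    return hist / len(px)                                # normalize by N (assumed)
```

With c = 4 the histogram has 64 entries, which keeps the per-node storage small while still separating clearly different colors.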

The global graph generation unit (882) may generate a global 3D scene graph by merging and updating the local 3D scene graphs, each of which is generated for one image frame.

The global graph generation unit (882) may merge the individual 3D scene graphs for single image frames (that is, the local 3D scene graphs) into one global 3D scene graph and update the nodes and edges of the global 3D scene graph accordingly. Because the camera view and position change, the neural network may extract different information for the same object. The global graph generation unit (882) may compensate for these variations and construct a global 3D scene graph for the entire environment.

The global graph generation unit (882) may detect identical nodes based on the label similarity, position similarity, and color similarity between an original node of the global 3D scene graph and a candidate node of the current frame.

Without identical-node detection, the same node would be added to the 3D scene graph many times, the number of nodes in the graph could grow explosively, and multiple observations of the same object could not be integrated effectively.
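Identical-node detection can be sketched as follows. The text names the three similarities but not their definitions or how they are combined. This example therefore assumes exact label matching, a Gaussian of the Euclidean distance for position, histogram intersection for color, and a product of the three compared against a threshold. All names and parameter values are hypothetical.

```python
import numpy as np


def same_node_score(orig: dict, cand: dict, sigma: float = 0.5) -> float:
    """Combine label, position, and color similarity (assumed definitions)."""
    label_sim = 1.0 if orig["label"] == cand["label"] else 0.0
    d = np.linalg.norm(np.asarray(orig["position"], dtype=float)
                       - np.asarray(cand["position"], dtype=float))
    position_sim = float(np.exp(-d ** 2 / (2 * sigma ** 2)))
    color_sim = float(np.minimum(orig["hist"], cand["hist"]).sum())
    return label_sim * position_sim * color_sim


def find_same_node(global_nodes: dict, cand: dict, threshold: float = 0.5):
    """Return the id of the best-matching global node, or None if no match."""
    best_id, best = None, threshold
    for node_id, node in global_nodes.items():
        score = same_node_score(node, cand)
        if score >= best:
            best_id, best = node_id, score
    return best_id
```

When `find_same_node` returns an id, the candidate's observation would be merged into that node; when it returns None, the candidate would be added as a new node.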

FIG. 9 shows a pre-processed image frame according to an exemplary embodiment.

The image pre-processing module (820) may remove the noise contained in an input image frame and scale the denoised image to a size suitable for the subsequent processing stages.

The image removal module (830) may discard, from among the pre-processed image frames, any image whose blurriness exceeds a reference value.

The image shown in FIG. 9 may be an image at the output of the image removal module (830), but is not limited thereto.

FIG. 10 shows an image frame in which object regions are marked, according to an exemplary embodiment.

As shown in FIG. 10, the neural network (870) of the device (800) may extract and propose object regions from a pre-processed image frame.

Referring to FIG. 10, the parts marked with rectangular boxes are the object regions proposed by the neural network (870).

FIG. 11 shows an image frame in which recognized objects are marked, according to an exemplary embodiment.

As shown at reference numeral 1110 of FIG. 11, the object identification number and semantic label corresponding to each object recognized by the neural network (870) may be displayed at one side of the object region.

As shown at reference numeral 1120, the recognized objects may be displayed in graph form in the 3D scene graph.

FIG. 12 shows a 3D scene graph according to an exemplary embodiment.

As shown at reference numeral 1210, relationships between object pairs may be displayed in the 3D scene graph.

Referring to reference numeral 1211, the relationship between an object pair, in which a wall "has" a picture, can be seen intuitively in the 3D scene graph.

Referring to reference numeral 1220, the physical attributes and visual features of objects may be displayed in the 3D scene graph.

Referring to reference numeral 1211, the 3D scene graph shows intuitively that the physical attribute of the picture is "square", and that the physical attribute and visual feature of the wall are "wide" and "brown", respectively.

As shown in FIG. 12, the direction of the relationship between an object pair (which object is principal and which is subordinate) may be identified by an arrow. For example, referring to reference numeral 1211, because the wall has the picture, the arrow may point from the wall to the picture.

FIG. 13 shows a keyframe group extraction algorithm according to an exemplary embodiment.

Through the keyframe group extraction algorithm of FIG. 13, the device (800) may not only generate keyframe groups but also determine new anchor frames.

With the 3D scene graph according to an embodiment of the present invention, an intelligent agent can understand a given environment accurately and intuitively. The intelligent agent can therefore perform a variety of tasks in a variety of ways. Such tasks may include visual question answering (VQA), task planning, 3D space captioning, 3D environment model generation, and place recognition.

FIG. 14 is a diagram for explaining an exemplary application of the 3D scene graph according to the present invention.

In detail, FIG. 14 shows a process in which a robot, as an intelligent agent, uses a 3D scene graph to autonomously formulate a problem description in PDDL (Planning Domain Definition Language) and perform a given task. Problem descriptions are generally written by a person. An intelligent agent equipped with a 3D scene graph, however, has the advantage of being able to plan tasks automatically, with improved autonomy.

Referring to FIG. 14, once the robot state and the goal (that is, the mission) are initialized, the 3D scene graph may be input to the robot, which is the intelligent agent (S1410 to S1420).

The intelligent agent may update the object list and the initial state by interpreting nodes until all objects included in the 3D scene graph have been recognized (S1430 to S1450).
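The idea of turning a scene graph into a PDDL problem description can be sketched as follows. The predicate names, the object naming scheme `label_id`, and the function signature are hypothetical, and a matching domain file defining those predicates is assumed to exist.

```python
from typing import Dict, List, Tuple


def scene_graph_to_pddl(problem: str, domain: str,
                        nodes: Dict[int, str],
                        edges: List[Tuple[int, str, int]],
                        goal: str) -> str:
    """Build a PDDL problem string from scene graph nodes and edges.

    nodes: {node_id: semantic label}
    edges: [(subject_id, relation, object_id)], relation used as the predicate
    goal:  a PDDL goal expression, e.g. "(holding cup_2)"
    """
    def name(node_id: int) -> str:
        return f"{nodes[node_id]}_{node_id}"

    objects = " ".join(name(i) for i in sorted(nodes))           # object list
    init = "\n    ".join(f"({rel} {name(s)} {name(o)})"          # initial state
                         for s, rel, o in edges)
    return (f"(define (problem {problem})\n"
            f"  (:domain {domain})\n"
            f"  (:objects {objects})\n"
            f"  (:init\n    {init})\n"
            f"  (:goal {goal}))")
```

Each node contributes one entry to the object list and each edge contributes one fact to the initial state, which mirrors the update of the object list and initial state during node interpretation.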

본 명세서에 개시된 실시 예들과 관련하여 설명된 방법 또는 알고리즘의 단계는 프로세서에 의해 실행되는 하드웨어, 소프트웨어 모듈, 또는 그 2 개의 결합으로 직접 구현될 수 있다. 소프트웨어 모듈은 RAM 메모리, 플래시 메모리, ROM 메모리, EPROM 메모리, EEPROM 메모리, 레지스터, 하드 디스크, 착탈형 디스크, CD-ROM과 같은 저장 매체(즉, 메모리 및/또는 스토리지)에 상주할 수도 있다. The steps of a method or algorithm described in connection with the embodiments disclosed herein may be directly implemented in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in a storage medium (ie, memory and/or storage) such as RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM.

예시적인 저장 매체는 프로세서에 커플링되며, 그 프로세서는 저장 매체로부터 정보를 판독할 수 있고 저장 매체에 정보를 기입할 수 있다. 다른 방법으로, 저장 매체는 프로세서와 일체형일 수도 있다. 프로세서 및 저장 매체는 주문형 집적회로(ASIC) 내에 상주할 수도 있다. ASIC는 사용자 단말기 내에 상주할 수도 있다. 다른 방법으로, 프로세서 및 저장 매체는 사용자 단말기 내에 개별 컴포넌트로서 상주할 수도 있다.An exemplary storage medium is coupled to the processor, the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral with the processor. The processor and storage medium may reside within an application specific integrated circuit (ASIC). The ASIC may reside within the user terminal. Alternatively, the processor and storage medium may reside as separate components within the user terminal.

이상의 설명은 본 발명의 기술 사상을 예시적으로 설명한 것에 불과한 것으로서, 본 발명이 속하는 기술 분야에서 통상의 지식을 가진 자라면 본 발명의 본질적인 특성에서 벗어나지 않는 범위에서 다양한 수정 및 변형이 가능할 것이다. The above description is merely illustrative of the technical spirit of the present invention, and various modifications and variations will be possible without departing from the essential characteristics of the present invention by those skilled in the art to which the present invention pertains.

Therefore, the embodiments disclosed herein are intended to explain, not to limit, the technical spirit of the present invention, and the scope of that technical spirit is not limited by these embodiments. The scope of protection of the present invention should be construed according to the following claims, and all technical ideas within their equivalent scope should be construed as falling within the scope of the present invention.

Claims

1. A method for generating a 3D scene graph, the method comprising:
(a) performing pre-processing on a series of input image frames;
(b) removing, from among the pre-processed image frames, an image frame having a blurriness exceeding a reference value;
(c) classifying a class of each pre-processed image frame;
(d) extracting a keyframe group based on the classification result;
(e) detecting and removing an erroneous relationship with respect to the keyframe group; and
(f) generating a 3D scene graph based on the frames from which the erroneous relationship has been removed,
wherein the relationships detected between object pairs across the multiple frames included in the keyframe group are examined, and the most frequently occurring relationship is included in the scene graph.
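The final limitation of claim 1 (keeping the relationship that occurs most often for each object pair across the frames of a keyframe group) can be illustrated with a short Python sketch. The triple layout `(subject, predicate, object)`, the function name `vote_relationships`, and the tie-breaking behavior are assumptions made for illustration only; the claim does not specify them.

```python
from collections import Counter, defaultdict


def vote_relationships(frame_detections):
    """Pick the most frequent predicate for every (subject, object) pair.

    frame_detections: iterable of per-frame lists of
    (subject, predicate, object) triples (hypothetical layout).
    """
    votes = defaultdict(Counter)
    for triples in frame_detections:
        for subject, predicate, obj in triples:
            votes[(subject, obj)][predicate] += 1
    # Counter.most_common breaks ties by first-seen order; that tie rule
    # is a property of this sketch, not something the claim defines.
    return {pair: counts.most_common(1)[0][0] for pair, counts in votes.items()}


if __name__ == "__main__":
    frames = [
        [("cup", "on", "table")],
        [("cup", "on", "table"), ("cup", "near", "table")],
        [("cup", "near", "table"), ("cup", "on", "table"), ("lamp", "next to", "sofa")],
    ]
    print(vote_relationships(frames))
```

In this example "on" is detected three times and "near" twice for the cup/table pair, so only "on" would enter the scene graph for that pair.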

2. The method of claim 1,
wherein step (b) comprises:
calculating a variance of the Laplacian to obtain a blurriness V;
calculating an exponential moving average St of V;
performing bias correction on St;
determining a threshold value based on the bias-corrected exponential moving average; and
determining an image frame to be removed by comparing V with the threshold value.
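As an illustration of the blur test recited in claim 2, the following Python sketch computes the variance of a 4-neighbour Laplacian, tracks an exponential moving average St of V, applies bias correction, and compares V with a threshold derived from the corrected average. Several details are assumptions, since the claim does not fix them: the smoothing factor `beta`, the scale factor `ratio`, the bias correction of the form St / (1 - beta^t), and the rule that a frame is flagged when V falls below `ratio` times the corrected average. Because a low Laplacian variance generally indicates a blurrier image, the sketch treats a low V as blurry.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    gray: 2D grayscale image as a list of equal-length rows (at least 3x3).
    """
    height, width = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def select_blurry_frames(values, beta=0.9, ratio=0.5):
    """Return indices of frames flagged for removal.

    values: per-frame V (variance of the Laplacian), in temporal order.
    beta, ratio: hypothetical tuning parameters, not taken from the patent.
    """
    removed = []
    s = 0.0
    for t, v in enumerate(values, start=1):
        s = beta * s + (1.0 - beta) * v      # exponential moving average St
        s_hat = s / (1.0 - beta ** t)        # bias-corrected average (assumed form)
        threshold = ratio * s_hat            # assumed threshold rule
        if v < threshold:                    # low variance is treated as blurry
            removed.append(t - 1)
    return removed


if __name__ == "__main__":
    print(select_blurry_frames([100.0, 110.0, 95.0, 20.0, 105.0]))
```

The bias correction matters mainly for the first few frames, where an average initialized at zero would otherwise underestimate the typical value of V.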

3. The method of claim 2,
wherein the class includes a keyframe, an anchor frame, and a garbage frame,
the keyframe is used as a reference frame to determine the coverage of the first anchor frame and the keyframe group,
the anchor frame can be activated or deactivated, and the activated anchor frame is used to determine the next anchor frame, and
the garbage frame is a frame that is discarded.

4. The method of claim 3,
wherein step (c) comprises:
calculating a first value that is an overlap ratio between a current keyframe and a newly input image frame; and
determining the newly input image frame to be a next keyframe if the first value is less than or equal to a first reference value.

5. The method of claim 4,
wherein step (c) further comprises:
calculating a second value that is an overlap ratio between the currently activated anchor frame and the newly input image frame when the first value exceeds the first reference value;
determining the newly input image frame to be a new anchor frame when the second value is less than or equal to a second reference value; and
classifying the newly input image frame as the garbage frame and discarding it when the second value exceeds the second reference value.
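The two-stage classification of claims 4 and 5 can be sketched in Python as follows. The representation of a frame as a set of visible landmark IDs, the definition of the overlap ratio as the shared fraction of the new frame's landmarks, and the reference values 0.3 and 0.7 are all hypothetical; the claims only require an overlap ratio and two reference values.

```python
def overlap_ratio(reference, frame):
    """Fraction of the new frame's landmarks that the reference frame also sees.

    Both arguments are sets of landmark IDs (an assumed representation).
    """
    if not frame:
        return 0.0
    return len(reference & frame) / len(frame)


def classify_frame(keyframe, active_anchor, frame, first_ref=0.3, second_ref=0.7):
    """Classify a new frame as 'keyframe', 'anchor', or 'garbage'."""
    first_value = overlap_ratio(keyframe, frame)
    if first_value <= first_ref:
        return "keyframe"   # little overlap with the current keyframe
    # The second value is only computed when the first exceeds its reference.
    second_value = overlap_ratio(active_anchor, frame)
    if second_value <= second_ref:
        return "anchor"     # enough new content relative to the active anchor
    return "garbage"        # mostly redundant, so the frame is discarded


if __name__ == "__main__":
    keyframe = set(range(1, 11))        # landmarks 1..10
    active_anchor = set(range(6, 16))   # landmarks 6..15
    print(classify_frame(keyframe, active_anchor, set(range(20, 26))))
    print(classify_frame(keyframe, active_anchor, {8, 9, 10, 16, 17, 18, 19, 20}))
    print(classify_frame(keyframe, active_anchor, {6, 7, 8, 9, 10}))
```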

6. The method of claim 5,
further comprising extracting a pose between step (b) and step (c), wherein relative poses between neighboring image frames are extracted using visual odometry or simultaneous localization and mapping (SLAM).

7. The method of claim 6,
wherein the extracted relative poses are used to extract physical properties of the objects recognized in the keyframe group.

8. The method of claim 1,
wherein step (e) comprises:
extracting object regions corresponding to the keyframes included in the extracted keyframe group;
identifying valid object regions by removing duplicate object regions from among the extracted object regions;
identifying valid objects by removing predefined irrelevant objects from among the objects recognized in the identified object regions; and
filtering the object relationships between the identified objects.
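One possible reading of the filtering steps of claim 8 is sketched below in Python. The claim does not say how duplicates are detected or which relationships are filtered, so every concrete choice here is an assumption: duplicate regions are found with an intersection-over-union (IoU) test that keeps the higher-scoring box, irrelevant objects are matched against a label blocklist, and a relationship is kept only if both of its endpoints survive and differ. The dictionary keys `id`, `label`, `box`, and `score` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def remove_duplicate_regions(detections, iou_threshold=0.5):
    """Keep the highest-scoring detection among heavily overlapping boxes."""
    kept = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        if all(iou(det["box"], k["box"]) < iou_threshold for k in kept):
            kept.append(det)
    return kept


def remove_irrelevant_objects(detections, irrelevant_labels):
    """Drop detections whose label is on a predefined blocklist."""
    return [d for d in detections if d["label"] not in irrelevant_labels]


def filter_relationships(relations, detections):
    """Keep (subject_id, predicate, object_id) triples between distinct valid objects."""
    valid_ids = {d["id"] for d in detections}
    return [
        r for r in relations
        if r[0] in valid_ids and r[2] in valid_ids and r[0] != r[2]
    ]


if __name__ == "__main__":
    dets = [
        {"id": 0, "label": "cup", "box": (0, 0, 10, 10), "score": 0.9},
        {"id": 1, "label": "cup", "box": (1, 1, 10, 10), "score": 0.6},
        {"id": 2, "label": "wall", "box": (20, 20, 30, 30), "score": 0.8},
        {"id": 3, "label": "table", "box": (40, 40, 60, 60), "score": 0.7},
    ]
    rels = [(0, "on", 3), (1, "on", 3), (0, "near", 2), (0, "same as", 0)]
    valid = remove_irrelevant_objects(remove_duplicate_regions(dets), {"wall"})
    print([d["id"] for d in valid], filter_relationships(rels, valid))
```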

9. The method of claim 1,
wherein step (f) comprises:
generating a local 3D scene graph by extracting partial object properties and relationships between object pairs from the input image frame to form a 2D version of the scene graph, and then converting the 2D properties into 3D properties; and
updating the global 3D scene graph by merging the local 3D scene graph into the current global 3D scene graph.

10. The method of claim 9,
wherein updating the global 3D scene graph comprises
detecting identical nodes based on label similarity, position similarity, and color similarity between an original node of the global 3D scene graph and a candidate node of a newly input image frame.
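The same-node detection of claim 10 can be illustrated with the Python sketch below. The claim names the three similarities but not how they are computed or combined, so the following are assumptions: label similarity is exact string equality, position similarity is exp(-distance / scale), color similarity is the intersection of normalized color histograms, and a candidate matches only when all three pass hypothetical thresholds. The node fields `id`, `label`, `position`, and `color_hist` are illustrative names.

```python
import math


def position_similarity(p, q, scale=1.0):
    """Map Euclidean distance to (0, 1]; identical positions give 1.0."""
    return math.exp(-math.dist(p, q) / scale)


def color_similarity(hist_a, hist_b):
    """Histogram intersection of two normalized color histograms."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


def find_same_node(candidate, global_nodes, pos_threshold=0.5, color_threshold=0.5):
    """Return the id of the best-matching global node, or None if there is none."""
    best_id, best_score = None, -1.0
    for node in global_nodes:
        if node["label"] != candidate["label"]:      # label similarity (exact match)
            continue
        pos_sim = position_similarity(node["position"], candidate["position"])
        col_sim = color_similarity(node["color_hist"], candidate["color_hist"])
        if pos_sim < pos_threshold or col_sim < color_threshold:
            continue
        score = pos_sim + col_sim                     # assumed ranking rule
        if score > best_score:
            best_id, best_score = node["id"], score
    return best_id


if __name__ == "__main__":
    nodes = [
        {"id": 1, "label": "chair", "position": (0.0, 0.0, 0.0), "color_hist": [0.5, 0.5, 0.0]},
        {"id": 2, "label": "chair", "position": (5.0, 0.0, 0.0), "color_hist": [0.5, 0.5, 0.0]},
    ]
    seen = {"label": "chair", "position": (0.1, 0.0, 0.0), "color_hist": [0.6, 0.4, 0.0]}
    print(find_same_node(seen, nodes))
```

When a match is found, the candidate would be merged into the existing node; otherwise it would be added to the global graph as a new node.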

11. The method of claim 9,
wherein the global 3D scene graph includes nodes representing recognized objects and edges representing relationships between pairs of objects.

12. The method of claim 11,
wherein the node includes:
an identification number, which is a number uniquely assigned to a recognized object in the corresponding environment; and
a semantic label indicating the category of the object,
and the edge includes at least one of:
a behavioral relationship, in which one object acts on another;
a spatial relationship, such as distance and relative position;
a descriptive relationship, which describes the state of one object in relation to another;
a prepositional relationship, which is a semantic relationship between two objects expressed as a preposition; and
a comparative relationship, which expresses the relative properties of one object compared with another.

13. The method of claim 12,
wherein the master-slave relationship between a pair of objects is indicated by the direction of an arrow in the global 3D scene graph.

14. The method of claim 12,
wherein the node further includes physical properties and visual properties of the object.
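The node and edge contents recited in claims 11 to 14 map naturally onto a small directed-graph data structure. The Python sketch below is one hypothetical layout: the class and field names, the use of dictionaries for physical and visual properties, and the convention that an edge points from subject to object (standing in for the arrow direction of claim 13) are assumptions, not the patent's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum


class RelationType(Enum):
    BEHAVIORAL = "behavioral"
    SPATIAL = "spatial"
    DESCRIPTIVE = "descriptive"
    PREPOSITIONAL = "prepositional"
    COMPARATIVE = "comparative"


@dataclass
class Node:
    node_id: int                                   # unique within the environment
    label: str                                     # semantic category of the object
    physical: dict = field(default_factory=dict)   # e.g. size, 3D position
    visual: dict = field(default_factory=dict)     # e.g. color histogram


@dataclass(frozen=True)
class Edge:
    subject_id: int        # tail of the arrow
    object_id: int         # head of the arrow
    predicate: str
    relation_type: RelationType


class SceneGraph:
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, edge):
        # Both endpoints must already be recognized objects in the graph.
        for endpoint in (edge.subject_id, edge.object_id):
            if endpoint not in self.nodes:
                raise KeyError(f"unknown node id: {endpoint}")
        self.edges.append(edge)

    def outgoing(self, node_id):
        """Edges whose arrow starts at the given node."""
        return [e for e in self.edges if e.subject_id == node_id]


if __name__ == "__main__":
    graph = SceneGraph()
    graph.add_node(Node(1, "cup", physical={"position": (0.2, 0.1, 0.8)}))
    graph.add_node(Node(2, "table"))
    graph.add_edge(Edge(1, 2, "on", RelationType.SPATIAL))
    print(graph.outgoing(1))
```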

15. An apparatus for generating a 3D scene graph, the apparatus comprising:
an image preprocessing module that performs pre-processing on a series of input image frames;
an image removal module that removes, from among the pre-processed image frames, an image frame having a blurriness exceeding a reference value;
a keyframe group extraction module that extracts a keyframe group by classifying a class of each pre-processed image frame;
an error relationship removal module that detects and removes an erroneous relationship with respect to the keyframe group; and
a graph generation module that generates a 3D scene graph based on the frames from which the erroneous relationship has been removed,
wherein the graph generation module examines the relationships detected between object pairs across the multiple frames included in the keyframe group and includes the most frequently occurring relationship in the scene graph.

16. The apparatus of claim 15,
wherein the image removal module comprises:
means for calculating a variance of the Laplacian to obtain a blurriness V;
means for calculating an exponential moving average St of V;
means for performing bias correction on St;
means for determining a threshold value based on the bias-corrected exponential moving average; and
means for determining an image frame to be removed by comparing V with the threshold value.

17. The apparatus of claim 16,
wherein the class includes a keyframe, an anchor frame, and a garbage frame,
the keyframe is used as a reference frame to determine the coverage of the first anchor frame and the keyframe group,
the anchor frame can be activated or deactivated, and the activated anchor frame is used to determine the next anchor frame, and
the garbage frame is a frame that is discarded.

18. The apparatus of claim 17,
wherein the keyframe group extraction module comprises:
means for calculating a first value that is an overlap ratio between the current keyframe and a newly input image frame; and
means for determining the newly input image frame to be a next keyframe if the first value is less than or equal to a first reference value.

19. The apparatus of claim 18,
wherein the keyframe group extraction module further comprises:
means for calculating a second value that is an overlap ratio between the currently activated anchor frame and the newly input image frame when the first value exceeds the first reference value;
means for determining the newly input image frame to be a new anchor frame if the second value is less than or equal to a second reference value; and
means for classifying the newly input image frame as the garbage frame and discarding it if the second value exceeds the second reference value.

20. The apparatus of claim 19,
further comprising a pose estimation module that extracts a pose based on the pre-processed image frames, wherein the pose estimation module extracts relative poses between neighboring image frames using visual odometry or simultaneous localization and mapping (SLAM).

21. The apparatus of claim 15,
further comprising a neural network that includes:
a region proposal unit that extracts and proposes object regions corresponding to the image frame;
an object recognition unit that recognizes objects included in the object regions; and
a relation extraction unit that extracts relationships between pairs of the recognized objects.

22. The apparatus of claim 21,
wherein the error relationship removal module comprises:
a duplicate object region removal unit that identifies valid object regions by removing duplicate object regions from among the extracted object regions;
an irrelevant object removal unit that identifies valid objects by removing predefined irrelevant objects from among the objects recognized in the identified object regions; and
an object relationship filtering unit that filters the object relationships between the identified objects.

23. The apparatus of claim 15,
wherein the graph generation module comprises:
a local graph generation unit that generates a local 3D scene graph by extracting partial object properties and relationships between object pairs from the input image frame to form a 2D version of the scene graph, and then converting the 2D properties into 3D properties; and
a global graph generation unit that updates the global 3D scene graph by merging the local 3D scene graph into the current global 3D scene graph.

24. The apparatus of claim 23,
wherein the global graph generation unit detects identical nodes based on label similarity, position similarity, and color similarity between an original node of the global 3D scene graph and a candidate node of a newly input image frame.

25. The apparatus of claim 23,
wherein the global 3D scene graph includes nodes representing recognized objects and edges representing relationships between pairs of objects,
the node includes at least one of:
an identification number, which is a number uniquely assigned to a recognized object in the corresponding environment;
a semantic label indicating the category of the object;
physical properties of the object; and
visual properties of the object,
the edge includes at least one of:
a behavioral relationship, in which one object acts on another;
a spatial relationship, such as distance and relative position;
a descriptive relationship, which describes the state of one object in relation to another;
a prepositional relationship, which is a semantic relationship between two objects expressed as a preposition; and
a comparative relationship, which expresses the relative properties of one object compared with another,
and the master-slave relationship between a pair of objects is indicated by the direction of an arrow in the global 3D scene graph.