KR20240121550A

KR20240121550A - Device, method and computer program for 3d pose estimation

Info

Publication number: KR20240121550A
Application number: KR1020230014339A
Authority: KR
Inventors: 이재영; 김이길
Original assignee: 주식회사 케이티
Priority date: 2023-02-02
Filing date: 2023-02-02
Publication date: 2024-08-09

Abstract

A 3D pose estimation device according to an embodiment of the present invention includes: a receiving unit that receives an image including a person; a 2D pose estimation unit that analyzes the image to extract the types and positions of joints according to the pose of the person, and generates a skeleton graph based on the extracted joint types and positions; and a 3D pose estimation unit that estimates a 3D pose corresponding to the skeleton graph using an adjacency matrix, wherein a different adjacency matrix is used for each repeated step in the process of deriving the k-th updated feature matrix (0<k<K, where K is a natural number of 2 or more).

Description

3D pose estimation device, method and computer program {DEVICE, METHOD AND COMPUTER PROGRAM FOR 3D POSE ESTIMATION}

The present invention relates to a device, a method, and a computer program for analyzing an input image and estimating the pose of a person in the image.

Computer vision is a field of artificial intelligence that studies how to reproduce general human visual perception with computers. Within this field, pose estimation recognizes the posture or bearing of a person in an image, and its main goal is to locate the person's joints.

3D human pose estimation (HPE) aims to derive the 3D positions of human body joints in the camera coordinate system from a single image or from multiple images such as a video. 3D pose estimation plays an important role in fields such as human-computer interaction (HCI), gesture recognition, and human action recognition.

However, when a person in the image is partially hidden by another person or object and occlusion occurs, the joint positions are predicted with low probability values, which makes accurate pose estimation difficult.

Recently, 3D pose estimation has come to outperform conventional methods thanks to advances in deep neural networks. Earlier 3D pose estimation methods either applied a convolutional neural network to the image in an end-to-end pipeline or used a two-stage pipeline.

More recently, graph convolutional networks (GCNs) have come into wide use: the joint positions of a person in an image are represented as a skeleton graph, and convolution is applied to that graph to estimate the 3D pose.

However, when a conventional graph convolutional network updates the features of each node in the graph, it learns mainly from the features of the nodes adjacent to a given node. Long-range dependencies on relatively distant nodes are not modeled sufficiently, which reduces the accuracy of 3D pose estimation.

Korean Registered Patent Publication No. 10-1812379 (registered December 19, 2017)

The present invention is intended to solve the above-described problems of the prior art by providing a 3D pose estimation device, method, and computer program capable of estimating not only the 2D pose but also the 3D pose of a person shown in an image.

In addition, the present invention aims to provide a 3D pose estimation device, method, and computer program that improve the accuracy of pose estimation by efficiently reflecting long-range dependencies when analyzing the pose of a person shown in an image and estimating the 3D pose.

However, the technical problems to be solved by the present embodiment are not limited to those described above, and other technical problems may exist.

As a means for solving the above-described technical problems, one embodiment of the present invention may provide a 3D pose estimation device including: a receiving unit that receives an image including a person; a 2D pose estimation unit that analyzes the image to extract the types and positions of joints according to the pose of the person, and generates a skeleton graph based on the extracted joint types and positions; and a 3D pose estimation unit that estimates a 3D pose corresponding to the skeleton graph using an adjacency matrix, wherein a different adjacency matrix is used for each repeated step in the process of deriving the k-th updated feature matrix (0<k<K, where K is a natural number of 2 or more).

Another embodiment of the present invention may provide a 3D pose estimation method performed by a 3D pose estimation device, the method including: a receiving step of receiving an image including a person; a 2D pose estimation step of analyzing the image to extract the types and positions of joints according to the pose of the person, and generating a skeleton graph based on the extracted joint types and positions; and a 3D pose estimation step of estimating a 3D pose corresponding to the skeleton graph using an adjacency matrix, wherein a different adjacency matrix may be used for each repeated step in the process of deriving the k-th updated feature matrix (0<k<K, where K is a natural number of 2 or more).

Yet another embodiment of the present invention may provide a computer program stored on a computer-readable recording medium and including a sequence of instructions for estimating a 3D pose from an image. The sequence of instructions causes an image including a person to be received; causes the image to be analyzed to extract the types and positions of joints according to the pose of the person, and a skeleton graph to be generated based on the extracted joint types and positions, as 2D pose estimation; and causes a 3D pose corresponding to the skeleton graph to be estimated using an adjacency matrix, such that a different adjacency matrix is used for each repeated step in the process of deriving the k-th updated feature matrix (0<k<K, where K is a natural number of 2 or more).

The above-described means for solving the problems are merely exemplary and should not be construed as limiting the present invention. In addition to the exemplary embodiments described above, there may be additional embodiments described in the drawings and in the detailed description of the invention.

According to any one of the above-described means for solving the problems, a 3D pose estimation device, method, and computer program capable of estimating not only the 2D pose but also the 3D pose of a person shown in an image can be provided.

In addition, the present invention can provide a 3D pose estimation device, method, and computer program that improve the accuracy of pose estimation by efficiently reflecting long-range dependencies when analyzing the pose of a person shown in an image and estimating the 3D pose.

FIG. 1 is a configuration diagram of a 3D pose estimation device according to one embodiment of the present invention.
FIGS. 2A to 2C are drawings illustrating a skeleton graph according to one embodiment of the present invention.
FIGS. 3A to 3C are drawings illustrating the configuration of an adjacency matrix according to one embodiment of the present invention.
FIG. 4 is a drawing illustrating the configuration of a lifting network according to one embodiment of the present invention.
FIG. 5 is a drawing illustrating a 3D pose estimated from an image according to one embodiment of the present invention.
FIG. 6 is a flowchart of a 3D pose estimation method according to one embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the attached drawings so that those of ordinary skill in the art to which the present invention pertains can easily practice it. However, the present invention may be implemented in various different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted in order to describe the present invention clearly, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between. Also, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them, and it should be understood as not precluding the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In this specification, a "unit" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. In addition, one unit may be realized using two or more pieces of hardware, and two or more units may be realized by one piece of hardware.

Some of the operations or functions described in this specification as being performed by a terminal or device may instead be performed by a server connected to that terminal or device. Likewise, some of the operations or functions described as being performed by a server may be performed by a terminal or device connected to that server.

Hereinafter, an embodiment of the present invention is described in detail with reference to the attached drawings.

FIG. 1 is a configuration diagram of a 3D pose estimation device according to one embodiment of the present invention. Referring to FIG. 1, the 3D pose estimation device (200) may include a receiving unit (210), a 2D pose estimation unit (220), and a 3D pose estimation unit (230). However, these components (210, 220, 230) are shown only as examples of components that can be controlled by the 3D pose estimation device (200).

The components of the 3D pose estimation device (200) of FIG. 1 are generally connected through a network. For example, as shown in FIG. 1, the receiving unit (210), the 2D pose estimation unit (220), and the 3D pose estimation unit (230) may be connected simultaneously or at time intervals.

A network is a connection structure that allows nodes such as terminals and servers to exchange information with one another, and includes a local area network (LAN), a wide area network (WAN), the Internet (WWW: World Wide Web), wired and wireless data communication networks, telephone networks, and wired and wireless television networks. Examples of wireless data communication networks include, but are not limited to, 3G, 4G, 5G, 3GPP (3rd Generation Partnership Project), LTE (Long Term Evolution), WiMAX (Worldwide Interoperability for Microwave Access), Wi-Fi, Bluetooth communication, infrared communication, ultrasonic communication, visible light communication (VLC), and LiFi.

The receiving unit (210) of the 3D pose estimation device (200) may receive an image (100) including a person. The image (100) consists of a plurality of pixels, and each pixel may have characteristics determined by its RGB ratio. The image (100) may be a single image or one of a plurality of frames in a video played back in time series. When the 3D pose estimation device (200) receives consecutive images, it can estimate the person's behavior from the changes in the 3D pose.

A person may be shown in the image (100) received by the 3D pose estimation device (200). The 3D pose estimation device (200) may analyze the position and shape of the person in the image (100) to estimate a 2D pose, and then analyze the estimated 2D pose to output the estimated 3D pose (300) as the final result.

The 2D pose estimation unit (220) may analyze the image to extract the types and positions of joints according to the pose of the person in the image, and generate a skeleton graph based on the extracted joint types and positions.

When the 2D pose estimation unit (220) estimates a 2D pose from the input image, various previously published techniques may be used. For example, a bottom-up method may be used, in which all human joints in the image are estimated first and then grouped into a specific pose or the pose of a single person; or a top-down method may be used, in which a person detector first extracts each person from the image and the joints are then estimated within each extracted person.

As another example, 2D pose estimation may use a classical hand-crafted approach, in which engineers directly design the feature extraction for each body part, or the deep-learning-based 2D pose estimation that has recently come into favor. Deep-learning-based 2D pose estimation may derive a bounding box corresponding to the region where a person is located in the image, and then find the keypoints of the person or object within the bounding box. Here, a keypoint may refer to joint information such as a person's elbow, knee, or wrist. Representative deep-learning-based 2D pose estimation models that may be used include OpenPose, CPN, AlphaPose, and HRNet.

As the result of estimating the 2D pose from the image, the 2D pose estimation unit (220) may output joint information corresponding to the keypoints of the person in the image. The joint information may include the coordinates at which each joint is located in the image or information on whether the joint is visible in the image, and this joint information may be used as the feature information of each node in the skeleton graph.

Here, the skeleton graph is a graph in which the main joints needed for pose estimation among the joints of the body's skeletal structure (for example, head, shoulder, elbow, hand, pelvis, knee, and foot) are set as nodes, and the connections between joints that are linked to each other in the human body are set as edges.

FIGS. 2A to 2C are drawings illustrating a skeleton graph according to one embodiment of the present invention. FIGS. 2A to 2C show a graph of 17 nodes, numbered 0 to 16, in which each node may correspond to a major joint of the human body. For example, node 0 may correspond to the pelvis, nodes 1 and 4 may correspond to the left and right hip joints, respectively, and node 7 may correspond to the waist. The node numbers may be assigned in the order in which the nodes are visited by a depth-first search (DFS) of the tree or graph rooted at the node corresponding to the pelvis. However, this numbering is only an example chosen for convenience of explanation, and different node numbers may be assigned depending on the designer's intention or the model training environment.
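The numbering convention described above can be checked on a concrete graph. The following is a minimal sketch, assuming a 17-joint edge list (`EDGES`) that is consistent with the node roles named here but is not read from the figures; `build_neighbors` and `dfs_order` are hypothetical helper names. With this assumed edge list, a depth-first traversal from the pelvis (node 0) that visits neighbors in ascending order reproduces the numbering 0 to 16.

```python
# Assumed edge list for a 17-joint skeleton (illustrative; not read from the figures).
EDGES = [
    (0, 1), (1, 2), (2, 3),             # pelvis -> hip -> rest of one leg
    (0, 4), (4, 5), (5, 6),             # pelvis -> hip -> rest of the other leg
    (0, 7), (7, 8), (8, 9), (9, 10),    # pelvis -> waist -> upper body
    (8, 11), (11, 12), (12, 13),        # one arm
    (8, 14), (14, 15), (15, 16),        # the other arm
]


def build_neighbors(num_nodes, edges):
    """Undirected adjacency list: neighbors[n] is the sorted list of nodes joined to n."""
    neighbors = [[] for _ in range(num_nodes)]
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return [sorted(ns) for ns in neighbors]


def dfs_order(neighbors, root):
    """Nodes in the order a depth-first search from root first visits them."""
    order, seen = [], set()

    def visit(node):
        seen.add(node)
        order.append(node)
        for nxt in neighbors[node]:
            if nxt not in seen:
                visit(nxt)

    visit(root)
    return order


NEIGHBORS = build_neighbors(17, EDGES)
print(NEIGHBORS[0])                                  # [1, 4, 7]
print(dfs_order(NEIGHBORS, 0) == list(range(17)))    # True for this assumed edge list
```

A different edge list or a different neighbor ordering would give a different numbering, which matches the remark that the numbering is only one possible choice.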

FIG. 2A shows the nodes that are 1 hop away from the reference node 0 (nodes 1, 4, and 7), FIG. 2B shows the nodes that are 2 hops away from node 0 (nodes 2, 5, and 8), and FIG. 2C shows the nodes that are 3 hops away from node 0 (nodes 3, 6, 9, 11, and 14).
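The hop sets listed for FIGS. 2A to 2C can be computed with a breadth-first search. The sketch below assumes the same illustrative 17-joint edge list (`EDGES`), which is not taken from the figures; `hop_distances` and `nodes_at_hop` are hypothetical helper names.

```python
from collections import deque

# Assumed 17-joint edge list (illustrative; chosen to agree with the hop sets in the text).
EDGES = [
    (0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6),
    (0, 7), (7, 8), (8, 9), (9, 10),
    (8, 11), (11, 12), (12, 13), (8, 14), (14, 15), (15, 16),
]


def hop_distances(num_nodes, edges, source):
    """Breadth-first search: dist[n] is the number of edges from source to n."""
    neighbors = [[] for _ in range(num_nodes)]
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    dist = [-1] * num_nodes          # -1 marks "not reached yet"
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def nodes_at_hop(dist, k):
    """All nodes exactly k hops from the source, in ascending order."""
    return [n for n, d in enumerate(dist) if d == k]


dist_from_pelvis = hop_distances(17, EDGES, source=0)
print(nodes_at_hop(dist_from_pelvis, 1))  # [1, 4, 7]
print(nodes_at_hop(dist_from_pelvis, 2))  # [2, 5, 8]
print(nodes_at_hop(dist_from_pelvis, 3))  # [3, 6, 9, 11, 14]
```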

Returning to FIG. 1, the 3D pose estimation unit (230) may estimate the 3D pose corresponding to the skeleton graph output by the 2D pose estimation unit (220) using an adjacency matrix, a feature matrix, and a weight matrix. Here, the adjacency matrix is based on the connections between nodes in the skeleton graph, the feature matrix is based on the features held by each node of the skeleton graph, and the weight matrix may be based on the filter values applied to the node features.

More specifically, when the skeleton graph has N nodes, the adjacency matrix may be an N x N matrix whose elements are 0 or 1. The feature matrix contains the feature information of each node, and may include the image coordinates of each joint estimated by the 2D pose estimation unit (220) or information on whether those coordinates are visible in the image. When the feature dimension of each node, that is, the number of features, is D, the feature matrix may have a size of D x N. The weight matrix is a matrix of weights learnable by the convolutional neural network; assuming that the feature dimension is transformed from D to D', it may have a size of D' x D.
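The stated sizes can be sanity-checked with plain nested lists. This is a sketch under assumptions: the joint detections, the choice of (x, y, visible) as the D = 3 features, the placeholder weights, and the helper names are all hypothetical. It also assumes the product order W · H · A, which is the order in which the three stated sizes chain together.

```python
def build_feature_matrix(joints):
    """joints[n] = (x, y, visible) for node n. Returns a D x N matrix with D = 3."""
    return [[float(joint[d]) for joint in joints] for d in range(3)]


def matmul(a, b):
    """Plain matrix product of two lists of rows; inner sizes must agree."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def shape(m):
    return (len(m), len(m[0]))


# Hypothetical 2D detections for N = 4 joints: (x, y, visible)
joints = [(120, 80, True), (118, 140, True), (90, 150, False), (150, 150, True)]
N, D, D_OUT = len(joints), 3, 5

H = build_feature_matrix(joints)          # D x N
W = [[0.1] * D for _ in range(D_OUT)]     # D' x D, placeholder weights
A = [[0] * N for _ in range(N)]           # N x N, placeholder adjacency

out = matmul(matmul(W, H), A)             # (D' x D)(D x N)(N x N) -> D' x N
print(shape(H), shape(W), shape(A), shape(out))  # (3, 4) (5, 3) (4, 4) (5, 4)
```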

A conventional graph convolutional network uses an adjacency matrix to reflect the relationships between a given node and the nodes connected to it when updating the features of each node. The adjacency matrix has as many rows and columns as there are nodes and represents the structure of the graph: if node i and node j are connected by an edge, the element in row i, column j is set to 1, and otherwise it is set to 0. The adjacency matrix used in a conventional graph convolutional network also reflects the self-connection of each node to itself by setting the diagonal elements to 1. As a result, when the features of a node are updated, the features of the node itself and the features of the nodes adjacent to it are reflected together. Here, an adjacent node is a node that shares an edge with the reference node, that is, a node 1 hop away from the reference node. In addition, when the edges of the graph are undirected, the adjacency matrix is a symmetric matrix, mirrored about its diagonal.
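As a small illustration of the conventional adjacency matrix just described, the sketch below builds one for a hypothetical 4-node chain 0-1-2-3 (not the skeleton of the figures). The helper names are assumptions made for this example.

```python
def adjacency_with_self_loops(num_nodes, edges):
    """0/1 adjacency matrix of an undirected graph, with the diagonal set to 1."""
    a = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        a[i][j] = 1      # edge i -> j
        a[j][i] = 1      # undirected, so mirror it
    for i in range(num_nodes):
        a[i][i] = 1      # self-connection
    return a


def is_symmetric(m):
    return all(m[i][j] == m[j][i] for i in range(len(m)) for j in range(len(m)))


A = adjacency_with_self_loops(4, [(0, 1), (1, 2), (2, 3)])
for row in A:
    print(row)
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [0, 1, 1, 1]
# [0, 0, 1, 1]
print(is_symmetric(A))  # True
```

Row 0 has no entry for nodes 2 and 3, so a single update with this matrix cannot draw on their features directly.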

However, if the features are updated using only the information of the reference node and its adjacent nodes as described above, the information of nodes at greater hop distances cannot be reflected properly. This causes a long-range dependency problem and reduces the accuracy of pose estimation.

More specifically, a conventional graph convolutional network performs the feature update using the following formula.

<Formula 1>
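The expression below is a reconstruction of Formula 1 from the symbol definitions that follow and from the matrix sizes given earlier (the weight matrix is D' x D, the feature matrix is D x N, and the adjacency matrix is N x N). It is a consistent reading rather than a verbatim copy. The second line shows the usual symmetric normalization with self-connections, where the diagonal degree matrix Λ of A + I is notation assumed here.

```latex
H' = \sigma\left( W \, H \, \tilde{A} \right),
\qquad
\tilde{A} = \Lambda^{-1/2} \, (A + I) \, \Lambda^{-1/2}
```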

Here, H' denotes the updated feature matrix, σ denotes an activation function such as ReLU, W denotes the learnable weight matrix, and Ã denotes the adjacency matrix of the skeleton graph after it has been symmetrically normalized to include self-connections. H denotes the feature matrix before the update.

As described above, when the feature matrix is updated using an adjacency matrix that has been symmetrically normalized to include self-connections, the weight matrix W is shared identically across all nodes, which makes it difficult to learn diverse relational patterns.

To solve this problem, when updating the feature matrix, the 3D pose estimation unit (230) may use a different adjacency matrix for each repeated step in the process of deriving the k-th updated feature matrix (0<k<K, where K is a natural number of 2 or more that the designer may set arbitrarily, and k is a natural number smaller than K).

More specifically, when the k-th updated feature matrix is obtained, the adjacency matrix may be set so that it has the value 1 only for nodes that are exactly k hops apart on the skeleton graph. This adjacency matrix is a symmetric matrix whose diagonal elements are 0, and it may be normalized.
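One way to build such k-hop adjacency matrices is to compute all pairwise hop distances and then threshold them. The sketch below does this for a hypothetical 5-node chain 0-1-2-3-4; the helper names are assumptions, and normalization is left out because the text does not specify which normalization is applied.

```python
from collections import deque


def hop_distance_matrix(num_nodes, edges):
    """dist[i][j] = number of edges on the shortest path from i to j (-1 if unreachable)."""
    neighbors = [[] for _ in range(num_nodes)]
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    dist = [[-1] * num_nodes for _ in range(num_nodes)]
    for s in range(num_nodes):
        dist[s][s] = 0
        queue = deque([s])
        while queue:                      # breadth-first search from s
            u = queue.popleft()
            for v in neighbors[u]:
                if dist[s][v] < 0:
                    dist[s][v] = dist[s][u] + 1
                    queue.append(v)
    return dist


def k_hop_adjacency(dist, k):
    """1 where two nodes are exactly k hops apart, 0 elsewhere (k >= 1 gives a zero diagonal)."""
    n = len(dist)
    return [[1 if dist[i][j] == k else 0 for j in range(n)] for i in range(n)]


dist = hop_distance_matrix(5, [(0, 1), (1, 2), (2, 3), (3, 4)])
A1 = k_hop_adjacency(dist, 1)
A2 = k_hop_adjacency(dist, 2)
print(A1[0])  # [0, 1, 0, 0, 0]
print(A2[0])  # [0, 0, 1, 0, 0]
print(A2[2])  # [1, 0, 0, 0, 1]
```

Because hop distance is symmetric on an undirected graph, each matrix built this way is symmetric, and a node is never k hops from itself for k of 1 or more, so the diagonal stays 0.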

FIGS. 3A to 3C are drawings illustrating the configuration of an adjacency matrix according to one embodiment of the present invention. Referring first to FIG. 3A, the configuration of the adjacency matrix Ã₁ can be seen, which is used when k is 1, that is, when the first updated feature matrix is obtained. The configuration of Ã₁ reflects the connections between the nodes of the graph in FIGS. 2A to 2C described above.

More specifically, in the label Ã₁(n) of each row in FIG. 3A, n may denote the joint number, and the elements of that row may indicate the neighbor relationship with the joints corresponding to each column number. For example, Ã₁(0) may indicate the neighboring joints 1 hop away from joint 0. Accordingly, in the row corresponding to Ã₁(0) in FIG. 3A, the elements in columns 1, 4, and 7 may be set to 1, and the remaining elements may be set to 0 because the corresponding joints are not 1-hop neighbors.

Likewise, Ã₁(1) in FIG. 3A may indicate the neighboring joints 1 hop away from joint 1. Accordingly, in the row corresponding to Ã₁(1) in FIG. 3A, the elements in columns 0 and 2 may be set to 1, and the remaining elements may be set to 0 because the corresponding joints are not 1-hop neighbors.

As described above, in the adjacency matrix used according to the embodiment of the present invention, when the k-th feature matrix is updated, only the entries for nodes connected at a distance of k hops are set to 1 and the entries for all other nodes are set to 0, and the diagonal elements are also 0. For convenience of explanation, FIGS. 3A to 3C show only the first and second rows of each adjacency matrix, and the remaining rows are omitted.

FIG. 3B illustrates the adjacency matrix Ã₂ used when k is 2. In the adjacency matrix of FIG. 3B, the entries for nodes 2, 5, and 8, which are 2 hops from node 0 in the graph of FIG. 2B, are set to 1, and the entries for nodes 1, 4, and 7, which are 1 hop away, are set to 0.

FIG. 3C illustrates the adjacency matrix used when k is 3. In the adjacency matrix of FIG. 3C, the values for nodes 3, 6, 9, 11, and 14, which are located 3 hops away from node 0 on the graph of FIG. 2C, are set to 1, and the values for nodes located 1 or 2 hops away are set to 0.
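The construction of the per-hop adjacency matrices described for FIGS. 3A to 3C can be sketched as follows. This is a minimal illustration, not the patent's implementation: the 17-joint edge list `EDGES` and the helper names are assumptions, chosen only so that the neighbor sets of joints 0 and 1 match those stated in the text. Hop distances are found by breadth-first search, and an entry is 1 only when two joints are exactly k hops apart.

```python
from collections import deque

# Assumed 17-joint skeleton (hypothetical edge list, chosen to match the
# neighbor sets given in the text for joints 0 and 1).
EDGES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6), (0, 7), (7, 8),
         (8, 9), (9, 10), (8, 11), (11, 12), (12, 13), (8, 14), (14, 15), (15, 16)]


def hop_distances(num_nodes, edges):
    """Return dist[i][j], the number of hops between joints i and j (BFS)."""
    neighbors = [[] for _ in range(num_nodes)]
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    dist = [[-1] * num_nodes for _ in range(num_nodes)]
    for start in range(num_nodes):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if dist[start][nxt] < 0:
                    dist[start][nxt] = dist[start][node] + 1
                    queue.append(nxt)
    return dist


def k_hop_adjacency(num_nodes, edges, k):
    """Entry (i, j) is 1 only if joints i and j are exactly k hops apart."""
    dist = hop_distances(num_nodes, edges)
    return [[1 if dist[i][j] == k else 0 for j in range(num_nodes)]
            for i in range(num_nodes)]


def ones_in_row(matrix, row):
    """Column indices whose value is 1 in the given row."""
    return [j for j, value in enumerate(matrix[row]) if value == 1]


A1 = k_hop_adjacency(17, EDGES, 1)
A2 = k_hop_adjacency(17, EDGES, 2)
A3 = k_hop_adjacency(17, EDGES, 3)
print(ones_in_row(A1, 0))  # [1, 4, 7]
print(ones_in_row(A1, 1))  # [0, 2]
print(ones_in_row(A2, 0))  # [2, 5, 8]
print(ones_in_row(A3, 0))  # [3, 6, 9, 11, 14]
```

Because a joint is 0 hops from itself, the diagonal is 0 for every k of 1 or more, and the matrices are symmetric because hop distance is symmetric.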

Because the adjacency matrix used in the embodiment of the present invention is a symmetric matrix whose diagonal elements are 0, the self-connection of a node cannot be reflected when the feature matrix is updated. To additionally reflect each node's own features in the calculation, the updated feature matrix needs to be calculated as in <Formula 2>.

<Formula 2>

Here, H' denotes the updated feature matrix, σ denotes the activation function, W(0) denotes the weight matrix corresponding to the node itself, W(1) denotes the weight matrix corresponding to the neighbor transformation, H denotes the feature matrix before the update, and the remaining matrix in the formula denotes the adjacency matrix described above. That is, this matrix may be a normalized symmetric matrix whose diagonal elements are 0. As in <Formula 2>, the self-connection corresponding to the diagonal of the adjacency matrix is separated from the connections between adjacent nodes, and the feature matrix is calculated for each and then summed, which can further improve the accuracy of 3D pose estimation.
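The separation of the self-connection from the neighbor connections can be illustrated with a small sketch. The formula image is not reproduced in this text, so the form used here, H' = σ(W(0)·H + W(1)·H·A) with ReLU as σ and symmetric degree normalization for A, is an assumption that is merely consistent with the symbol definitions above. All function names are hypothetical, and features are stored as a (channels × nodes) matrix.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def sym_normalize(adj):
    """Assumed normalization: adj[i][j] / sqrt(deg_i * deg_j); diagonal stays 0."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0
             for j in range(n)] for i in range(n)]


def update_features(h, w_self, w_neigh, adj_norm):
    """Assumed form of the update: relu(w_self @ h + w_neigh @ h @ adj_norm)."""
    self_term = matmul(w_self, h)                      # node's own features
    neigh_term = matmul(matmul(w_neigh, h), adj_norm)  # features from neighbors
    return [[max(0.0, s + n) for s, n in zip(s_row, n_row)]
            for s_row, n_row in zip(self_term, neigh_term)]


# Three joints in a chain 0-1-2, one feature channel.
chain = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
a_norm = sym_normalize(chain)
h = [[1.0, 2.0, 3.0]]
h_new = update_features(h, [[1.0]], [[1.0]], a_norm)
print(h_new)
```

With the neighbor weight set to zero, the update returns the node's own features unchanged, which is exactly the contribution that a zero-diagonal adjacency matrix alone would lose.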

In the adjacency matrix according to the embodiment of the present invention, only connections between nodes located at a k-hop distance may be set to 1. Because the proposed adjacency matrix sets connections between nodes closer than k hops to 0 when representing the relationships between joints, nodes closer than k hops are given less correlation. This allows long-range dependencies between joints to be reflected efficiently, which can increase the accuracy of 3D pose estimation.

In another embodiment of the present invention, to further improve the accuracy of 3D pose estimation, a modulation method for merging the aggregated features of each hop may be used in addition to the preceding embodiment.

<Formula 3>

Here, H' denotes the updated feature matrix, σ denotes the activation function, and λ_k denotes a learnable modulation matrix for modeling the relationship between the features aggregated at each k-hop distance (the product of W, H, and the k-hop adjacency matrix) and the features merged up to the (k+1)-hop distance (C_{k+1}). W denotes the weight matrix, H denotes the feature matrix, and the remaining matrix in the formula denotes the adjacency matrix for the k-hop distance, that is, the adjacency matrix described above with reference to FIGS. 3A to 3C. C_{k+1} denotes the features merged up to the (k+1)-hop distance.

That is, to reflect long-range dependencies in the learning model more efficiently, the feature representation at the k-hop distance can be made to depend on the merged features C_{k+1} up to the (k+1)-hop distance, and the modulation matrix λ_k may be used for this purpose. When k satisfies 0<k<K and the value of K is defined by the user, C_k denotes the features aggregated up to the k-hop distance. For example, when K is set to 4 so that k ranges from 1 to 3, C_2 can be calculated as a combination of the features corresponding to the 3-hop distance and the features corresponding to the 2-hop distance. In more detail, C_k can be calculated by aggregating the features through a learnable matrix, such as λ_k, that modulates the features calculated for each hop distance. In this way, a joint can be given a smaller weight the farther it is, in hops, from the joint being updated.
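The backward merging of hop features, from the farthest hop down to hop 1, can be sketched as below. The exact form of <Formula 3> is not reproduced in this text, so the combination rule used here (element-wise modulation of the hop-k features by λ_k, followed by addition of C_{k+1}) is purely an assumption for illustration, and `merge_hops` is a hypothetical name. Features are flattened to plain lists for brevity.

```python
def merge_hops(hop_features, modulations):
    """Return [C_1, ..., C_{K-1}] under the assumed rule
    C_k = lambda_k * F_k + C_{k+1}, where F_k are the hop-k features.

    hop_features[k-1] holds the features aggregated at hop k,
    modulations[k-1] holds the (learnable) lambda_k of the same shape.
    """
    merged = [None] * len(hop_features)
    carry = None  # C_{k+1}; nothing to carry for the farthest hop
    for idx in range(len(hop_features) - 1, -1, -1):
        modulated = [lam * f for lam, f in zip(modulations[idx], hop_features[idx])]
        if carry is not None:
            modulated = [m + c for m, c in zip(modulated, carry)]
        merged[idx] = modulated
        carry = modulated
    return merged


# K = 4, so k runs from 1 to 3.
features = [[1.0, 1.0], [2.0, 2.0], [4.0, 4.0]]      # hops 1, 2, 3
lambdas = [[1.0, 1.0], [0.5, 0.5], [0.25, 0.25]]     # smaller weight for farther hops
c1, c2, c3 = merge_hops(features, lambdas)
print(c3)  # [1.0, 1.0]
print(c2)  # [2.0, 2.0]
print(c1)  # [3.0, 3.0]
```

In this example C_2 combines the 3-hop and 2-hop features, matching the K = 4 case described above, and the decreasing λ values illustrate how farther hops can receive smaller weights.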

FIG. 4 is a diagram illustrating the configuration of a lifting network according to an embodiment of the present invention. The lifting network of FIG. 4 may be the model used by the 3D pose estimation unit (230) when estimating a 3D pose from the skeleton graph.

The lifting network includes a plurality of blocks, and each block may include a convolution layer and a batch normalization and ReLU activation function layer (BatchNorm & ReLU layer).

In every block except the last one, the output of the convolution layer may pass through the batch normalization and ReLU activation function layer. In the network structure illustrated in FIG. 4, the number of channels is set to 128, which may be changed according to the designer's intent.

In addition, the lifting network of the present invention may group at least two blocks into one residual block, and the lifting network may include a plurality of such residual blocks.

Furthermore, through a skip connection, the lifting network may combine the input data of a first block among its blocks with the output data of a second block and provide the result to a third block. Here, the second block is the block that follows the first block, and the third block is the block that follows the second block. More specifically, the first block and the second block form one residual block, and the third block may belong to the next residual block.

No additional module is applied after the convolution layer of the last block, which outputs the three-dimensional position of each joint as the 3D pose information.
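The block wiring described above can be sketched without any deep learning library by treating each block as a callable on a feature vector. This is only a structural sketch: `run_lifting_network` is a hypothetical name, the separate `input_block` that brings the input to the working channel width is an assumption (FIG. 4 is not reproduced here), and the toy lambdas stand in for the convolution, batch normalization, and ReLU layers.

```python
def run_lifting_network(x, input_block, residual_pairs, last_block):
    """Apply residual block pairs with skip connections, then the last block.

    Each element of residual_pairs is (first_block, second_block). The input of
    the first block is added to the output of the second block, and the sum is
    passed on to the next block.
    """
    h = input_block(x)  # assumed input layer, not stated in the text
    for first_block, second_block in residual_pairs:
        skip = h                                  # input data of the first block
        out = second_block(first_block(h))        # output data of the second block
        h = [s + o for s, o in zip(skip, out)]    # skip connection
    return last_block(h)  # convolution only: no BatchNorm or ReLU afterwards


# Toy stand-ins for the real layers.
double = lambda v: [2.0 * a for a in v]
plus_one = lambda v: [a + 1.0 for a in v]
total = lambda v: [sum(v)]

result = run_lifting_network([1.0, 2.0], double, [(plus_one, plus_one)], total)
print(result)  # [16.0]
```

Here the input [1.0, 2.0] becomes [2.0, 4.0], the residual pair produces [4.0, 6.0], the skip connection adds the pair's input to give [6.0, 10.0], and the last block sums to 16.0.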

FIG. 5 is a diagram illustrating a 3D pose estimated from an image according to an embodiment of the present invention.

FIG. 5 illustrates the results of 3D pose estimation on a dataset according to an embodiment of the present invention. For each input image (Input), it shows the 3D pose information estimated by the 3D pose estimation device (200) (MM-GCN) alongside the actual values (Ground truth). Comparing the estimated results (MM-GCN) with the ground truth shows that the 3D poses of different people performing various activities are estimated accurately. The Human3.6M dataset may be used as the dataset for measuring the accuracy of the lifting model.

FIG. 6 is a flowchart illustrating a 3D pose estimation method according to an embodiment of the present invention. The 3D pose estimation method shown in FIG. 6 includes steps that are processed in time series according to the embodiments shown in FIGS. 1 to 5. Therefore, even where details are omitted below, the description given for the embodiments shown in FIGS. 1 to 5 also applies to the method of estimating a person's 3D pose in the 3D pose estimation device.

Referring to FIG. 6, the 3D pose estimation method performed by the 3D pose estimation device may include a receiving step (S100), a 2D pose estimation step (S200), and a 3D pose estimation step (S300).

The receiving step (S100) may be a step of receiving an image containing a person. The 2D pose estimation step (S200) may be a step of analyzing the image to extract the types and positions of joints according to the person's pose and generating a skeleton graph based on the types and positions of the extracted joints. The 3D pose estimation step (S300) may be a step of estimating a 3D pose corresponding to the skeleton graph using an adjacency matrix. Here, a different adjacency matrix may be used for each repeated step in the process of deriving the k-th updated feature matrix (0<k<K, where K is a natural number greater than or equal to 2 that the designer can set arbitrarily).

In the above description, steps S100 to S300 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as needed, and the order of the steps may be changed.

The method of estimating a person's 3D pose in the 3D pose estimation device described with reference to FIGS. 1 to 6 may also be implemented in the form of a computer program stored in a computer-readable recording medium and executed by a computer, or in the form of a recording medium containing instructions executable by a computer.

A computer-readable recording medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. A computer-readable recording medium may also include computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The foregoing description of the present invention is provided for illustration, and a person of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing the technical idea or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: Input image
200: 3D pose estimation device
210: Receiving unit
220: 2D pose estimation unit
230: 3D pose estimation unit
300: Estimated 3D pose information

Claims

A 3D pose estimation device comprising:
a receiving unit that receives an image containing a person;
a 2D pose estimation unit that analyzes the image to extract the types and positions of joints according to the person's pose, and generates a skeleton graph based on the types and positions of the extracted joints; and
a 3D pose estimation unit that estimates a 3D pose corresponding to the skeleton graph using an adjacency matrix,
wherein a different adjacency matrix is used for each repeated step in the process of deriving the k-th (0<k<K, where K is a natural number greater than or equal to 2) updated feature matrix.

The 3D pose estimation device of claim 1, wherein the 3D pose estimation unit
estimates the 3D pose corresponding to the skeleton graph by updating a feature matrix using the adjacency matrix together with the feature matrix and a weight matrix,
the adjacency matrix is based on the connection relationships between nodes in the skeleton graph,
the feature matrix is based on the features contained in each node, and
the weight matrix is based on filter values applied to each of the features.

The 3D pose estimation device of claim 2,
wherein the 3D pose estimation unit updates the feature matrix of each node constituting the skeleton graph,
derives an updated feature matrix by multiplying the feature matrix by the adjacency matrix and the weight matrix, and
estimates the 3D pose by repeating a process of obtaining a further updated feature matrix by multiplying the updated feature matrix by the adjacency matrix and the weight matrix.

The 3D pose estimation device of claim 3,
wherein, when the k-th updated feature matrix is obtained, the adjacency matrix has a value of 1 only for nodes that are k hops apart on the skeleton graph.

The 3D pose estimation device of claim 4,
wherein the adjacency matrix is a normalized symmetric matrix whose diagonal elements are 0.

The 3D pose estimation device of claim 4,
wherein the updated feature matrix is calculated by the following formula:

where H' denotes the updated feature matrix,
σ denotes the activation function,
λ_k denotes a learnable modulation matrix for modeling the relationship between the features aggregated at each k-hop distance and the features merged up to the (k+1)-hop distance,
W denotes the weight matrix,
H denotes the feature matrix,
the remaining matrix denotes the adjacency matrix for the k-hop distance, and
C_{k+1} denotes the features merged up to the (k+1)-hop distance.

The 3D pose estimation device of claim 1,
wherein the 3D pose estimation unit estimates the 3D pose from the skeleton graph through a lifting network,
the lifting network comprises a plurality of blocks, and
each of the plurality of blocks includes a convolution layer and a batch normalization and ReLU activation function layer.

The 3D pose estimation device of claim 7,
wherein, in the lifting network,
input data of a first block is combined with output data of a second block and provided to a third block, the second block being the block that follows the first block, and the third block being the block that follows the second block.

The 3D pose estimation device of claim 8,
wherein the last block of the lifting network does not include the batch normalization and ReLU activation function layer and outputs the 3D position of each joint.

A 3D pose estimation method performed by a 3D pose estimation device, the method comprising:
a receiving step of receiving an image containing a person;
a 2D pose estimation step of analyzing the image to extract the types and positions of joints according to the person's pose, and generating a skeleton graph based on the types and positions of the extracted joints; and
a 3D pose estimation step of estimating a 3D pose corresponding to the skeleton graph using an adjacency matrix,
wherein a different adjacency matrix is used for each repeated step in the process of deriving the k-th (0<k<K, where K is a natural number greater than or equal to 2) updated feature matrix.

The 3D pose estimation method of claim 10, wherein the 3D pose estimation step
is a step of estimating the 3D pose corresponding to the skeleton graph by updating a feature matrix using the adjacency matrix together with the feature matrix and a weight matrix,
the adjacency matrix is based on the connection relationships between nodes in the skeleton graph,
the feature matrix is based on the features contained in each node, and
the weight matrix is based on filter values applied to each of the features.

The 3D pose estimation method of claim 11,
wherein the 3D pose estimation step updates the feature matrix of each node constituting the skeleton graph,
derives an updated feature matrix by multiplying the feature matrix by the adjacency matrix and the weight matrix, and
includes a step of estimating the 3D pose by repeating a process of obtaining a further updated feature matrix by multiplying the updated feature matrix by the adjacency matrix and the weight matrix.

The 3D pose estimation method of claim 12,
wherein, when the k-th updated feature matrix is obtained, the adjacency matrix has a value of 1 only for nodes that are k hops apart on the skeleton graph.

The 3D pose estimation method of claim 13,
wherein the adjacency matrix is a normalized symmetric matrix whose diagonal elements are 0.

The 3D pose estimation method of claim 13,
wherein the updated feature matrix is calculated by the following formula:

where H' denotes the updated feature matrix,
σ denotes the activation function,
λ_k denotes a learnable modulation matrix for modeling the relationship between the features aggregated at each k-hop distance and the features merged up to the (k+1)-hop distance,
W denotes the weight matrix,
H denotes the feature matrix,
the remaining matrix denotes the adjacency matrix for the k-hop distance, and
C_{k+1} denotes the features merged up to the (k+1)-hop distance.

The 3D pose estimation method of claim 10,
wherein the 3D pose estimation step is a step of estimating the 3D pose from the skeleton graph through a lifting network,
the lifting network comprises a plurality of blocks, and
each of the plurality of blocks includes a convolution layer and a batch normalization and ReLU activation function layer.

The 3D pose estimation method of claim 16,
wherein, in the lifting network,
input data of a first block is combined with output data of a second block and provided to a third block, the second block being the block that follows the first block, and the third block being the block that follows the second block.

The 3D pose estimation method of claim 16,
wherein the last block of the lifting network does not include the batch normalization and ReLU activation function layer and outputs the 3D position of each joint.

A computer program stored on a computer-readable recording medium and comprising a sequence of instructions for estimating a 3D pose from an image, wherein the sequence of instructions:
receives an image containing a person as input;
performs 2D pose estimation by analyzing the image to extract the types and positions of joints according to the person's pose, and generating a skeleton graph based on the types and positions of the extracted joints;
estimates a 3D pose corresponding to the skeleton graph using an adjacency matrix; and
causes a different adjacency matrix to be used for each repeated step in the process of deriving the k-th (0<k<K, where K is a natural number greater than or equal to 2) updated feature matrix.

The computer program of claim 19,
wherein, when the 3D pose is estimated, the 3D pose corresponding to the skeleton graph is estimated by updating a feature matrix using the adjacency matrix together with the feature matrix and a weight matrix,
the adjacency matrix is based on the connection relationships between nodes in the skeleton graph,
the feature matrix is based on the features contained in each node, and
the weight matrix is based on filter values applied to each of the features.