JP2019016164A - Learning data generation device, estimation device, estimation method, and computer program

Info

Publication number: JP2019016164A
Application number: JP2017133070A
Authority: JP
Inventors: Kazuki Okami; Kota Takeuchi; Ai Isogai; Hideaki Kimata
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-07-06
Filing date: 2017-07-06
Publication date: 2019-01-31

Abstract

To provide a learning data generation device, an estimation device, an estimation method, and a computer program that can more easily generate a plurality of learning data including three-dimensional joint information. SOLUTION: A learning data generation device comprises: a silhouette image rendering unit that acquires, for each viewpoint defined around a three-dimensional model of a subject having joints, a three-dimensional model image that is a computer graphics image representing the three-dimensional model, and generates a silhouette image of the three-dimensional model for each viewpoint by performing rendering processing on the three-dimensional model images; a camera parameter unit that acquires camera parameters for each viewpoint; a shape information restoration unit that restores three-dimensional shape information of the three-dimensional model from the silhouette images of the viewpoints, based on the camera parameters of each viewpoint; and a joint information voxelization unit that generates three-dimensional joint information of the three-dimensional model in the same voxel space as that of the three-dimensional shape information of the three-dimensional model. SELECTED DRAWING: Figure 2

Description

The present invention relates to a learning data generation device, an estimation device, an estimation method, and a computer program.

By applying techniques for measuring the movement of a person's joints, computer graphics characters modeled on people, such as those appearing in movies, can be given lifelike motion. Such techniques are therefore indispensable for improving the overall quality of content. Beyond entertainment, techniques for measuring the movement of a person's joints are used in various fields. In the medical field, for example, joint movement is important information for assessing a patient's condition.

Hereinafter, the number assigned to each joint of the three-dimensional model of the subject is referred to as "joint part information". Information indicating the position of each joint of the three-dimensional model of the subject is referred to as "joint position information". Information consisting of the joint part information and the joint position information is referred to as "joint information".

As described above, joint information is important in various fields. Acquiring it, however, takes a great deal of effort. One technique for acquiring joint information is data acquisition using motion capture. Motion capture requires cumbersome work: the person to be measured must wear a dedicated suit, and the space must be calibrated in advance. Other techniques have their own problems, such as requiring special equipment or being usable only in limited environments.

To address these problems, a technique that uses deep learning to robustly estimate the joint positions of a subject captured in an image has been published in recent years (see Non-Patent Document 1). This technique enables robust estimation even when a plurality of persons are present in the image.

Non-Patent Document 1: L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, B. Schiele, “DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016).

However, with the conventional method, the estimated joint information consists only of two-dimensional joint positions on the image. It is therefore insufficient as joint information for uses such as animation generation. On the other hand, estimating three-dimensional joint information using machine learning such as deep learning requires acquiring a large amount of learning data including three-dimensional joint information in advance, which is difficult.

In view of the above circumstances, an object of the present invention is to provide a learning data generation device, an estimation device, an estimation method, and a computer program that can more easily generate a plurality of learning data including three-dimensional joint information.

One aspect of the present invention is a learning data generation device comprising: a silhouette image rendering unit that acquires, for each viewpoint defined around a three-dimensional model of a subject having joints, a three-dimensional model image that is a computer graphics image representing the three-dimensional model, and generates a silhouette image of the three-dimensional model for each viewpoint by performing rendering processing on the three-dimensional model images; a camera parameter unit that acquires camera parameters for each viewpoint; a shape information restoration unit that restores three-dimensional shape information of the three-dimensional model from the silhouette images of the viewpoints, based on the camera parameters of each viewpoint; and a joint information voxelization unit that generates three-dimensional joint information of the three-dimensional model in the same voxel space as the voxel space of the three-dimensional shape information of the three-dimensional model.

One aspect of the present invention is the learning data generation device described above, further comprising a learning unit that learns parameters of a deep neural network that outputs the joint information according to the shape information, wherein the output layer of the deep neural network has a number of channels corresponding to the number of joints represented by the joint information.

One aspect of the present invention is an estimation device comprising an analysis unit that uses a deep neural network trained to output three-dimensional joint information of a three-dimensional model according to the three-dimensional shape information of the three-dimensional model generated by the learning data generation device described above, and that estimates three-dimensional joint information of a subject relating to the three-dimensional model by inputting three-dimensional shape information of that subject into the deep neural network.

One aspect of the present invention is the learning data generation device described above, further comprising a learning unit that learns parameters of a deep neural network that outputs the joint information according to the shape information, wherein the learning unit generates sets of the shape information by grouping a plurality of pieces of the shape information, and determines, based on a predetermined condition, whether each set of the shape information is to be used for learning the parameters of the deep neural network.

One aspect of the present invention is the estimation device described above, further comprising a storage unit that stores the configuration and parameters of a deep neural network that outputs the joint information according to the shape information, wherein the parameters of the deep neural network are based on the result of learning that used, among sets of the shape information generated by grouping a plurality of pieces of the shape information, those sets that satisfy a predetermined condition.

One aspect of the present invention is an estimation method executed by an estimation device, the method comprising a step of estimating three-dimensional joint information of a subject relating to a three-dimensional model by using a deep neural network trained to output three-dimensional joint information of the three-dimensional model according to the three-dimensional shape information of the three-dimensional model generated by the learning data generation device described above.

One aspect of the present invention is a computer program for causing a computer to function as the learning data generation device described above.

One aspect of the present invention is a computer program for causing a computer to function as the estimation device described above.

According to the present invention, a plurality of learning data including three-dimensional joint information can be generated more easily.

FIG. 1 is a diagram showing an example of the configuration of the estimation system in the first embodiment.
FIG. 2 is a diagram showing an example of the configuration of the learning data generation unit in the first embodiment.
FIG. 3 is a diagram showing an example of the configuration of the learning unit in the first embodiment.
FIG. 4 is a diagram showing an example of the configuration of the deep neural network in the first embodiment.
FIG. 5 is a diagram showing an example of the configuration of the input data generation unit in the first embodiment.
FIG. 6 is a diagram showing an example of the configuration of the analysis unit in the first embodiment.
FIG. 7 is a flowchart showing an example of the operation of the estimation device in the first embodiment.
FIG. 8 is a flowchart showing an example of the operation of the learning data generation unit in the first embodiment.
FIG. 9 is a flowchart showing an example of the operation of the learning unit in the first embodiment.
FIG. 10 is a flowchart showing an example of the operation of the input data generation unit in the first embodiment.
FIG. 11 is a flowchart showing an example of the operation of the analysis unit in the first embodiment.
FIG. 12 is a diagram showing an example of the configuration of the estimation system in the second embodiment.

Embodiments of the present invention will be described in detail with reference to the drawings.
(First embodiment)
In the following, the subject is a living organism or an object having joints, such as a person, an animal, an insect, or a robot. As an example, the subject is a person.

FIG. 1 is a diagram showing an example of the configuration of the estimation system 1a. The estimation system 1a is a system including the estimation device 10. The estimation device 10 is an information processing device that estimates the three-dimensional joint positions of a subject. The estimation device 10 is, for example, a server device, a personal computer, a tablet terminal, or a smartphone terminal. The estimation device 10 may also execute the estimation process using, for example, a plurality of distributed information processing devices. The estimation device 10 includes a learning device 11, an input data generation unit 14, and an analysis unit 15.

The learning device 11 includes a learning data generation unit 12 and a learning unit 13. The learning data generation unit 12 may instead be provided in a learning data generation device separate from the learning device 11. In this case, the learning data generated by the learning data generation device may be supplied to the learning device 11 via a network, a storage medium, or the like. The learning data generation unit 12, acting as a learning data generation device, may include the learning unit 13. The estimation device 10 may be separate from the learning device 11. That is, the estimation system 1a may include the estimation device 10 and the learning device 11 as separate devices.

The estimation device 10 may further include a storage unit. The storage unit is a storage device having a non-volatile (non-transitory) recording medium, such as a magnetic hard disk device or a semiconductor storage device. The storage unit may store various data related to machine learning, images, and computer programs. The data related to machine learning are, for example, data representing an algorithm and data representing the configuration and parameters of a deep neural network (DNN).

Some or all of the functional units are realized, for example, by a processor such as a CPU (Central Processing Unit) executing a computer program stored in a predetermined storage unit. Some or all of the functional units may instead be realized using hardware such as an LSI (Large Scale Integration) or an ASIC (Application Specific Integrated Circuit).

The learning data generation unit 12 (learning data generation device) generates learning data used for machine learning. The learning data generated by the learning data generation unit 12 can also be used for machine learning other than deep neural networks. A three-dimensional model image, which is an image representing a three-dimensional model of the subject, is generated in advance by computer graphics. The learning data generation unit 12 acquires three-dimensional model images to which joint information has been added. From these computer graphics three-dimensional model images, the learning data generation unit 12 generates silhouette images of the three-dimensional model as captured from a plurality of cameras (multiple viewpoints) around the three-dimensional model (hereinafter referred to as "multi-viewpoint model silhouette images").

In the following, the number of joints (pieces of joint part information) of the subject is 16 as an example. If, for example, three-dimensional joint information of the fingertips is also to be estimated, the number of joints of the subject is increased, and the number of channels in the output layer of the deep neural network is increased according to the number of joints.

The learning data generation unit 12 acquires camera parameters for each multi-viewpoint model silhouette image. The learning data generation unit 12 generates the multi-viewpoint model silhouette images by performing rendering processing on the three-dimensional model images for a plurality of viewpoints. Based on the camera parameters of each viewpoint and the multi-viewpoint model silhouette images, the learning data generation unit 12 restores the three-dimensional shape information (voxel shape information) of the three-dimensional model. The three-dimensional shape information of the three-dimensional model is binary: its value is 1 in voxels belonging to the shape of the three-dimensional model and 0 elsewhere. The learning data generation unit 12 outputs the three-dimensional shape information of the three-dimensional model to the learning unit 13 as learning data (teacher data for learning) used for machine learning.

The learning data generation unit 12 extracts the joint information from the three-dimensional model images to which the joint information has been added. The learning data generation unit 12 voxelizes the joint information in the same voxel space as the three-dimensional shape information of the three-dimensional model. That is, the learning data generation unit 12 generates the three-dimensional joint information (voxel joint information) of the three-dimensional model in the same voxel space that was used when restoring the three-dimensional shape information of the three-dimensional model.

When generating a plurality of learning data, the learning data generation unit 12 changes the three-dimensional shape of the subject generated by computer graphics, thereby changing the three-dimensional shape information of the three-dimensional model. It then applies the same voxelization process to each of the resulting three-dimensional models, which differ in shape information.

The learning unit 13 executes machine learning using the learning data. The learning unit 13 does not need to execute machine learning every time the estimation device 10 executes the estimation process. For example, the learning unit 13 may complete the machine learning before the estimation device 10 executes the estimation process.

The learning unit 13 is not limited to deep neural networks and may execute other machine learning such as genetic programming or clustering. In the first embodiment, the learning unit 13 executes learning processing of a deep neural network as an example of machine learning. The input of the deep neural network is the three-dimensional shape information of the three-dimensional model. The output of the deep neural network is the three-dimensional joint information. The output channels of the deep neural network are associated with the joint part information that constitutes the three-dimensional joint information.

When the learning data generation unit 12 has generated a plurality of learning data, the learning unit 13 learns the parameters of the deep neural network based on those learning data. The learning unit 13 outputs information representing the configuration and parameters of the trained deep neural network to the analysis unit 15.

Hereinafter, live-action images of the subject captured from a plurality of cameras (multiple viewpoints) around the subject are referred to as "multi-viewpoint live-action images". The input data generation unit 14 acquires multi-viewpoint live-action images capturing the subject whose joint information is to be estimated. The input data generation unit 14 also acquires the camera parameters of the multi-viewpoint live-action images.

The input data generation unit 14 restores the three-dimensional shape information of the live-action subject by applying to the multi-viewpoint live-action images the same voxelization process that the learning data generation unit 12 applied to the multi-viewpoint model images. The three-dimensional shape information of the subject is binary: its value is 1 in voxels belonging to the shape and 0 elsewhere.

The analysis unit 15 acquires the parameters of the trained deep neural network from the learning unit 13. The analysis unit 15 feeds the three-dimensional shape information of the subject restored by the input data generation unit 14 into the trained deep neural network, and obtains the three-dimensional joint position information of the subject as the output of each channel of the trained deep neural network. In this way, the analysis unit 15 can estimate the three-dimensional joint information of the subject.
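The text does not specify how a joint position is read out of each output channel. A minimal sketch, assuming the network returns one score volume per joint and that the peak voxel of each channel is taken as that joint's position (the function name `decode_joint_positions`, the argmax rule, and the `grid_min`/`voxel_size` parameters are illustrative assumptions, not part of the patent):

```python
import numpy as np

def decode_joint_positions(output, grid_min, voxel_size):
    """Convert a (J, X, Y, Z) score volume into (J, 3) world coordinates.

    Assumption: channel j holds the scores for joint part j, and the joint
    is placed at the center of the highest-scoring voxel in that channel.
    """
    output = np.asarray(output)
    n_joints = output.shape[0]
    # Flattened index of the peak voxel in every channel.
    flat = output.reshape(n_joints, -1).argmax(axis=1)
    # Back to (x, y, z) voxel indices, one row per joint.
    idx = np.stack(np.unravel_index(flat, output.shape[1:]), axis=1)
    # Voxel index -> world coordinate of the voxel center.
    return np.asarray(grid_min, dtype=float) + (idx + 0.5) * voxel_size
```

Because the channel index plays the role of the joint part information, row j of the result can be paired directly with joint number j.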

Next, the details of each functional unit will be described.
FIG. 2 is a diagram showing an example of the configuration of the learning data generation unit 12. The learning data generation unit 12 includes a silhouette image rendering unit 121, a camera parameter output unit 122, a shape information restoration unit 123, a joint information output unit 124, and a joint information voxelization unit 125.

The silhouette image rendering unit 121 acquires, for each viewpoint (camera position), a three-dimensional model image to which joint information has been added. By performing rendering processing on the three-dimensional model images for a plurality of viewpoints, the silhouette image rendering unit 121 generates a silhouette image of the three-dimensional model of the subject for each viewpoint, that is, the multi-viewpoint model silhouette images. The viewpoints are determined according to the intended use of the shape information estimation. The viewpoints are, for example, viewpoints all around the subject. When the positions (viewpoints) of the cameras that capture the subject are fixed, the orientations of the cameras may also be fixed. The silhouette image rendering unit 121 outputs the multi-viewpoint model silhouette images to the shape information restoration unit 123.
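As an illustration of "viewpoints all around the subject", the following sketch places virtual cameras evenly on a horizontal ring and computes each camera's extrinsic parameters. The ring layout, the z-up world, the x-right/y-down/z-forward camera convention, and the names `look_at_camera` and `ring_of_cameras` are assumptions made for this example; the text does not prescribe a camera arrangement.

```python
import numpy as np

def look_at_camera(position, target, up=(0.0, 0.0, 1.0)):
    """Return (R, t) such that x_cam = R @ x_world + t.

    Camera axes: x to the right, y downward, z along the viewing direction.
    """
    position = np.asarray(position, dtype=float)
    forward = np.asarray(target, dtype=float) - position
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up, dtype=float))
    right /= np.linalg.norm(right)
    down = np.cross(forward, right)
    R = np.stack([right, down, forward])  # rows are the camera axes
    t = -R @ position
    return R, t

def ring_of_cameras(n_views=8, radius=3.0, height=1.0):
    """Place n_views cameras on a circle, all looking at the ring's axis."""
    cameras = []
    for k in range(n_views):
        angle = 2.0 * np.pi * k / n_views
        pos = (radius * np.cos(angle), radius * np.sin(angle), height)
        cameras.append(look_at_camera(pos, target=(0.0, 0.0, height)))
    return cameras
```

Each `(R, t)` pair, combined with an intrinsic matrix, gives the per-viewpoint camera parameters that the shape restoration step needs.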

The camera parameter output unit 122 (camera parameter unit) acquires camera parameters for each viewpoint by extracting them from the three-dimensional model images to which the joint information has been added. That is, the camera parameter output unit 122 acquires camera parameters for each multi-viewpoint model silhouette image. The camera parameter output unit 122 outputs the camera parameters of the multi-viewpoint model silhouette images to the shape information restoration unit 123 for each viewpoint.

Based on the camera parameters of each viewpoint and the multi-viewpoint model silhouette images, the shape information restoration unit 123 restores the three-dimensional shape information of the three-dimensional model. The restoration method is not limited to a specific method; one example is the volume intersection (visual hull) method. The parameters of the restoration, namely the extent of the voxel space of the restored shape information, the volume of each voxel, and the voxel resolution, may be set arbitrarily. The shape information restoration unit 123 outputs the three-dimensional shape information of the three-dimensional model to the learning unit 13.
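The volume intersection method named above can be sketched as follows: a voxel is kept only if its center projects inside the silhouette in every view. This is a simplified illustration, assuming pinhole cameras given as 3x4 projection matrices P = K [R | t] and a test of voxel centers only; the function name `carve_visual_hull` and its parameters are hypothetical.

```python
import numpy as np

def carve_visual_hull(silhouettes, projections, grid_min, voxel_size, grid_shape):
    """Return a binary (X, Y, Z) volume: 1 inside the visual hull, 0 elsewhere.

    silhouettes : list of 2D bool arrays (H, W), True on the model
    projections : list of 3x4 matrices mapping homogeneous world points
                  to homogeneous pixel coordinates (u*w, v*w, w)
    """
    idx = np.indices(grid_shape).reshape(3, -1).T             # (N, 3) voxel indices
    centers = np.asarray(grid_min, dtype=float) + (idx + 0.5) * voxel_size
    homog = np.hstack([centers, np.ones((len(centers), 1))])  # (N, 4)
    occupied = np.ones(len(centers), dtype=bool)
    for sil, P in zip(silhouettes, projections):
        uvw = homog @ np.asarray(P, dtype=float).T
        in_front = uvw[:, 2] > 1e-9                           # ignore points behind the camera
        w = np.where(in_front, uvw[:, 2], 1.0)
        u = np.floor(uvw[:, 0] / w).astype(int)
        v = np.floor(uvw[:, 1] / w).astype(int)
        h, wid = sil.shape
        inside = in_front & (u >= 0) & (u < wid) & (v >= 0) & (v < h)
        hit = np.zeros(len(centers), dtype=bool)
        hit[inside] = sil[v[inside], u[inside]]
        occupied &= hit                                       # intersection over all views
    return occupied.reshape(grid_shape).astype(np.uint8)
```

The arguments `grid_min`, `voxel_size`, and `grid_shape` correspond to the freely chosen voxel-space extent and resolution mentioned in the text; the same values would have to be reused when voxelizing the joint information.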

The joint information output unit 124 acquires the three-dimensional model images to which joint information has been added, and extracts the three-dimensional joint information from them. The coordinate system in which the joint position information constituting the joint information is expressed is, for example, a world coordinate system with x, y, and z axes. The joint information output unit 124 outputs the three-dimensional joint information to the joint information voxelization unit 125.

The joint information voxelization unit 125 voxelizes the joint information in the same voxel space as the three-dimensional shape information of the three-dimensional model. That is, the joint information voxelization unit 125 generates the three-dimensional joint information of the three-dimensional model in the same voxel space that was used when restoring the three-dimensional shape information of the three-dimensional model. The numbers serving as joint part information may be assigned to the pieces of joint position information in any order or by any rule. For example, the joint part information may be the consecutive numbers 1 to 16 assigned to the pieces of joint position information. The joint information voxelization unit 125 outputs the three-dimensional joint information of the three-dimensional model (the voxelized joint information of the three-dimensional model) to the learning unit 13.
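A minimal sketch of joint voxelization, assuming each joint is written as a single 1-valued voxel into its own channel (matching the one-channel-per-joint output layer described earlier). The one-hot encoding, the function name `voxelize_joints`, and its parameters are assumptions for illustration; the text only requires that the joints share the voxel space of the shape information.

```python
import numpy as np

def voxelize_joints(joints_world, grid_min, voxel_size, grid_shape):
    """Return a (J, X, Y, Z) uint8 volume with one marked voxel per joint.

    joints_world : (J, 3) world xyz coordinates; row j is joint part j
    grid_min, voxel_size, grid_shape : the same voxel-space parameters
        that were used to restore the shape information
    """
    joints_world = np.asarray(joints_world, dtype=float)
    volume = np.zeros((len(joints_world), *grid_shape), dtype=np.uint8)
    # World coordinate -> integer voxel index in the shared voxel space.
    idx = np.floor((joints_world - np.asarray(grid_min, dtype=float)) / voxel_size).astype(int)
    for j, (x, y, z) in enumerate(idx):
        # Joints that fall outside the voxel space leave their channel empty.
        if all(0 <= a < n for a, n in zip((x, y, z), grid_shape)):
            volume[j, x, y, z] = 1
    return volume
```

Here the row order of `joints_world` stands in for the joint part numbers, which the text allows to be assigned in any order.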

FIG. 3 is a diagram showing an example of the configuration of the learning unit 13. The learning unit 13 includes a voxel set generation unit 131, a voxel set determination unit 132, a network construction unit 133, and a parameter learning unit 134.

The voxel set generation unit 131 acquires the three-dimensional shape information of a plurality of three-dimensional models from the shape information restoration unit 123. By grouping this shape information, the voxel set generation unit 131 generates sets of three-dimensional shape information of the three-dimensional models (hereinafter referred to as "shape information voxel sets"). That is, the voxel set generation unit 131 converts the three-dimensional shape information of the plurality of three-dimensional models into voxel sets. When doing so, the voxel set generation unit 131 may allow the spatial positions of shape information voxel sets to overlap, or may shift the spatial position of one of the shape information voxel sets so that they do not overlap.

The voxel set generation unit 131 acquires the three-dimensional joint information of the plurality of three-dimensional models from the joint information voxelization unit 125. The voxel set generation unit 131 generates sets of three-dimensional joint information of the three-dimensional models (hereinafter referred to as "joint information voxel sets") by grouping the three-dimensional joint information of the plurality of three-dimensional models. That is, the voxel set generation unit 131 converts the three-dimensional joint information of the plurality of three-dimensional models into voxel sets. When doing so, the voxel set generation unit 131 may allow the joint information voxel sets to overlap, or it may move the spatial position of any of the joint information voxel sets so that they do not overlap.

The voxel set generation unit 131 associates each shape information voxel set with the joint information voxel set that has the same spatial position. The same values are used for the various parameters of a shape information voxel set and a joint information voxel set that have the same spatial position. For example, in the following description, so that the deep neural network can perform consistent learning processing, the length of one side of the cubic space of each shape information voxel set and joint information voxel set is set to 8, which is one example of a power of 2. Note that the length of one side of the voxel set space is not limited to any specific value.
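One possible reading of the voxel set generation above is that each voxel set is a cubic block with a side of 8 voxels cut from a larger voxel grid, and that a shape block and a joint block are paired when they share the same position. The following sketch illustrates that reading only. The sparse dictionary representation, the function names, and the 16-voxel grid are assumptions made for illustration.

```python
from itertools import product

SIDE = 8  # side length of one voxel set, the power-of-two example from the text


def block_origins(grid_size, side=SIDE, stride=SIDE):
    """Corner indices of cubic blocks; a stride below `side` yields overlapping blocks."""
    starts = range(0, grid_size - side + 1, stride)
    return list(product(starts, repeat=3))


def cut_block(voxels, origin, side=SIDE):
    """Extract one block from a sparse {index: value} grid, using block-local indices."""
    ox, oy, oz = origin
    return {
        (x - ox, y - oy, z - oz): value
        for (x, y, z), value in voxels.items()
        if ox <= x < ox + side and oy <= y < oy + side and oz <= z < oz + side
    }


def paired_voxel_sets(shape_voxels, joint_voxels, grid_size):
    """Pair the shape block and the joint block that share the same origin."""
    return [
        (origin, cut_block(shape_voxels, origin), cut_block(joint_voxels, origin))
        for origin in block_origins(grid_size)
    ]


# Hypothetical 16^3 grid: occupied voxels carry 1, joint voxels carry a label.
shape_voxels = {index: 1 for index in product(range(6, 10), repeat=3)}
joint_voxels = {(7, 7, 7): 1, (9, 9, 9): 2}
pairs = paired_voxel_sets(shape_voxels, joint_voxels, grid_size=16)
```

Because both blocks are cut with the same origin and side length, a voxel index in the shape block refers to the same location as that index in the joint block, which matches the requirement that corresponding sets share their parameters.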

The voxel set determination unit 132 selects the voxel sets to be used for learning the parameters of the deep neural network. For each shape information voxel set, the voxel set determination unit 132 determines, based on a predetermined condition, whether that set is to be used for learning the parameters. For example, the voxel set determination unit 132 determines whether the space occupied by the shape portion of the three-dimensional model is at least a predetermined ratio of the entire space of the shape information voxel set. That is, the voxel set determination unit 132 determines that a shape information voxel set is to be used for learning the parameters of the deep neural network when the number of voxels whose value is 1 is at least the predetermined ratio of the total number of voxels in that set. The predetermined ratio is, for example, one third.

ボクセルセット判定部１３２は、ディープニューラルネットワークのパラメータの学習に使用する関節情報ボクセルセットであるか否かを、関節情報ボクセルセットごとに判定する。ボクセルセット判定部１３２は、パラメータの学習に用いる形状情報ボクセルセットに対応付けられた関節情報ボクセルセットを、ディープニューラルネットワークのパラメータの学習に用いる関節情報ボクセルセットであると判定する。 The voxel set determining unit 132 determines, for each joint information voxel set, whether or not the joint information voxel set is used for learning the parameters of the deep neural network. The voxel set determining unit 132 determines that the joint information voxel set associated with the shape information voxel set used for parameter learning is a joint information voxel set used for deep neural network parameter learning.
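The selection rule above can be sketched as follows, assuming each shape information voxel set is stored as a dense 8 x 8 x 8 nested list of 0/1 values and that each joint information voxel set simply follows the decision made for its paired shape set. The function names and the test blocks are hypothetical.

```python
from fractions import Fraction

SIDE = 8
MIN_RATIO = Fraction(1, 3)  # example threshold given in the text


def occupancy_ratio(shape_set, side=SIDE):
    """Fraction of voxels in a dense side^3 block whose value is 1."""
    occupied = sum(v == 1 for plane in shape_set for row in plane for v in row)
    return Fraction(occupied, side ** 3)


def select_for_training(pairs, min_ratio=MIN_RATIO):
    """Keep (shape_set, joint_set) pairs whose shape set is occupied enough."""
    return [(shape, joint) for shape, joint in pairs
            if occupancy_ratio(shape) >= min_ratio]


def filled_block(depth, side=SIDE):
    """Hypothetical test block: the first `depth` planes are fully occupied."""
    return [
        [[1 if z < depth else 0 for _ in range(side)] for _ in range(side)]
        for z in range(side)
    ]


pairs = [
    (filled_block(4), "joints A"),  # ratio 1/2, kept
    (filled_block(2), "joints B"),  # ratio 1/4, rejected
    (filled_block(3), "joints C"),  # ratio 3/8, kept
]
selected = select_for_training(pairs)
```

Using an exact fraction rather than a floating-point value avoids rounding questions when a block sits exactly at the threshold.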

Deep learning requires a large amount of learning data and an enormous amount of processing time. For this reason, a graphics processing unit (GPU) has conventionally been used in some cases to speed up the learning process. In the present embodiment, the learning target is three-dimensional data, so the GPU must handle far more learning data than it would for two-dimensional data. If the large amount of input learning data were used as is for training the deep neural network, the GPU would very likely run out of memory. The voxel set determination unit 132 therefore converts the learning target into voxel sets and has the parameter learning unit 134 selectively learn only the voxel sets needed for learning. This allows the learning unit 13 to save memory and to process faster.

Furthermore, because the learning unit 13 handles the shape information as three-dimensional data, regions of a voxel set that contain no shape information account for a very large share of the voxel set's space. In other words, regions that carry no meaning as shape information account for a very large share of the space. If such regions were used for learning the parameters of the deep neural network in the same way as regions that do carry meaning as shape information, they would strongly influence the parameter learning results. Using regions that carry no meaning as shape information for learning the parameters would therefore severely degrade the accuracy of the parameter learning results.

The learning unit 13 converts the three-dimensional shape information of the three-dimensional model into voxel sets and then excludes regions that carry no meaning as three-dimensional shape information from the parameter learning targets, which can improve the accuracy of the parameter learning results. Note that, when computing the loss, the learning unit 13 may set the weight of the joint part information (labels) in regions that carry no meaning as shape information to a very small value.
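The down-weighting mentioned above can be illustrated with a weighted cross-entropy, which is one common way to realize it. This is a sketch under assumptions: the description does not name the loss function or the weight value, so the cross-entropy form, the value `1e-3`, and the normalization by the sum of weights are all hypothetical.

```python
import math

EMPTY_WEIGHT = 1e-3  # hypothetical "very small" weight; the text gives no value


def weighted_cross_entropy(probabilities, labels, occupied, empty_weight=EMPTY_WEIGHT):
    """Weighted mean of -log p[label] over voxels.

    probabilities: per-voxel list of class probabilities (softmax output)
    labels:        per-voxel index of the correct channel
    occupied:      per-voxel flag, True where the shape voxel is 1
    """
    total = 0.0
    weight_sum = 0.0
    for probs, label, is_occupied in zip(probabilities, labels, occupied):
        weight = 1.0 if is_occupied else empty_weight
        total += -weight * math.log(probs[label])
        weight_sum += weight
    return total / weight_sum


# Two voxels, three channels: the second voxel is empty and badly predicted.
probabilities = [[0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
labels = [1, 0]
occupied = [True, False]
weighted = weighted_cross_entropy(probabilities, labels, occupied)
unweighted = weighted_cross_entropy(probabilities, labels, occupied, empty_weight=1.0)
```

With the small weight, the poorly predicted empty voxel contributes almost nothing, so the loss is dominated by the occupied voxel.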

図４は、ディープニューラルネットワークの構成の例を示す図である。ネットワーク構築部１３３は、ディープニューラルネットワークを構築する。ディープニューラルネットワークの入力は、カーネルサイズ「８×８×８」でチャネル数が１ｃｈの情報を持つ形状情報ボクセルセットである。ディープニューラルネットワークの入力の形状情報ボクセルセットは、ボクセルセット判定部１３２によって学習に使用すると判定された形状情報ボクセルセットである。 FIG. 4 is a diagram illustrating an example of the configuration of a deep neural network. The network construction unit 133 constructs a deep neural network. The input of the deep neural network is a shape information voxel set having information of the kernel size “8 × 8 × 8” and the number of channels of 1ch. The input shape information voxel set of the deep neural network is a shape information voxel set determined to be used for learning by the voxel set determination unit 132.

The deep neural network applies, to the input shape information voxel set, a convolution process using a filter with a kernel size of 3 × 3 × 3 and a pooling process with a kernel size of 2 × 2 × 2 and a stride of 2, with 10 channels. The activation function is, for example, ReLU (rectified linear unit) ("Conv1" in the upper part of FIG. 4).

The deep neural network then changes the number of channels to 20 and performs the convolution process in the same way ("Conv2"). The network construction unit 133 also changes the number of channels to 16 and performs the convolution process in the same way ("Conv3"). Furthermore, the network construction unit 133 performs a deconvolution process using a filter with a kernel size of 3 × 3 × 3 three times, with 16 channels ("DeConv1" to "DeConv3").

The deep neural network restores the three-dimensional shape information whose kernel size has been reduced to three-dimensional shape information of the original kernel size. The network construction unit 133 performs a convolution process using a filter with a kernel size of 1 × 1 × 1, with 16 channels ("Conv1" in the lower part of FIG. 4).
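How the side length of the data shrinks and is restored through the layers above can be traced with the standard output-size formulas for convolution and transposed convolution. The description does not state the padding of the convolutions or the stride and padding of the deconvolutions, so the values below (padding 1 for the 3 × 3 × 3 convolutions; stride 2, padding 1, and an extra output padding of 1 for the deconvolutions) are assumptions chosen so that the input and output resolutions match. The sketch also assumes that Conv2 and Conv3 each include the same pooling step as Conv1.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output side length of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1


def deconv_out(size, kernel, stride=1, pad=0, output_pad=0):
    """Output side length of a transposed convolution."""
    return stride * (size - 1) + kernel - 2 * pad + output_pad


def trace_sizes(side=8):
    """Side length after each stage of the assumed encoder-decoder."""
    sizes = [side]
    for _ in range(3):  # Conv1..Conv3: 3x3x3 convolution, then 2x2x2 pooling, stride 2
        side = conv_out(side, kernel=3, stride=1, pad=1)  # assumed padding of 1
        side = conv_out(side, kernel=2, stride=2)         # pooling halves the side
        sizes.append(side)
    for _ in range(3):  # DeConv1..DeConv3: 3x3x3 transposed convolution
        side = deconv_out(side, kernel=3, stride=2, pad=1, output_pad=1)  # assumed
        sizes.append(side)
    sizes.append(conv_out(side, kernel=1))  # final 1x1x1 convolution keeps the side
    return sizes


sizes = trace_sizes(8)
```

Under these assumptions the side length goes 8, 4, 2, 1 through the three convolution stages and back to 8 through the three deconvolutions, which is consistent with the requirement that input and output have the same voxel resolution.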

ディープニューラルネットワークは、チャネル方向に「ｓｏｆｔｍａｘ」処理を施した結果である数値を、チャネルごとに出力する。ディープニューラルネットワークの出力は、入力の形状情報ボクセルセットに対応する関節情報ボクセルセットである。ディープニューラルネットワークは、関節部位情報に対応するチャネルから、関節部位情報に対応する関節位置情報を出力する。 The deep neural network outputs a numerical value that is a result of performing the “softmax” process in the channel direction for each channel. The output of the deep neural network is a joint information voxel set corresponding to the input shape information voxel set. The deep neural network outputs joint position information corresponding to joint part information from a channel corresponding to joint part information.

The kernel size of the output joint information voxel set is 8 × 8 × 8, the same as that of the input shape information voxel set. The number of channels in the output layer of the deep neural network is the sum of the number of joint parts to be estimated and the number of pieces of joint part information that are not assigned to any of the joint parts to be estimated. Here, the number of channels in the output layer is 17, obtained by adding 1 channel to the 16 channels that represent the number of joint parts to be estimated (the number of pieces of joint part information). The value in each channel of a voxel represents the probability that the joint position information associated with that channel coincides with that voxel. The optimization method is, for example, Adam. The learning rate is, for example, 10⁻⁴.
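The per-voxel channel probabilities described above come from a softmax taken across the 17 channels. A minimal sketch for a single voxel follows; the raw scores are hypothetical, and which channel index corresponds to the "no joint" case is not specified in this description.

```python
import math

NUM_JOINTS = 16
NUM_CHANNELS = NUM_JOINTS + 1  # 16 joint channels plus one "no joint" channel


def softmax(scores):
    """Numerically stable softmax over one voxel's channel scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for a single voxel; channel 5 scores highest.
scores = [0.0] * NUM_CHANNELS
scores[5] = 4.0
probabilities = softmax(scores)
```

The resulting values are non-negative and sum to 1 across the channels of the voxel, so each can be read as the probability that the corresponding joint part lies in that voxel.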

最適化手法の種類は、各種パラメータとして任意に定められてもよい。カーネルサイズ、ストライド幅、活性化関数の種類、学習率等である各種パラメータは、任意の値又は関数に定められてもよい。ただし、入力の形状情報ボクセルセットと出力の関節情報ボクセルセットとでボクセル解像度が同一となるように、各種パラメータは定められる。 The type of optimization method may be arbitrarily determined as various parameters. Various parameters such as kernel size, stride width, type of activation function, learning rate, and the like may be set to arbitrary values or functions. However, various parameters are determined so that the input shape information voxel set and the output joint information voxel set have the same voxel resolution.

図３に戻り、学習部１３の構成の例の説明を続ける。パラメータ学習部１３４は、学習に使用すると判定された形状情報ボクセルセットを、ボクセルセット判定部１３２から取得する。パラメータ学習部１３４は、学習に使用すると判定された関節情報ボクセルセットを、ボクセルセット判定部１３２から取得する。パラメータ学習部１３４は、構築されたディープニューラルネットワークの構成を表す情報を、ネットワーク構築部１３３から取得する。パラメータ学習部１３４は、構築されたディープニューラルネットワークを用いて、ディープニューラルネットワークのパラメータを学習する。 Returning to FIG. 3, the description of the configuration example of the learning unit 13 is continued. The parameter learning unit 134 acquires the shape information voxel set determined to be used for learning from the voxel set determination unit 132. The parameter learning unit 134 acquires the joint information voxel set determined to be used for learning from the voxel set determination unit 132. The parameter learning unit 134 acquires information representing the configuration of the constructed deep neural network from the network construction unit 133. The parameter learning unit 134 learns the parameters of the deep neural network using the constructed deep neural network.

The parameter learning unit 134 sets the initial parameters of the deep neural network to, for example, the parameters defined in the deep learning library Chainer. The parameter learning unit 134 performs, for example, 10,000 learning iterations using the deep neural network constructed by the network construction unit 133. The initial parameters and the number of iterations may be determined arbitrarily in advance. The parameter learning unit 134 outputs information representing the configuration and parameters of the trained deep neural network to the analysis unit 15.
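To give a feel for the iterative learning above, the following sketch runs the Adam update rule on a single scalar parameter for 10,000 iterations with a learning rate of 10⁻⁴, the example values from the text. The toy objective, the function name, and the remaining hyperparameters (the commonly used defaults 0.9, 0.999, and 1e-8) are assumptions; the actual device would update all network parameters using a deep learning library.

```python
import math


def adam_minimize(grad, x, steps=10_000, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on a single scalar parameter and return its final value."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g      # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g  # second-moment estimate
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = x^2 with gradient 2x, standing in for the network loss.
start = 1.0
final = adam_minimize(lambda x: 2 * x, start)
```

Because Adam normalizes the step by the gradient magnitude, each early iteration moves the parameter by roughly the learning rate, which is why a small learning rate is paired with a large iteration count.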

このようにして、パラメータ学習部１３４は、被写体の三次元の関節位置情報を多視点実写画像から推定するためのディープニューラルネットワークのパラメータを定めることができる。また、パラメータ学習部１３４は、三次元の形状情報に「ｃｏｎｖｏｌｕｔｉｏｎ」処理及びプーリング処理を繰り返し施すので、被写体の三次元の関節位置情報を多視点実写画像から推定することが可能となる。 In this way, the parameter learning unit 134 can determine the parameters of the deep neural network for estimating the three-dimensional joint position information of the subject from the multi-viewpoint photographed image. Further, the parameter learning unit 134 repeatedly performs the “convolution” process and the pooling process on the three-dimensional shape information, so that the three-dimensional joint position information of the subject can be estimated from the multi-viewpoint photographed image.

図５は、入力データ生成部１４の構成の例を示す図である。入力データ生成部１４は、シルエット生成部１４１と、形状情報生成部１４２とを備える。シルエット生成部１４１は、関節情報が推定される対象としての被写体が撮像された多視点実写画像を取得する。多視点実写画像は、どのような方式のカメラで被写体が撮影された画像でもよいが、例えば、カラーカメラで被写体が撮像された画像である。入力データ生成部１４は、多視点実写画像をハードディスクドライブ等の記録媒体から取得してもよい。 FIG. 5 is a diagram illustrating an example of the configuration of the input data generation unit 14. The input data generation unit 14 includes a silhouette generation unit 141 and a shape information generation unit 142. The silhouette generation unit 141 acquires a multi-view live-action image in which a subject as an object for which joint information is estimated is captured. The multi-view photographed image may be an image obtained by photographing the subject with any type of camera, but is, for example, an image obtained by photographing the subject with a color camera. The input data generation unit 14 may acquire the multi-viewpoint photographed image from a recording medium such as a hard disk drive.

シルエット生成部１４１は、多視点実写画像から被写体領域を抽出することによって、多視点の被写体のシルエット画像（以下「多視点被写体シルエット画像」という。）を生成する。シルエット生成部１４１は、多視点被写体シルエット画像を任意の手法を用いて生成してもよいが、例えば、背景差分又はグラフカットの手法で多視点被写体シルエット画像を生成する。 The silhouette generation unit 141 generates a silhouette image of a multi-view subject (hereinafter referred to as a “multi-view subject silhouette image”) by extracting a subject area from the multi-view live-action image. The silhouette generation unit 141 may generate a multi-view subject silhouette image using any method, but for example, generates a multi-view subject silhouette image using a background difference or graph cut method.
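Background subtraction, one of the two example methods named above, can be sketched as a per-pixel threshold on the difference between a frame and a background image. The sketch assumes grayscale images stored as nested lists; the threshold value and the function name are hypothetical.

```python
def silhouette_by_background_difference(frame, background, threshold=30):
    """Return a binary mask: 1 where the frame differs from the background.

    `frame` and `background` are equally sized 2D lists of grayscale values
    (0..255). The threshold of 30 is a hypothetical value.
    """
    return [
        [1 if abs(pixel - bg_pixel) > threshold else 0
         for pixel, bg_pixel in zip(row, bg_row)]
        for row, bg_row in zip(frame, background)
    ]


background = [[10, 10, 10, 10], [10, 10, 10, 10], [10, 10, 10, 10]]
frame = [[10, 12, 10, 10], [10, 200, 180, 10], [10, 190, 10, 35]]
mask = silhouette_by_background_difference(frame, background)
```

Small differences such as sensor noise stay below the threshold and are treated as background, while the subject's pixels are marked 1 in the silhouette.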

形状情報生成部１４２は、各視点のカメラパラメータに基づいて、多視点被写体シルエット画像から、被写体の三次元の形状情報を復元する。すなわち、形状情報生成部１４２は、各視点のカメラパラメータと多視点被写体シルエット画像とに基づいて、被写体の三次元の形状情報を復元する。形状情報生成部１４２は、学習データ生成部１２の形状情報復元部１２３が実行した処理と同様の処理を実行することで、多視点被写体シルエット画像から、被写体の三次元の形状情報を復元する。例えば、形状情報生成部１４２は、形状情報復元部１２３が使用した各種パラメータと同一のパラメータを使用して、多視点被写体シルエット画像から、被写体の三次元の形状情報を復元する。形状情報生成部１４２は、被写体の三次元の形状情報を、解析部１５に出力する。 The shape information generation unit 142 restores the three-dimensional shape information of the subject from the multi-view subject silhouette image based on the camera parameters of each viewpoint. That is, the shape information generation unit 142 restores the three-dimensional shape information of the subject based on the camera parameters of each viewpoint and the multi-view subject silhouette image. The shape information generation unit 142 restores the three-dimensional shape information of the subject from the multi-view subject silhouette image by executing a process similar to the process executed by the shape information restoration unit 123 of the learning data generation unit 12. For example, the shape information generation unit 142 restores the three-dimensional shape information of the subject from the multi-view subject silhouette image using the same parameters as the various parameters used by the shape information restoration unit 123. The shape information generation unit 142 outputs the three-dimensional shape information of the subject to the analysis unit 15.
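Restoring shape from silhouettes is commonly done by carving a visual hull: a voxel is kept only if it projects inside the silhouette in every view. The sketch below illustrates that general technique and is not necessarily the exact procedure of the shape information restoration unit 123. For brevity it uses hypothetical orthographic projection functions in place of real camera parameters, which in practice would define a perspective projection for each viewpoint.

```python
from itertools import product


def carve_visual_hull(views, grid_size):
    """Keep the voxels whose projection lands inside every silhouette.

    `views` is a list of (project, silhouette) pairs. `project` maps a voxel
    index (x, y, z) to pixel coordinates (u, v); `silhouette` is a 2D list of
    0/1 values indexed as silhouette[v][u].
    """
    occupied = set()
    for voxel in product(range(grid_size), repeat=3):
        inside_all = True
        for project, silhouette in views:
            u, v = project(voxel)
            in_image = 0 <= v < len(silhouette) and 0 <= u < len(silhouette[0])
            if not in_image or silhouette[v][u] != 1:
                inside_all = False
                break
        if inside_all:
            occupied.add(voxel)
    return occupied


# Hypothetical orthographic cameras looking along the z axis and the x axis.
def along_z(voxel):
    x, y, _ = voxel
    return x, y


def along_x(voxel):
    _, y, z = voxel
    return z, y


square = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
hull = carve_visual_hull([(along_z, square), (along_x, square)], grid_size=4)
```

Using the same grid size and projection parameters at training time and at estimation time keeps the restored shapes comparable, in line with the statement that unit 142 uses the same parameters as unit 123.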

図６は、解析部１５の構成の例を示す図である。解析部１５は、被写体の三次元の形状情報を、形状情報生成部１４２から取得する。解析部１５は、学習済みディープニューラルネットワークの構成及びパラメータを表す情報を、パラメータ学習部１３４から取得する。解析部１５は、被写体の三次元の形状情報を学習済みディープニューラルネットワークの入力とすることによって、被写体の三次元の形状情報から被写体の関節位置情報を推定する。 FIG. 6 is a diagram illustrating an example of the configuration of the analysis unit 15. The analysis unit 15 acquires the three-dimensional shape information of the subject from the shape information generation unit 142. The analysis unit 15 acquires information representing the configuration and parameters of the learned deep neural network from the parameter learning unit 134. The analysis unit 15 estimates the joint position information of the subject from the three-dimensional shape information of the subject by using the three-dimensional shape information of the subject as an input of the learned deep neural network.

解析部１５は、ディープニューラルネットワークの出力層のチャネルに対応付けられた関節部位情報を関節位置情報に付与することによって、被写体の三次元の関節情報を生成する。これによって、解析部１５は、各視点のカメラパラメータに基づいて、多視点被写体シルエット画像から被写体の三次元の関節情報を推定することが可能となる。解析部１５は、被写体の三次元の関節情報を外部装置に出力する。外部装置が記録媒体である場合、解析部１５は、被写体の三次元の関節情報を外部装置に記録してもよい。 The analysis unit 15 generates the three-dimensional joint information of the subject by adding the joint part information associated with the channel of the output layer of the deep neural network to the joint position information. Accordingly, the analysis unit 15 can estimate the three-dimensional joint information of the subject from the multi-view subject silhouette image based on the camera parameters of each viewpoint. The analysis unit 15 outputs the three-dimensional joint information of the subject to the external device. When the external device is a recording medium, the analysis unit 15 may record the three-dimensional joint information of the subject on the external device.
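One simple way to turn the network output into labeled joint positions is to take, for each joint channel, the voxel with the highest probability and attach the channel's joint part number to it. The description does not specify the decoding rule, so the argmax rule, the choice of channel 0 as the "no joint" channel, and the function name below are assumptions.

```python
def decode_joints(probabilities, background_channel=0):
    """Pick, for every joint channel, the voxel with the highest probability.

    `probabilities` maps a voxel index (x, y, z) to that voxel's list of
    per-channel probabilities. Returns {channel: voxel index}.
    """
    num_channels = len(next(iter(probabilities.values())))
    joints = {}
    for channel in range(num_channels):
        if channel == background_channel:
            continue  # the assumed "no joint" channel yields no joint position
        joints[channel] = max(
            probabilities, key=lambda voxel: probabilities[voxel][channel]
        )
    return joints


# Hypothetical output with 3 channels: background plus two joint parts.
probabilities = {
    (0, 0, 0): [0.90, 0.05, 0.05],
    (1, 2, 3): [0.10, 0.85, 0.05],
    (4, 4, 4): [0.20, 0.10, 0.70],
}
joints = decode_joints(probabilities)
```

The returned mapping pairs each joint part number with a voxel index, which corresponds to joint position information with joint part information attached.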

Next, an example of the operation of the estimation device 10 will be described.
FIG. 7 is a flowchart illustrating an example of the operation of the estimation device 10. The estimation device 10 generates multi-view model silhouette images (step S101). The estimation device 10 restores the three-dimensional shape information of the three-dimensional model from the multi-view model silhouette images based on the camera parameters of each viewpoint (step S102). The estimation device 10 determines the parameters of the deep neural network based on the three-dimensional shape information and the three-dimensional joint information of the three-dimensional model (step S103).

The estimation device 10 restores the three-dimensional shape information of the subject from the multi-view subject silhouette images (step S104). The estimation device 10 estimates the three-dimensional joint information of the subject by using the three-dimensional shape information of the subject as the input to the trained deep neural network (step S105). The estimation device 10 outputs the three-dimensional joint information of the subject to an external device and records it there (step S106).

図８は、学習データ生成部１２の動作の例を示すフローチャートである。学習データ生成部１２は、関節情報が付与された三次元モデル画像を、視点ごとに取得する（ステップＳ２０１）。学習データ生成部１２は、視点ごとの三次元モデル画像から多視点モデルシルエット画像を生成し、多視点モデルシルエット画像とカメラパラメータとを出力する（ステップＳ２０２）。 FIG. 8 is a flowchart illustrating an example of the operation of the learning data generation unit 12. The learning data generation unit 12 acquires a three-dimensional model image to which joint information is assigned for each viewpoint (step S201). The learning data generation unit 12 generates a multi-view model silhouette image from the three-dimensional model image for each viewpoint, and outputs the multi-view model silhouette image and camera parameters (step S202).

学習データ生成部１２は、カメラパラメータに基づいて、多視点モデルシルエット画像から、三次元モデルの三次元の形状情報を復元する（ステップＳ２０３）。学習データ生成部１２は、三次元モデルの三次元の形状情報と三次元モデルの三次元の関節情報とを、学習部１３に出力する（ステップＳ２０４）。 The learning data generation unit 12 restores the three-dimensional shape information of the three-dimensional model from the multi-view model silhouette image based on the camera parameters (step S203). The learning data generation unit 12 outputs the 3D shape information of the 3D model and the 3D joint information of the 3D model to the learning unit 13 (step S204).

図９は、学習部１３の動作の例を示すフローチャートである。学習部１３は、形状情報ボクセルセットと関節情報ボクセルセットとを生成する（ステップＳ３０１）。学習部１３は、学習に用いるボクセルセットであるか否かを、ボクセルセットごとに判定する（ステップＳ３０２）。学習部１３は、形状情報ボクセルセットを入力とし、関節情報ボクセルセットを出力とするディープニューラルネットワークを構築する（ステップＳ３０３）。学習部１３は、ディープニューラルネットワークのパラメータを学習する（ステップＳ３０４）。 FIG. 9 is a flowchart illustrating an example of the operation of the learning unit 13. The learning unit 13 generates a shape information voxel set and a joint information voxel set (step S301). The learning unit 13 determines for each voxel set whether the voxel set is used for learning (step S302). The learning unit 13 constructs a deep neural network having the shape information voxel set as an input and the joint information voxel set as an output (step S303). The learning unit 13 learns the parameters of the deep neural network (step S304).

図１０は、入力データ生成部１４の動作の例を示すフローチャートである。入力データ生成部１４は、被写体が撮像された多視点実写画像を取得する（ステップＳ４０１）。入力データ生成部１４は、多視点実写画像から多視点被写体シルエット画像を生成する（ステップＳ４０２）。入力データ生成部１４は、各視点のカメラパラメータに基づいて、多視点被写体シルエット画像から、被写体の三次元の形状情報を復元する（ステップＳ４０３）。入力データ生成部１４は、被写体の三次元の形状情報を、解析部１５に出力する（ステップＳ４０４）。 FIG. 10 is a flowchart illustrating an example of the operation of the input data generation unit 14. The input data generation unit 14 acquires a multi-viewpoint live-action image in which the subject is imaged (step S401). The input data generation unit 14 generates a multi-view subject silhouette image from the multi-view live-action image (step S402). The input data generation unit 14 restores the three-dimensional shape information of the subject from the multi-view subject silhouette image based on the camera parameters of each viewpoint (step S403). The input data generation unit 14 outputs the three-dimensional shape information of the subject to the analysis unit 15 (step S404).

図１１は、解析部１５の動作の例を示すフローチャートである。解析部１５は、学習済みディープニューラルネットワークを用いて、被写体の三次元の形状情報から被写体の三次元の関節情報を推定する（ステップＳ５０１）。解析部１５は、被写体の三次元の関節情報を、外部装置に出力及び記録する（ステップＳ５０２）。 FIG. 11 is a flowchart illustrating an example of the operation of the analysis unit 15. The analysis unit 15 estimates the three-dimensional joint information of the subject from the three-dimensional shape information of the subject using the learned deep neural network (step S501). The analysis unit 15 outputs and records the three-dimensional joint information of the subject to the external device (step S502).

As described above, the learning data generation unit 12, which serves as the learning data generation device of the first embodiment, includes the silhouette image rendering unit 121, the camera parameter output unit 122, the shape information restoration unit 123, and the joint information voxelization unit 125. The silhouette image rendering unit 121 acquires a three-dimensional model image for each viewpoint defined around the three-dimensional model. The silhouette image rendering unit 121 generates a multi-view model silhouette image for each viewpoint by performing rendering processing on the three-dimensional model image. The camera parameter output unit 122 acquires camera parameters for each viewpoint. The shape information restoration unit 123 restores the three-dimensional shape information of the three-dimensional model from the multi-view model silhouette images based on the camera parameters of each viewpoint. The joint information voxelization unit 125 generates the three-dimensional joint information of the three-dimensional model in the same voxel space as that of the three-dimensional shape information of the three-dimensional model. As a result, the learning data generation unit 12 can more easily generate a plurality of pieces of learning data that include three-dimensional joint information.

第１実施形態の学習データ生成装置としての学習データ生成部１２は、学習部１３を更に備えてもよい。学習部１３は、三次元の形状情報に応じて三次元の関節情報を出力するディープニューラルネットワークのパラメータを学習する。ディープニューラルネットワークの出力層は、三次元の関節情報によって表される関節の個数に応じた個数のチャネルを有する。学習部１３は、複数の三次元の形状情報をまとめることによって、形状情報ボクセルセットを生成する。学習部１３は、形状情報ボクセルセットをディープニューラルネットワークのパラメータの学習に用いるか否かを、予め定められた条件に基づいて判定する。 The learning data generation unit 12 as the learning data generation device of the first embodiment may further include a learning unit 13. The learning unit 13 learns parameters of a deep neural network that outputs three-dimensional joint information according to three-dimensional shape information. The output layer of the deep neural network has a number of channels corresponding to the number of joints represented by the three-dimensional joint information. The learning unit 13 generates a shape information voxel set by collecting a plurality of three-dimensional shape information. The learning unit 13 determines whether or not the shape information voxel set is used for learning the parameters of the deep neural network based on a predetermined condition.

第１実施形態の学習装置１１は、コンピュータグラフィックスの画像に基づいて学習データを生成する。これによって、第１実施形態の学習装置１１は、大量の学習データを容易に生成することが可能である。第１実施形態の学習装置１１は、大量の学習データを容易に拡張することができる。第１実施形態の学習装置１１は、関節情報が付与されている画像を取得するので、関節情報を画像から取得する手間を削減することが可能である。 The learning device 11 according to the first embodiment generates learning data based on a computer graphics image. Thereby, the learning device 11 according to the first embodiment can easily generate a large amount of learning data. The learning device 11 of the first embodiment can easily expand a large amount of learning data. Since the learning device 11 according to the first embodiment acquires an image to which joint information is given, it is possible to reduce the trouble of acquiring joint information from the image.

(Second Embodiment)
The second embodiment differs from the first embodiment in that the estimation device 10 stores various data related to machine learning. Only the differences from the first embodiment are described below.

FIG. 12 is a diagram illustrating an example of the configuration of the estimation system 1b. The estimation system 1b includes the estimation device 10. When the estimation device 10 executes only the estimation process for three-dimensional joint information in the estimation system 1b, the estimation system 1b need not include the learning device 11. That is, when the estimation device 10 executes only the estimation process in the estimation system 1b, the learning device 11 need not execute machine learning. When the learning device 11 executes machine learning in the estimation system 1a, the estimation system 1b includes the learning device 11.

学習装置１１は、学習データ生成部１２と、学習部１３とを備える。学習データ生成部１２は、学習装置１１とは異なる学習データ生成装置に備えられてもよい。この場合、学習データ生成装置によって生成された学習データは、ネットワークや記憶媒体などを介して学習装置１１に与えられてもよい。学習データ生成装置としての学習データ生成部１２は、学習部１３を更に備えてもよい。 The learning device 11 includes a learning data generation unit 12 and a learning unit 13. The learning data generation unit 12 may be provided in a learning data generation device that is different from the learning device 11. In this case, the learning data generated by the learning data generation device may be given to the learning device 11 via a network or a storage medium. The learning data generation unit 12 as a learning data generation device may further include a learning unit 13.

学習部１３は、学習データを用いて機械学習を実行する。学習部１３は、ディープニューラルネットワークに限らず、例えば、遺伝的プログラミングやクラスタリング等の機械学習を実行してもよい。学習部１３は、機械学習に関する各種データを、通信回線又は記録媒体等を介して推定装置１０に記録する。機械学習に関する各種データとは、例えば、アルゴリズムを表すデータ、学習済みディープニューラルネットワークの構成及びパラメータを表すデータである。 The learning unit 13 performs machine learning using the learning data. The learning unit 13 is not limited to the deep neural network, and may execute machine learning such as genetic programming or clustering, for example. The learning unit 13 records various data related to machine learning in the estimation device 10 via a communication line or a recording medium. The various types of data related to machine learning are, for example, data representing an algorithm, data representing a configuration and parameters of a learned deep neural network.

推定装置１０は、入力データ生成部１４と、解析部１５と、記憶部１６とを備える。記憶部１６は、学習部１３によって生成された機械学習に関する各種データを記憶する。解析部１５は、学習済みディープニューラルネットワークのパラメータを、記憶部１６から取得する。学習済みディープニューラルネットワークのパラメータは、形状情報ボクセルセットのうち、予め定められた条件を満たした形状情報ボクセルセットを用いた学習結果に基づくパラメータである。予め定められた条件とは、例えば、三次元モデルの形状部分の空間が形状情報ボクセルセットの全体の空間に対して所定割合以上であるという条件である。 The estimation device 10 includes an input data generation unit 14, an analysis unit 15, and a storage unit 16. The storage unit 16 stores various data related to machine learning generated by the learning unit 13. The analysis unit 15 acquires the learned deep neural network parameters from the storage unit 16. The learned deep neural network parameter is a parameter based on a learning result using a shape information voxel set that satisfies a predetermined condition among shape information voxel sets. The predetermined condition is, for example, a condition that the space of the shape portion of the three-dimensional model is a predetermined ratio or more with respect to the entire space of the shape information voxel set.

As described above, the estimation device 10 of the second embodiment includes the analysis unit 15. The analysis unit 15 estimates the three-dimensional joint information of the subject using a deep neural network that has been trained to output the three-dimensional joint information of a three-dimensional model in response to the three-dimensional shape information of the three-dimensional model generated by the learning device 11. As a result, the estimation device 10 of the second embodiment can accurately estimate the three-dimensional joint positions of the subject. In addition, the learning data generation unit 12, which serves as the learning data generation device of the second embodiment, can more easily generate a plurality of pieces of learning data that include three-dimensional joint information.

The embodiments of the present invention have been described in detail above with reference to the drawings. However, the specific configuration is not limited to these embodiments, and designs and the like that do not depart from the gist of the present invention are also included.

DESCRIPTION OF SYMBOLS: 1a, 1b ... estimation system, 10 ... estimation device, 11 ... learning device, 12 ... learning data generation unit, 13 ... learning unit, 14 ... input data generation unit, 15 ... analysis unit, 16 ... storage unit, 121 ... silhouette image rendering unit, 122 ... camera parameter output unit, 123 ... shape information restoration unit, 124 ... joint information output unit, 125 ... joint information voxelization unit, 131 ... voxel set generation unit, 132 ... voxel set determination unit, 133 ... network construction unit, 134 ... parameter learning unit, 141 ... silhouette generation unit, 142 ... shape information generation unit

Claims

A learning data generation device comprising:
a silhouette image rendering unit that acquires, for each viewpoint defined around a three-dimensional model of a subject having joints, a three-dimensional model image, which is a computer graphics image representing the three-dimensional model, and generates a silhouette image of the three-dimensional model for each viewpoint by performing rendering processing on the three-dimensional model image;
a camera parameter unit that acquires camera parameters for each viewpoint;
a shape information restoration unit that restores three-dimensional shape information of the three-dimensional model from the silhouette images of the viewpoints, based on the camera parameters of each viewpoint; and
a joint information voxelization unit that generates three-dimensional joint information of the three-dimensional model in the same voxel space as the voxel space of the three-dimensional shape information of the three-dimensional model.

The learning data generation device according to claim 1, further comprising a learning unit that learns parameters of a deep neural network that outputs the joint information according to the shape information,
wherein an output layer of the deep neural network has a number of channels corresponding to the number of joints represented by the joint information.

An estimation device comprising an analysis unit that estimates three-dimensional joint information of a subject related to the three-dimensional model by using a deep neural network trained to output the three-dimensional joint information of the three-dimensional model according to the three-dimensional shape information of the three-dimensional model generated by the learning data generation device according to claim 1 or 2, with three-dimensional shape information of the subject related to the three-dimensional model as the input of the deep neural network.

The learning data generation device according to claim 1, further comprising a learning unit that learns parameters of a deep neural network that outputs the joint information according to the shape information,
wherein the learning unit generates a set of shape information by collecting a plurality of pieces of the shape information, and determines, based on a predetermined condition, whether to use the set of shape information for learning the parameters of the deep neural network.

The estimation device according to claim 3, further comprising a storage unit that stores a configuration and parameters of a deep neural network that outputs the joint information according to the shape information,
wherein the parameters of the deep neural network are based on a learning result obtained using, among sets of shape information each generated by collecting a plurality of pieces of the shape information, the sets of shape information that satisfy a predetermined condition.

An estimation method executed by an estimation device, the estimation method comprising
a step of estimating three-dimensional joint information of a subject related to the three-dimensional model by using a deep neural network trained to output the three-dimensional joint information of the three-dimensional model according to the three-dimensional shape information of the three-dimensional model generated by the learning data generation device according to claim 1 or 2.

A computer program for causing a computer to function as the learning data generation device according to claim 1, claim 2 or claim 4.

A computer program for causing a computer to function as the estimation device according to claim 3 or 5.