RU2713695C1 - Textured neural avatars

Info

Publication number: RU2713695C1
Application number: RU2019104985A
Authority: RU
Inventors: Александра Петровна ШИШЕЯ; Егор Олегович ЗАХАРОВ; Игорь Игоревич ПАСЕЧНИК; Егор Андреевич БУРКОВ; Карим Жамалович ИСКАКОВ; Юрий Андреевич Мальков; Александр Тимурович ВАХИТОВ; Кара-Али Алибулатович АЛИЕВ; Алексей Александрович Ивахненко; Ренат Маратович Баширов; Дмитрий Владимирович Ульянов; Виктор Сергеевич Лемпицкий
Original assignee: Samsung Electronics Co., Ltd.
Priority date: 2019-02-21
Filing date: 2019-02-21
Publication date: 2020-02-06

Abstract

FIELD: computer engineering.

SUBSTANCE: the invention relates to computer engineering. A method for synthesizing a two-dimensional image of a person comprises: receiving three-dimensional coordinates of body joint positions given in a camera coordinate system, wherein the three-dimensional joint coordinates define the pose of the person and the viewpoint of the two-dimensional image; predicting, with a trained machine learning predictor, a stack of body part assignment maps and a stack of body part coordinate maps from the three-dimensional joint coordinates, wherein the stack of body part coordinate maps defines the coordinates of the pixels of the body parts and the stack of body part assignment maps defines weights, each weight indicating the probability that a particular pixel belongs to a particular body part, and wherein the predictor has been trained on a plurality of different poses and different viewpoints; retrieving from memory a previously initialized stack of texture maps for the body parts; and reconstructing the two-dimensional image of the person as a weighted combination of pixel values using the stack of body part assignment maps, the stack of body part coordinate maps, and the stack of texture maps.

EFFECT: the technical result is the ability to obtain two-dimensional images of the entire human body in different poses and from different viewpoints using artificial intelligence.

6 cl, 4 dwg

Description

FIELD OF THE INVENTION

The present invention relates generally to computer vision and computer graphics for producing images of the full human body in varying poses and under varying camera positions and, in particular, to a system and method for synthesizing a two-dimensional image of a person.

Description of the Related Art

One of the main tasks of computer vision and computer graphics is capturing and rendering the human body in all its complexity under varying poses and imaging conditions. Recently, interest has grown in deep convolutional networks (ConvNets) as an alternative to classical graphics pipelines. Realistic neural rendering of body fragments such as faces [48, 34, 29], eyes [18], and hands [37] has become possible. Recent works have demonstrated that such networks can generate views of a person with a changing pose, but only with a fixed camera position and tight-fitting clothing [1, 9, 33, 53].

The present invention lies at the intersection of several research areas and is closely related to a very large body of previous work; some of these connections are discussed below.

Geometric modeling of the human body. Creating full-body avatars from image data has long been one of the main research topics in computer vision. Traditionally, an avatar is defined by a three-dimensional geometric mesh in a certain neutral pose, a texture, and a skinning mechanism that transforms the mesh vertices according to pose changes. A large group of works is devoted to body modeling from 3D scanners [40], registered multi-view sequences [42], and depth and RGB-D sequences [5, 55]. At the other extreme are methods that fit skinned parametric body models to single images [45, 4, 23, 6, 27, 39]. Finally, research has begun on creating full-body avatars from monocular videos [3, 2]. As in the last group of works, the present invention creates an avatar from a publicly available video or set of monocular videos. The classical (computer graphics) approach to modeling human avatars requires explicit, physically plausible modeling of human skin, hair, and sclera, of clothing surface reflectance, and of motion under pose changes. Despite significant progress in reflectance modeling [56, 13, 30, 58] and improvements in skinning and dynamic surface modeling [46, 17, 35], the computer graphics approach still requires considerable manual effort from designers to achieve high realism and to get past the so-called "uncanny valley" effect [36], especially if real-time avatar rendering is required.
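To make the traditional mesh, texture, and skinning representation concrete, the following is a minimal sketch of linear blend skinning, one common skinning scheme. The source does not name a specific skinning mechanism, so the choice of scheme, the function name `linear_blend_skinning`, and the array layouts are assumptions made purely for illustration.

```python
import numpy as np

def linear_blend_skinning(rest_vertices, skin_weights, bone_transforms):
    """Pose a mesh with linear blend skinning (illustrative assumption).

    rest_vertices:   (V, 3) vertex positions of the mesh in the neutral pose
    skin_weights:    (V, B) per-vertex bone weights, each row summing to 1
    bone_transforms: (B, 4, 4) homogeneous transform of each bone for the target pose
    """
    num_vertices = rest_vertices.shape[0]
    # Homogeneous coordinates: (V, 4)
    homogeneous = np.concatenate([rest_vertices, np.ones((num_vertices, 1))], axis=1)
    # Position of every vertex as moved by every bone: (B, V, 4)
    per_bone = np.einsum("bij,vj->bvi", bone_transforms, homogeneous)
    # Blend the per-bone positions with the skinning weights: (V, 4)
    blended = np.einsum("vb,bvi->vi", skin_weights, per_bone)
    return blended[:, :3]
```

Each posed vertex is a weighted average of where the individual bones would carry it, which is why the weights, the mesh, and the bone transforms must all be modeled explicitly in this kind of pipeline.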

Neural modeling of the human body. Image synthesis with deep convolutional neural networks is a rapidly developing research area [20, 15], and much recent effort has been directed at synthesizing realistic human faces [28, 11, 47]. Unlike traditional computer graphics representations, deep ConvNets model data by fitting an excessive number of learnable weights to the training data. Such ConvNets avoid explicit modeling of surface geometry, surface reflectance, or surface motion under pose changes and therefore do not suffer from a lack of realism in the corresponding components. On the other hand, the absence of built-in geometric or photometric models in this approach means that generalization to new poses and, in particular, to new camera viewpoints can be problematic. Over the past few years, considerable progress has been made in neural modeling of personalized "talking head" models [48, 29, 34], hair [54], and hands [37]. In the last several months, several groups have presented results of neural modeling of the full body [1, 9, 53, 33]. Although the presented results are quite impressive, they still restrict the training and test images to the same camera view, which, in the authors' experience, considerably simplifies the task compared to modeling body appearance from an arbitrary viewpoint. The aim of the present invention is to extend the neural body modeling approach to the latter, harder task.

Models with neural warping. A number of recent works warp a photograph of a person into a new photorealistic image with a modified gaze direction [18], a modified facial expression or pose [7, 50, 57, 43], or a modified body pose [50, 44, 38], where the warping field is estimated by a deep convolutional network (while the original photograph serves as a special kind of texture). However, these methods are limited in realism and/or in the amount of change they can model, because they rely on a single photograph of the person as input. The present invention likewise separates texture from surface geometry and motion modeling, but it is trained on video, which makes it possible to handle a harder problem (the full-body multi-view setting) and to achieve higher realism.

DensePose and related methods. The proposed system is based on a body surface parameterization (UV parameterization) similar to the one used in the classical graphics-based representation. Part of the proposed system performs a mapping from the body pose to the surface parameters (UV coordinates) of the image pixels. This brings the present invention close to the DensePose method [21] and to earlier works [49, 22] that predict UV coordinates of image pixels from an input photograph. In addition, the present invention uses DensePose results for pretraining.
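As an illustration of how per-pixel UV coordinates can be combined with per-part textures, the sketch below reconstructs an image as a weighted combination of texture samples, in the spirit of the method summarized in the abstract. It is a hedged sketch, not the patented implementation: the function names `bilinear_sample` and `render_avatar`, the array shapes, the normalization of UV coordinates to [0, 1], and the use of bilinear interpolation are all assumptions.

```python
import numpy as np

def bilinear_sample(texture, u, v):
    """Sample a texture of shape (Ht, Wt, 3) at UV coordinates assumed to lie in [0, 1].

    u and v have shape (H, W); the result has shape (H, W, 3).
    """
    tex_h, tex_w = texture.shape[:2]
    x = np.clip(u, 0.0, 1.0) * (tex_w - 1)
    y = np.clip(v, 0.0, 1.0) * (tex_h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, tex_w - 1)
    y1 = np.minimum(y0 + 1, tex_h - 1)
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]
    top = (1 - wx) * texture[y0, x0] + wx * texture[y0, x1]
    bottom = (1 - wx) * texture[y1, x0] + wx * texture[y1, x1]
    return (1 - wy) * top + wy * bottom

def render_avatar(assignments, coords, textures):
    """Weighted combination of per-part texture samples (illustrative assumption).

    assignments: (K, H, W) weights, i.e. per-pixel probability of each body part
    coords:      (K, H, W, 2) UV coordinates of each pixel on each body part
    textures:    (K, Ht, Wt, 3) one texture map per body part
    """
    num_parts, height, width = assignments.shape
    image = np.zeros((height, width, 3))
    for part in range(num_parts):
        sampled = bilinear_sample(textures[part], coords[part, ..., 0], coords[part, ..., 1])
        image += assignments[part][..., None] * sampled
    return image
```

In this sketch the assignment and coordinate stacks would come from the trained predictor, while the texture stack is the stored per-person appearance; because every step is a sum of products, such a rendering step is differentiable with respect to both the maps and the textures.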

Texture-based representation of multi-view data. The proposed system is also related to methods that extract textures from multi-view image collections [31, 19], multi-view video collections [52], or a single video [41]. The proposed method is also related to free-viewpoint video compression and rendering systems, e.g., [52, 8, 12, 16]. Unlike these works, the proposed method is restricted to scenes containing a single person. At the same time, the proposed method aims to generalize not only to new camera views but also to new user poses that are absent from the training videos. Within this group, the work closest to the invention is [59], which warps individual frames of a multi-view video dataset according to a target pose to generate new sequences. However, it can handle only poses that have a close match in the training set, which is a strong limitation given the combinatorial nature of the human pose configuration space.

REFERENCES:

[1] K. Aberman, M. Shi, J. Liao, D. Lischinski, B. Chen, and D. Cohen-Or. Deep video-based performance cloning. arXiv preprint arXiv:1808.06847, 2018.

[2] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Detailed human avatars from monocular video. In 2018 International Conference on 3D Vision (3DV), pages 98-109. IEEE, 2018.

[3] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Video based reconstruction of 3d people models. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[4] A. O. Bălan and M. J. Black. The naked truth: Estimating body shape under clothing. In European Conference on Computer Vision, pages 15-29. Springer, 2008.

[5] F. Bogo, M. J. Black, M. Loper, and J. Romero. Detailed full-body reconstructions of moving people from monocular rgb-d sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 2300-2308, 2015.

[6] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In European Conference on Computer Vision, pages 561-578. Springer, 2016.

[7] J. Cao, Y. Hu, H. Zhang, R. He, and Z. Sun. Learning a high fidelity pose invariant model for high-resolution face frontalization. arXiv preprint arXiv:1806.08472, 2018.

[8] D. Casas, M. Volino, J. Collomosse, and A. Hilton. 4d video textures for interactive character appearance. In Computer Graphics Forum, volume 33, pages 371-380. Wiley Online Library, 2014.

[9] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros. Everybody dance now. arXiv preprint arXiv:1808.07371, 2018.

[10] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 1520-1529, 2017.

[11] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multidomain image-to-image translation. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[12] A. Collet, M. Chuang, P. Sweeney, D. Gillett, D. Evseev, D. Calabrese, H. Hoppe, A. Kirk, and S. Sullivan. High-quality streamable free-viewpoint video. ACM Transactions on Graphics (TOG), 34(4):69, 2015.

[13] C. Donner, T. Weyrich, E. d'Eon, R. Ramamoorthi, and S. Rusinkiewicz. A layered, heterogeneous reflectance model for acquiring and rendering human skin. In ACM Transactions on Graphics (TOG), volume 27, page 140. ACM, 2008.

[14] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In Proc. NIPS, pages 658-666, 2016.

[15] A. Dosovitskiy, J. Tobias Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1538-1546, 2015.

[16] M. Dou, P. Davidson, S. R. Fanello, S. Khamis, A. Kowdle, C. Rhemann, V. Tankovich, and S. Izadi. Motion2fusion: real-time volumetric performance capture. ACM Transactions on Graphics (TOG), 36(6):246, 2017.

[17] A. Feng, D. Casas, and A. Shapiro. Avatar reshaping and automatic rigging using a deformable model. In Proceedings of the 8th ACM SIGGRAPH Conference on Motion in Games, pages 57-64. ACM, 2015.

[18] Y. Ganin, D. Kononenko, D. Sungatullina, and V. Lempitsky. Deepwarp: Photorealistic image resynthesis for gaze manipulation. In European Conference on Computer Vision, pages 311-326. Springer, 2016.

[19] B. Goldlücke and D. Cremers. Superresolution texture maps for multiview reconstruction. In IEEE 12th International Conference on Computer Vision, ICCV 2009, Kyoto, Japan, September 27 - October 4, 2009, pages 1677-1684, 2009.

[20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672-2680, 2014.

[21] R.A. Güler, N. Neverova, and I. Kokkinos. DensePose: Dense human pose estimation in the wild. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[22] R.A. Güler, G. Trigeorgis, E. Antonakos, P. Snape, S. Zafeiriou, and I. Kokkinos. DenseReg: Fully convolutional dense shape regression in-the-wild. In CVPR, volume 2, page 5, 2017.

[23] N. Hasler, H. Ackermann, B. Rosenhahn, T. Thormählen, and H.-P. Seidel. Multilinear pose and body shape estimation of dressed subjects from image sets. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1823-1830. IEEE, 2010.

[24] P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proc. CVPR, pages 5967-5976, 2017.

[25] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In Proc. NIPS, pages 2017-2025, 2015.

[26] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Proc. ECCV, pages 694-711, 2016.

[27] A. Kanazawa, M.J. Black, D.W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[28] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.

[29] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Nießner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt. Deep video portraits. arXiv preprint arXiv:1805.11714, 2018.

[30] O. Klehm, F. Rousselle, M. Papas, D. Bradley, C. Hery, B. Bickel, W. Jarosz, and T. Beeler. Recent advances in facial appearance capture. In Computer Graphics Forum, volume 34, pages 709-733. Wiley Online Library, 2015.

[31] V.S. Lempitsky and D.V. Ivanov. Seamless mosaicing of image-based texture maps. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), 18-23 June 2007, Minneapolis, Minnesota, USA, 2007.

[32] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C.L. Zitnick. Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740-755. Springer, 2014.

[33] L. Liu, W. Xu, M. Zollhoefer, H. Kim, F. Bernard, M. Habermann, W. Wang, and C. Theobalt. Neural animation and reenactment of human actor videos. arXiv preprint arXiv:1809.03658, 2018.

[34] S. Lombardi, J. Saragih, T. Simon, and Y. Sheikh. Deep appearance models for face rendering. ACM Transactions on Graphics (TOG), 37(4):68, 2018.

[35] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. Smpl: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248, 2015.

[36] M. Mori. The uncanny valley. Energy, 7(4):33-35, 1970.

[37] F. Mueller, F. Bernard, O. Sotnychenko, D. Mehta, S. Sridhar, D. Casas, and C. Theobalt. Ganerated hands for realtime 3d hand tracking from monocular rgb. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[38] N. Neverova, R.A. Güler, and I. Kokkinos. Dense pose transfer. In the European Conference on Computer Vision (ECCV), September 2018.

[39] G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis. Learning to estimate 3d human pose and shape from a single color image. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[40] G. Pons-Moll, J. Romero, N. Mahmood, and M.J. Black. Dyna: A model of dynamic human shape in motion. ACM Transactions on Graphics (TOG), 34(4):120, 2015.

[41] A. Rav-Acha, P. Kohli, C. Rother, and A.W. Fitzgibbon. Unwrap mosaics: a new representation for video editing. ACM Trans. Graph., 27(3):17:1-17:11, 2008.

[42] N. Robertini, D. Casas, E. De Aguiar, and C. Theobalt. Multi-view performance capture of surface details. International Journal of Computer Vision, 124(1):96-113, 2017.

[43] Z. Shu, M. Sahasrabudhe, R. Alp Güler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. In the European Conference on Computer Vision (ECCV), September 2018.

[44] A. Siarohin, E. Sangineto, S. Lathuilière, and N. Sebe. Deformable gans for pose-based human image generation. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[45] J. Starck and A. Hilton. Model-based multiple view reconstruction of people. In IEEE International Conference on Computer Vision (ICCV), pages 915-922, 2003.

[46] I. Stavness, C.A. Sánchez, J. Lloyd, A. Ho, J. Wang, S. Fels, and D. Huang. Unified skinning of rigid and deformable models for anatomical simulations. In SIGGRAPH Asia 2014 Technical Briefs, page 9. ACM, 2014.

[47] D. Sungatullina, E. Zakharov, D. Ulyanov, and V. Lempitsky. Image manipulation with perceptual discriminators. In the European Conference on Computer Vision (ECCV), September 2018.

[48] S. Suwajanakorn, S.M. Seitz, and I. Kemelmacher-Shlizerman. Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4):95, 2017.

[49] J. Taylor, J. Shotton, T. Sharp, and A. Fitzgibbon. The Vitruvian manifold: Inferring dense correspondences for one shot human pose estimation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 103-110. IEEE, 2012.

[50] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.[50] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[51] D. Ulyanov, V. Lebedev, A. Vedaldi, and V.S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In Proc. ICML, pages 1349-1357, 2016.[51] D. Ulyanov, V. Lebedev, A. Vedaldi, and V.S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In Proc. ICML, pages 1349-1357, 2016.

[52] M. Volino, D. Casas, J.P. Collomosse, and A. Hilton. Optimal representation of multi-view video. In Proc. BMVC, 2014.[52] M. Volino, D. Casas, J.P. Collomosse, and A. Hilton. Optimal representation of multi-view video. In Proc. BMVC, 2014.

[53] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. arXiv preprint arXiv:1808.06601, 2018.[53] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. arXiv preprint arXiv: 1808.06601, 2018.

[54] L. Wei, L. Hu, V. Kim, E. Yumer, and H. Li. Real-time hair rendering using sequential adversarial networks. In the European Conference on Computer Vision (ECCV), September 2018.[54] L. Wei, L. Hu, V. Kim, E. Yumer, and H. Li. Real-time hair rendering using sequential adversarial networks. In the European Conference on Computer Vision (ECCV), September 2018.

[55] A. Weiss, D. Hirshberg, and M. J. Black. Home 3d body scans from noisy image and range data. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1951-1958. IEEE, 2011.[55] A. Weiss, D. Hirshberg, and M. J. Black. Home 3d body scans from noisy image and range data. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1951-1958. IEEE, 2011.

[56] T. Weyrich, W. Matusik, H. Pfister, B. Bickel, C. Donner, C. Tu, J. McAndless, J. Lee, A. Ngan, H.W. Jensen, et al. Analysis of human faces using a measurement-based skin reflectance model. In ACM Transactions on Graphics (TOG), volume 25, pages 1013-1024. ACM, 2006.[56] T. Weyrich, W. Matusik, H. Pfister, B. Bickel, C. Donner, C. Tu, J. McAndless, J. Lee, A. Ngan, H.W. Jensen, et al. Analysis of human faces using a measurement-based skin reflectance model. In ACM Transactions on Graphics (TOG), volume 25, pages 1013-1024. ACM, 2006.

[57] O. Wiles, A. Sophia Koepke, and A. Zisserman. X2face: A network for controlling face generation using images, audio, and pose codes. In the European Conference on Computer Vision (ECCV), September 2018.[57] O. Wiles, A. Sophia Koepke, and A. Zisserman. X2face: A network for controlling face generation using images, audio, and pose codes. In the European Conference on Computer Vision (ECCV), September 2018.

[58] E. Wood, T. Baltrusaitis, X. Zhang, Y. Sugano, P. Robinson, and A. Bulling. Rendering of eyes for eye-shape registration and gaze estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3756-3764, 2015.[58] E. Wood, T. Baltrusaitis, X. Zhang, Y. Sugano, P. Robinson, and A. Bulling. Rendering of eyes for eye-shape registration and gaze estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3756-3764, 2015.

[59] F. Xu, Y. Liu, C. Stoll, J. Tompkin, G. Bharaj, Q. Dai, H.-P. Seidel, J. Kautz, and C. Theobalt. Video-based characters: creating new human performances from a multi-view video database. ACM Transactions on Graphics (TOG), 30(4):32, 2011.[59] F. Xu, Y. Liu, C. Stoll, J. Tompkin, G. Bharaj, Q. Dai, H.-P. Seidel, J. Kautz, and C. Theobalt. Video-based characters: creating new human performances from a multi-view video database. ACM Transactions on Graphics (TOG), 30 (4): 32, 2011.

SUMMARY OF THE INVENTION

The aim of the present invention is to provide a system and a method for synthesizing a two-dimensional image of a person.

The present invention provides the following advantages:

- better generalization compared to systems that use direct convolutional mappings between joint positions and pixel values;

- highly realistic renderings;

- increased realism of the generated images;

- faster training compared to the direct mapping approach.

According to one aspect of the present invention, there is provided a method for synthesizing a two-dimensional image of a person, the method comprising: receiving (S101) three-dimensional coordinates of body joint positions of the person, defined in a camera coordinate system, wherein the three-dimensional coordinates of the body joint positions define the pose of the person and the viewpoint of the two-dimensional image; predicting (S102), using a trained machine learning predictor, a stack of body part assignment maps and a stack of body part coordinate maps based on the three-dimensional coordinates of the body joint positions, wherein the stack of body part coordinate maps defines texture coordinates of the pixels of the body parts of the person, and the stack of body part assignment maps defines weights, each weight indicating the probability that a particular pixel belongs to a particular body part of the person; retrieving (S103) from memory a previously initialized stack of texture maps for the body parts of the person, wherein the stack of texture maps contains pixel values of the body parts of the person; and reconstructing (S104) the two-dimensional image of the person as a weighted combination of pixel values, using the stack of body part assignment maps, the stack of body part coordinate maps and the stack of texture maps.
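
The reconstruction step (S104) can be illustrated with a minimal NumPy sketch. It assumes texture coordinates normalized to [0, 1], RGB textures, and nearest-neighbour texture lookup; none of these choices is stated in the text above, and the function name `reconstruct_image` and the array layouts are hypothetical. A trainable system would need a differentiable sampler in place of the nearest-neighbour lookup.

```python
import numpy as np

def reconstruct_image(assignments, coords, textures):
    """Blend per-part texture samples into one RGB image.

    assignments: (n, H, W) weights, one map per body part
    coords:      (n, H, W, 2) texture coordinates (u, v), assumed to lie in [0, 1]
    textures:    (n, h, w, 3) one RGB texture map per body part
    """
    n, height, width = assignments.shape
    tex_h, tex_w = textures.shape[1:3]
    image = np.zeros((height, width, 3))
    for j in range(n):
        # Nearest-neighbour lookup (an assumption made for brevity).
        u = np.clip(np.rint(coords[j, ..., 0] * (tex_w - 1)), 0, tex_w - 1).astype(int)
        v = np.clip(np.rint(coords[j, ..., 1] * (tex_h - 1)), 0, tex_h - 1).astype(int)
        sampled = textures[j][v, u]                      # (H, W, 3) colours of part j
        image += assignments[j][..., None] * sampled     # weighted combination
    return image

# Toy data: part 0 has a pure red texture, part 1 a pure blue one.
textures = np.zeros((2, 4, 4, 3))
textures[0, ..., 0] = 1.0
textures[1, ..., 2] = 1.0
assignments = np.stack([np.full((2, 3), 0.25), np.full((2, 3), 0.75)])
coords = np.random.default_rng(0).random((2, 2, 3, 2))
image = reconstruct_image(assignments, coords, textures)
print(image[0, 0])  # every pixel mixes 25% red with 75% blue
```

Because both toy textures are constant, the result does not depend on the random coordinates, which makes the weighting by the assignment maps easy to check by eye.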

In a further aspect, obtaining the trained machine learning predictor and the previously initialized stack of texture maps for the body parts of the person comprises: receiving (S201) a plurality of images of the person in different poses and from different viewpoints; obtaining (S202) three-dimensional coordinates of body joint positions of the person, defined in the camera coordinate system, for each image of the received plurality of images; initializing (S203) the machine learning predictor based on the three-dimensional coordinates of the body joint positions and the received plurality of images, to obtain parameters for predicting the stack of body part assignment maps and the stack of body part coordinate maps; initializing (S204) the stack of texture maps based on the three-dimensional coordinates of the body joint positions and the received plurality of images, and storing the stack of texture maps in memory; predicting (S205), using the current state of the machine learning predictor, the stack of body part assignment maps and the stack of body part coordinate maps based on the three-dimensional coordinates of the body joint positions; reconstructing (S206) a two-dimensional image of the person as a weighted combination of pixel values, using the stack of body part assignment maps, the stack of body part coordinate maps and the stack of texture maps stored in memory; comparing (S207) the reconstructed two-dimensional image with the corresponding ground-truth two-dimensional image from the received plurality of images to determine a reconstruction error of the two-dimensional image; updating (S208) the parameters of the trained machine learning predictor and the pixel values in the stack of texture maps based on the result of the comparison; and repeating steps S205-S208 to reconstruct different two-dimensional images of the person until a predetermined condition is met, the predetermined condition being at least one of: a predetermined number of repetitions has been performed, a predetermined time has elapsed, or the reconstruction error of the two-dimensional image of the person no longer decreases.
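
The three stopping conditions for repeating steps S205-S208 can be sketched as a plain Python loop. This is only an illustration: `train`, `step_fn`, `patience` and `min_delta` are hypothetical names and thresholds, and the toy step below fits a single scalar "texel" by gradient descent instead of updating a real predictor and texture stack.

```python
import time

def train(step_fn, max_iters=1000, max_seconds=60.0, patience=20, min_delta=1e-6):
    """Call step_fn() until one of three stopping conditions holds.

    step_fn stands in for one pass of S205-S208 and returns the reconstruction error.
    Returns (iterations_run, best_error, reason).
    """
    start = time.monotonic()
    best = float("inf")
    stale = 0                                  # iterations without a real improvement
    for it in range(1, max_iters + 1):
        err = step_fn()
        if err < best - min_delta:
            best, stale = err, 0
        else:
            stale += 1
        if stale >= patience:                  # error no longer decreases
            return it, best, "no_improvement"
        if time.monotonic() - start >= max_seconds:
            return it, best, "time_limit"      # predetermined time elapsed
    return max_iters, best, "max_iterations"   # predetermined number of repetitions

# Toy stand-in for S205-S208: move one texel value towards a target value.
state = {"texel": 0.0}
target = 0.8

def toy_step(lr=0.25):
    diff = state["texel"] - target             # S207: compare with the ground truth
    state["texel"] -= lr * 2.0 * diff          # S208: gradient step on the squared error
    return diff * diff

iters, best_err, reason = train(toy_step, max_iters=500, patience=5)
print(iters, best_err, reason)
```

The `patience` window is one possible way to decide that the error has stopped decreasing; the text above does not prescribe how that condition is detected.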

In another further aspect, the machine learning predictor is one of: a deep neural network, a deep convolutional neural network, a deep fully convolutional neural network, a deep neural network trained with a perceptual loss function, or a deep neural network trained in a generative adversarial framework.

In yet another further aspect, the method further comprises: generating a stack of rasterized segment maps based on the three-dimensional coordinates of the body joint positions, wherein each map of the stack of rasterized segment maps contains a rasterized segment representing a body part of the person, and wherein the prediction of the stack of body part assignment maps and the stack of body part coordinate maps is based on the stack of rasterized segment maps.
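
One possible way to build such a stack of rasterized segment maps is sketched below. The pinhole projection with focal length `focal` and a principal point at the image centre, the binary line drawing, and the three-joint skeleton are all assumptions made for illustration; the text above does not specify the camera model or how the segments are drawn.

```python
import numpy as np

def rasterize_segments(joints_cam, bones, size=64, focal=64.0):
    """Return a (len(bones), size, size) stack; map k contains the segment of bone k.

    joints_cam: (J, 3) joint positions in camera coordinates, with z > 0
    bones:      list of (a, b) joint-index pairs, one per body segment (hypothetical)
    """
    # Pinhole projection, principal point at the image centre (assumed intrinsics).
    xy = focal * joints_cam[:, :2] / joints_cam[:, 2:3] + size / 2.0
    stack = np.zeros((len(bones), size, size), dtype=np.float32)
    for k, (a, b) in enumerate(bones):
        # Sample densely enough that consecutive points are at most one pixel apart.
        steps = int(np.ceil(np.abs(xy[b] - xy[a]).max())) + 1
        t = np.linspace(0.0, 1.0, steps)[:, None]
        pts = np.rint(xy[a] + t * (xy[b] - xy[a])).astype(int)
        inside = np.all((pts >= 0) & (pts < size), axis=1)
        stack[k, pts[inside, 1], pts[inside, 0]] = 1.0   # index as [row=y, column=x]
    return stack

# Three joints forming two segments: one vertical and one horizontal in the image.
joints = np.array([[0.0, 0.0, 2.0], [0.0, 0.5, 2.0], [0.5, 0.5, 2.0]])
bones = [(0, 1), (1, 2)]
stack = rasterize_segments(joints, bones)
print(stack.shape, stack[0].sum(), stack[1].sum())
```

Each bone gets its own channel, so a network consuming the stack can tell the segments apart even where they overlap in the image.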

In another further aspect, the trained machine learning predictor is retrained for another person based on a plurality of images of the other person.

According to another aspect of the present invention, there is provided a system for synthesizing a two-dimensional image of a person, comprising a processor and a memory containing instructions that cause the processor to perform the method for synthesizing a two-dimensional image of a person.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other aspects, essential features and advantages of the present invention will be more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:

FIG. 1 shows results of the textured neural avatar for viewpoints unseen during training;

FIG. 2 is a general view of the textured neural avatar system;

FIG. 3 is a flowchart illustrating one embodiment of the method for synthesizing a two-dimensional image of a person;

FIG. 4 is a flowchart illustrating the process of obtaining the trained machine learning predictor and the stack of texture maps for the body parts of a person.

In the following description, the same reference signs are used for the same elements shown in different drawings, unless otherwise indicated, and their description is not repeated.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

The following description, made with reference to the accompanying drawings, is provided to assist in a comprehensive understanding of various embodiments of the present invention as defined by the claims and their equivalents. It includes various specific details to assist in that understanding, but these are to be regarded as merely exemplary. Accordingly, those skilled in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. In addition, descriptions of well-known functions and constructions may be omitted for clarity and conciseness.

The terms and words used in the following description and claims are not limited to their bibliographical meanings, but are merely used by the author to enable a clear and consistent understanding of the present disclosure. Accordingly, those skilled in the art will understand that the following description of various embodiments of the invention is provided for illustration purposes only.

It should be understood that elements referred to in the singular may be represented by a plurality of elements, unless the context clearly indicates otherwise.

Although the terms "first", "second", etc. may be used with respect to elements of the present invention, such elements should not be construed as being limited by these terms. These terms are used only to distinguish one element from other elements.

In addition, the terms "comprises", "comprising", "includes" and/or "including", as used herein, indicate the presence of the stated features, integers, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, operations, elements, components and/or groups thereof.

In various embodiments of the present invention, a "module" or "unit" may perform at least one function or operation and may be implemented in hardware, software, or a combination thereof. A plurality of "modules" or a plurality of "units" may be integrated into at least one module and implemented by at least one processor (not shown), except for a "module" or "unit" that must be implemented with specific hardware.

A system for learning to create full-body neural avatars is proposed. The system trains a deep network to produce full-body renderings of a person for varying body poses and camera positions. The deep network may be any of: a deep neural network, a deep convolutional neural network, a deep fully convolutional neural network, a deep neural network trained with a perceptual loss function, or a deep neural network trained in a generative adversarial framework. During training, the system explicitly estimates a two-dimensional texture describing the appearance of the body surface. While retaining explicit texture estimation, the system bypasses explicit estimation of the three-dimensional geometry of the skin (surface) at any time. Instead, at test time the system directly maps the configuration of body feature points relative to the camera to the two-dimensional texture coordinates of individual pixels in the image frame. The system is capable of learning to generate highly realistic renderings while being trained on monocular videos. Maintaining an explicit texture representation within the architecture of the present invention helps it achieve better generalization than systems that use direct convolutional mappings between joint positions and pixel values.

The present invention demonstrates how modern deep networks can be used to synthesize free-viewpoint full-body videos of a person. The constructed neural avatar, which is controlled by the three-dimensional positions of human joints, can synthesize an image for an arbitrary camera and is trained on a set of publicly available monocular videos (or even a single long video).

For the purposes of the present invention, a full-body avatar is a system capable of rendering views of a particular person in a varying pose, defined by a set of three-dimensional positions of body joints, and under varying camera positions (FIG. 1). FIG. 1 shows results of the textured neural avatar (without "video-to-video" post-processing) for different viewpoints during training. Reference numerals 1-6 denote different camera viewpoints and the images obtained from viewpoints 1-6. In the bottom row of FIG. 1, the images on the left are obtained by processing the pose input shown on the right. Body joint positions, rather than joint angles, are used as input because such positions are easier to estimate from data using marker-based or markerless motion capture systems. To build a classical ("non-neural") avatar following the standard computer graphics pipeline, a personalized body mesh of the user in a neutral pose is taken, joint angles are estimated from the joint positions, and skinning (deformation of the neutral pose) is performed, thereby estimating the three-dimensional geometry of the body. Texture mapping is then applied using a precomputed two-dimensional texture. Finally, the resulting textured model is lit using a specific lighting model and then projected onto the camera view. Thus, creating an avatar of a person in the classical pipeline requires personalizing the skinning process, which is responsible for the geometry, and the texture, which is responsible for the appearance.
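
For readers unfamiliar with the skinning stage of the classical pipeline, the following sketch shows linear blend skinning, one widely used skinning technique. The text above does not name a specific skinning method, so this choice, the function name and the array layouts are assumptions made purely for illustration.

```python
import numpy as np

def linear_blend_skinning(vertices, weights, rotations, translations):
    """Deform neutral-pose vertices with a weighted blend of rigid bone transforms.

    vertices:     (V, 3) mesh vertices in the neutral pose
    weights:      (V, K) per-vertex bone weights, each row summing to 1
    rotations:    (K, 3, 3) rotation of each bone
    translations: (K, 3) translation of each bone
    """
    # Apply every bone transform to every vertex: result has shape (K, V, 3).
    per_bone = np.einsum("kij,vj->kvi", rotations, vertices) + translations[:, None, :]
    # Blend the K candidate positions of each vertex with its weights.
    return np.einsum("vk,kvi->vi", weights, per_bone)

# Two bones: bone 0 stays fixed, bone 1 rotates by 90 degrees about the z axis.
rot_z_90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
rotations = np.stack([np.eye(3), rot_z_90])
translations = np.zeros((2, 3))
vertices = np.array([[1.0, 0.0, 0.0]] * 3)
weights = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
posed = linear_blend_skinning(vertices, weights, rotations, translations)
print(posed)
```

The third vertex is attached equally to both bones and ends up halfway between the two rigidly transformed positions, which is the blending behaviour that has to be personalized per user in the classical pipeline.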

In emerging avatar systems based on deep networks (neural avatars), the authors seek to shortcut several stages of the classical pipeline and replace them with a single network that learns the mapping from the input (positions of body joints) to the output (a two-dimensional image). To simplify the learning task, the input representation can be augmented with additional images, for example the results of the classical pipeline in [29, 33] or the DensePose [21] representation in [53]. The large number of learnable parameters, the ability to learn from long videos, and the flexibility of deep networks allow neural avatars to model the appearance of details that are very challenging for the classical pipeline, such as hair, skin, complex clothing and glasses. In addition, the conceptual simplicity of such a "black box" approach is appealing. At the same time, although deep networks readily fit the training data, the lack of built-in model-specific invariances and the conflation of shape and appearance estimation limit the generalization ability of such systems. As a result, previous neural rendering methods were limited either to a part of the body (head and shoulders [29]) and/or to a specific camera view [1, 9, 53, 33].

The proposed system performs full-body rendering and combines ideas from classical computer graphics, namely the decoupling of geometry and texture, with the use of deep networks. In particular, like the classical pipeline, the system explicitly estimates the two-dimensional textures of body parts. The two-dimensional texture in the classical pipeline effectively transfers the appearance of body fragments across camera transformations and body rotations. Keeping this component within the neural pipeline therefore promotes generalization across such transformations. The role of the deep network in the proposed method is reduced to predicting the texture coordinates of individual pixels given the body pose and the camera parameters (FIG. 2). In addition, the deep network predicts a foreground/background mask.
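
One way to obtain per-pixel assignment weights together with a foreground/background mask from raw network outputs is a softmax over channels with an extra background channel. The sketch below assumes exactly that layout (n body-part channels followed by one background channel); the passage above does not state how the mask is parameterized, so both the layout and the function name are hypothetical.

```python
import numpy as np

def assignments_from_logits(logits):
    """Turn raw scores into per-pixel probabilities and a foreground mask.

    logits: (n + 1, H, W) scores; the last channel is assumed to be the background.
    Returns (probs, foreground_mask) with shapes (n + 1, H, W) and (H, W).
    """
    shifted = logits - logits.max(axis=0, keepdims=True)   # for numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=0, keepdims=True)              # softmax over channels
    foreground_mask = 1.0 - probs[-1]                      # probability of any body part
    return probs, foreground_mask

# Two body parts plus background on a 2x2 image; one pixel is clearly background.
logits = np.zeros((3, 2, 2))
logits[-1, 0, 0] = 20.0
probs, mask = assignments_from_logits(logits)
print(mask)
```

With this parameterization the weights at each pixel sum to one, so they can be read as the probabilities of the pixel belonging to each body part or to the background.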

A comparison of the textured neural avatar provided by the present invention with the direct video-to-video translation method [53] shows that explicit texture estimation provides additional generalization ability and significantly improves the realism of the images generated for new views. A significant advantage of the present invention is that the explicit separation of textures and geometry enables transfer learning scenarios, in which the deep network is retrained for a new person using a small amount of training data. Finally, the textured neural avatar significantly reduces training time compared with the direct translation method.

Methods

Notation. The subscript $i$ denotes objects related to the $i$-th training or test image. An upper-case letter, for example $B_i$, denotes a stack of maps (a third-order tensor, i.e. a three-dimensional array) corresponding to the $i$-th training or test image. A superscript denotes a specific map (channel) in the stack, for example $B_i^j$. Square brackets denote elements corresponding to a specific image position: for example, $B_i^j[x, y]$ denotes the scalar element in the $j$-th map of the stack $B_i$ located at position $(x, y)$, and $B_i[x, y]$ denotes the vector of elements corresponding to all maps, sampled at position $(x, y)$.

Input and output. In general, the goal is to synthesize images of a particular person given his or her pose. The pose for the $i$-th image is assumed to be given as three-dimensional joint positions defined in the camera coordinate system. The input to the deep network is then a map stack $B_i$, in which each map $B_i^j$ contains the rasterized $j$-th segment (bone) of the "stick figure" (skeleton) projected onto the camera plane. To retain information about the third coordinate of the joints, the depth value is linearly interpolated between the joints that define each segment, and the interpolated values are used to set the values in the map $B_i^j$ at the pixels covered by the bone (pixels not covered by the $j$-th bone are set to zero). Overall, the stack $B_i$ carries the information about the person and the camera position.
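The input encoding described above can be illustrated with a minimal sketch that rasterizes one bone into a single map, writing the linearly interpolated depth at the bone pixels and zero elsewhere. The pinhole projection with focal length `f` and principal point `(cx, cy)`, the one-pixel line thickness and the name `rasterize_bone` are assumptions made for this example; the description does not specify them.

```python
def rasterize_bone(joint_a, joint_b, width, height, f=1.0, cx=0.0, cy=0.0):
    """Return a height x width map holding interpolated depth along one bone.

    joint_a, joint_b: (X, Y, Z) joint positions in camera coordinates, Z > 0.
    Pixels not covered by the bone stay 0.0.
    """
    bone_map = [[0.0] * width for _ in range(height)]

    def project(joint):
        # Assumed pinhole camera model (not taken from the source).
        X, Y, Z = joint
        return f * X / Z + cx, f * Y / Z + cy

    (ua, va), (ub, vb) = project(joint_a), project(joint_b)
    # Enough samples to touch every pixel along the projected segment.
    steps = int(max(abs(ub - ua), abs(vb - va))) + 1
    for s in range(steps + 1):
        t = s / steps
        u = round(ua + t * (ub - ua))
        v = round(va + t * (vb - va))
        if 0 <= u < width and 0 <= v < height:
            # Linear interpolation of depth between the two joints.
            bone_map[v][u] = (1 - t) * joint_a[2] + t * joint_b[2]
    return bone_map
```

Calling such a function once per bone and stacking the resulting maps would give a multi-channel pose input of the kind described, with one bone per channel.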

As output, the deep network produces an RGB image (a three-channel stack) $I_i$ and a single-channel mask $M_i$ defining the pixels that are covered by the avatar. During training, it is assumed that the input joint positions and the "ground truth" foreground mask are estimated for each input frame $i$; three-dimensional body pose estimation and human semantic segmentation are used to extract them from the raw video frames. At test time, given a real or synthetic background image $\tilde{I}_i$, the final view $\hat{I}_i$ is generated by first predicting $M_i$ and $I_i$ from the body pose and then linearly blending the resulting avatar into the image: $\hat{I}_i = I_i \odot M_i + \tilde{I}_i \odot (1 - M_i)$ (where $\odot$ denotes the location-wise product, in which the RGB values at each location are multiplied by the mask value at that location).
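The blending step can be sketched as follows, assuming the standard linear (alpha) blend in which the output equals avatar × mask + background × (1 − mask) at every pixel. The nested-list image layout and the name `composite` are illustrative choices, not part of the description.

```python
def composite(avatar_rgb, mask, background_rgb):
    """Blend the avatar over a background, pixel by pixel.

    avatar_rgb, background_rgb: rows of (r, g, b) tuples of equal size.
    mask: rows of scalars in [0, 1]; 1 means the pixel is covered by the avatar.
    """
    out = []
    for row_i, row_m, row_b in zip(avatar_rgb, mask, background_rgb):
        out_row = []
        for pix_i, m, pix_b in zip(row_i, row_m, row_b):
            # Location-wise product: every RGB channel is scaled by the same mask value.
            out_row.append(tuple(c_i * m + c_b * (1.0 - m)
                                 for c_i, c_b in zip(pix_i, pix_b)))
        out.append(out_row)
    return out
```

A mask value of 1 keeps the avatar pixel, 0 keeps the background pixel, and intermediate values mix the two.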

Direct translation. The direct approach, which is considered as a baseline for the present invention, is to train a deep network as an image translation network that maps the map stack $B_i$ to the map stacks $I_i$ and $M_i$ (usually the two output stacks are produced in two branches that share the initial stage of processing [15]). In general, mappings between map stacks can be implemented with a deep network, for example a fully convolutional architecture. The exact architectures and losses for such networks are an area of active research [14, 51, 26, 24, 10]. The most recent works [1, 9, 53, 33] used direct translation (with various modifications) to synthesize the view of a person for a fixed camera. In particular, the video-to-video system [53] considers a translation network that generates the next video frame by taking the inputs for the last three frames and, in an autoregressive manner, the outputs of the system for the two preceding frames. In our case, the video-to-video system [53] is modified to produce the image and the mask:

$$[I_i, M_i] = h_\Psi(B_i, B_{i-1}, B_{i-2}, I_{i-1}, I_{i-2}, M_{i-1}, M_{i-2}), \quad (1)$$

Here, $h_\Psi$ is the video-to-video regression network with learnable parameters $\Psi$. It is also assumed that the training or test examples $i-1$ and $i-2$ correspond to the preceding frames. The resulting video-to-video translation system serves as a strong baseline for the present invention.

Textured neural avatar. The direct translation approach relies on the generalization ability of deep networks and introduces very little domain-specific knowledge into the system. As an alternative, the textured avatar method is used, in which the textures of body parts are explicitly estimated, thereby ensuring consistency of the body surface appearance across different poses and cameras. Following the DensePose approach [21], the body is divided into $n$ parts, each of which has a two-dimensional parameterization. It is thus assumed that, in an image of a person, each pixel belongs to one of the $n$ parts or to the background. In the former case, the pixel is further associated with two-dimensional part-specific coordinates. The $k$-th body part is also associated with a texture map $T^k$, which is estimated during training. The estimated textures are learned at training time and reused for all camera views and all poses.

Introducing the body surface parameterization described above changes the translation problem. For a given pose defined by $B_i$, the translation network must now predict the stack $P_i$ of body part assignments and the stack $C_i$ of body part coordinates, where $P_i$ contains $n+1$ maps of non-negative numbers that sum to one (that is, $\sum_{k=0}^{n} P_i^k[x, y] = 1$), and $C_i$ contains $2n$ maps of real numbers between 0 and $w$, where $w$ is the spatial size (width and height) of the texture maps $T^k$.
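One way to satisfy these constraints at a single pixel is sketched below: a softmax turns $n+1$ raw scores into non-negative values that sum to one, and a scaled sigmoid maps $2n$ raw scores into the range from 0 to $w$. The choice of softmax and sigmoid and the function names are assumptions made for illustration; the description only states the constraints, not how the network enforces them.

```python
import math

def part_assignments(scores):
    """Softmax over n+1 raw scores: non-negative values that sum to one."""
    m = max(scores)                       # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def _sigmoid(v):
    # Numerically stable logistic function.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def part_coordinates(scores, w):
    """Map 2n raw scores to texture coordinates in the range [0, w]."""
    return [w * _sigmoid(v) for v in scores]
```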

The map channel $P_i^k$ for $k = 0, \dots, n-1$ is then interpreted as the probability that the pixel belongs to the $k$-th body part, and the map channel $P_i^n$ corresponds to the probability of the background. The coordinate maps $C_i^{2k}$ and $C_i^{2k+1}$ correspond to the pixel coordinates on the $k$-th body part. Specifically, once the part assignments $P_i$ and the body part coordinates $C_i$ are predicted, the image $I_i$ at each pixel $(x, y)$ is reconstructed as a weighted combination of texture elements, where the weights and the texture coordinates are prescribed by the part assignment maps and the coordinate maps, respectively:

$$s(P_i, C_i, T)[x, y] = \sum_{k=0}^{n-1} P_i^k[x, y] \cdot T^k\big[C_i^{2k}[x, y],\, C_i^{2k+1}[x, y]\big], \quad (2)$$

where $s(\cdot, \cdot, \cdot)$ is the sampling function (layer) that outputs the RGB map stack given the three input arguments. In (2), the texture maps $T^k$ are sampled at the non-integer locations $(C_i^{2k}[x, y], C_i^{2k+1}[x, y])$ in a bilinear manner, so that $T^k[x, y]$ is computed as:

$$T^k[x, y] = (1-\alpha)(1-\beta)\, T^k[\lfloor x \rfloor, \lfloor y \rfloor] + (1-\alpha)\beta\, T^k[\lfloor x \rfloor, \lceil y \rceil] + \alpha(1-\beta)\, T^k[\lceil x \rceil, \lfloor y \rfloor] + \alpha\beta\, T^k[\lceil x \rceil, \lceil y \rceil], \quad (3)$$

for $\alpha = x - \lfloor x \rfloor$, $\beta = y - \lfloor y \rfloor$, as suggested in [25].
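The reconstruction described above, a probability-weighted sum over body parts of bilinearly sampled texture values, can be sketched for a single pixel as follows. The nested-list layout, the clamping of coordinates to the texture border and the names `bilinear` and `sample_pixel` are assumptions made for this example, not details taken from the description.

```python
import math

def bilinear(texture, x, y):
    """Sample a w x w texture (texture[x][y] is an RGB tuple) at a real-valued (x, y)."""
    w = len(texture)
    # Assumption: clamp so that all four neighbours stay inside the texture.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), w - 1.0)
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, w - 1)
    a, b = x - x0, y - y0
    return tuple(
        (1 - a) * (1 - b) * texture[x0][y0][c]
        + (1 - a) * b * texture[x0][y1][c]
        + a * (1 - b) * texture[x1][y0][c]
        + a * b * texture[x1][y1][c]
        for c in range(3)
    )

def sample_pixel(P, C, T, x, y):
    """Weighted combination of texture samples at image pixel (x, y).

    P: n+1 assignment maps, C: 2n coordinate maps, T: n part textures.
    """
    rgb = [0.0, 0.0, 0.0]
    for k in range(len(T)):            # the background channel P[n] is not sampled
        texel = bilinear(T[k], C[2 * k][x][y], C[2 * k + 1][x][y])
        for c in range(3):
            rgb[c] += P[k][x][y] * texel[c]
    return tuple(rgb)
```

Because every operation here is a sum or a product, gradients can flow both to the assignment and coordinate maps and to the texture values, which is what allows textures and network to be trained jointly.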

When training the textured neural avatar, a deep network $g_\phi$ with learnable parameters $\phi$ is trained to map the input map stacks $B_i$ to body part assignments and body part coordinates. Since $g_\phi$ has two branches ("heads"), $g_\phi^P$ denotes the branch that produces the stack of body part assignments, and $g_\phi^C$ denotes the branch that produces the body part coordinates. To learn the parameters of the textured neural avatar, the loss between the generated image and the ground truth image $\bar{I}_i$ is optimized:

$$\mathcal{L}_{\text{image}}(\phi, T) = d\big(\bar{I}_i,\, s(g_\phi^P(B_i),\, g_\phi^C(B_i),\, T)\big), \quad (4)$$

where $d(\cdot, \cdot)$ is a loss comparing two images (the exact choice is discussed below). During stochastic optimization, the gradient of the loss (4) is backpropagated through (2) both into the translation network $g_\phi$ and onto the texture maps $T^k$, so that minimizing this loss updates not only the network parameters but also the textures themselves. In addition, training also optimizes a mask loss that measures the discrepancy between the ground truth background mask $1 - \bar{M}_i$ and the predicted background mask:

$$\mathcal{L}_{\text{mask}}(\phi) = d_{\text{BCE}}\big(1 - \bar{M}_i,\, g_\phi^P(B_i)^n\big), \quad (5)$$

where $d_{\text{BCE}}$ is the binary cross-entropy loss, and $g_\phi^P(B_i)^n$ corresponds to the $n$-th (that is, background) channel of the predicted stack of part assignment maps. After backpropagation of the weighted combination of (4) and (5), the network parameters $\phi$ and the texture maps $T^k$ are updated. As training progresses, the texture maps change (Fig. 2), and so do the body part coordinate predictions, so that training is free to choose an appropriate parameterization of the body part surfaces.
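The mask term can be illustrated with a minimal binary cross-entropy sketch that compares a ground truth background mask with the predicted background probability map. Averaging over pixels, the clamping constant `eps` and the name `bce_mask_loss` are assumptions of this example.

```python
import math

def bce_mask_loss(true_background, predicted_background, eps=1e-7):
    """Mean binary cross-entropy between two maps given as rows of scalars.

    true_background: 1.0 where the pixel is background, 0.0 where it is the person.
    predicted_background: predicted background probability at each pixel.
    """
    total, count = 0.0, 0
    for row_t, row_p in zip(true_background, predicted_background):
        for t, p in zip(row_t, row_p):
            p = min(max(p, eps), 1.0 - eps)   # keep log() finite
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count
```

In a full training setup this value would be added, with some weight, to the image loss before backpropagation; the weight itself is not specified in the description.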

Fig. 2 shows an overview of the textured neural avatar system. The input pose is defined as a stack of "bone" rasterizations (one bone per channel; highlighted in red in the drawing). The input is processed by a fully convolutional network (shown in orange) to produce a stack of body part assignment maps and a stack of body part coordinate maps. These stacks are then used to sample the body texture maps at the locations prescribed by the part coordinate stack, with the weights prescribed by the part assignment stack, to produce the RGB image. In addition, the last map of the body part assignment stack corresponds to the background probability. During training, the mask and the RGB image are compared with the ground truth, and the resulting losses are backpropagated through the sampling operation into the fully convolutional network and onto the texture, resulting in their updates.

Video-to-video post-processing. Although the textured neural avatar model can be used as a standalone neural rendering engine, its output is post-processed with a video-to-video processing module, which improves temporal consistency and adds pose- and view-dependent appearance variations that cannot be fully modeled by the textured neural avatar alone. Thus, a video translation network $\tilde{h}_\psi$ with learnable parameters $\psi$ is considered, which takes as input the stream of outputs of the textured neural avatar:

$$I_i^{\text{final}} = \tilde{h}_\psi(s_i, s_{i-1}, s_{i-2}), \quad (6)$$

where $s_t = s(g_\phi^P(B_t), g_\phi^C(B_t), T)$ denotes the output of the textured neural avatar at time $t$ (again, it is assumed that examples $i-1$ and $i-2$ correspond to the preceding frames). A comparison is presented below of the proposed full model, that is, the textured neural avatar with post-processing (6), textured neural avatars without such post-processing, and the video-to-video baseline (1); it demonstrates a clear improvement from using the textured neural avatar (with or without post-processing).

Initialization of the textured neural avatar. The DensePose system [21] is used to initialize the textured neural avatar. In particular, there are two options for initializing $g_\phi$. First, when the training data are abundant and come from multiple training monocular sequences, DensePose can be run on the training images to obtain part assignment maps and part coordinate maps. Then $g_\phi$ is pre-trained as a translation network between the pose stacks $B_i$ and the DensePose outputs. Alternatively, a "universally" pre-trained $g_\phi$ can be used, which is trained to translate pose stacks into DensePose outputs on a standalone large-scale dataset (the authors use the COCO dataset [32]).

Once the DensePose mapping has been initialized, the texture maps $T^k$ are initialized as follows. Each pixel in the training images is assigned to a single body part (according to the prediction of $g_\phi^P$) and to a particular texture pixel on the texture of the corresponding part (according to the prediction of $g_\phi^C$). The value of each texture pixel is then initialized to the mean of all image values assigned to it (texture pixels to which no image pixels are assigned are initialized to black).
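The texture initialization described above amounts to a per-texel average, as in the following minimal sketch. It assumes the hard assignment of each image pixel to a body part and a texel has already been done, and represents it as a list of `(part, tx, ty, rgb)` tuples; that representation and the name `init_textures` are hypothetical.

```python
def init_textures(samples, n_parts, w):
    """Average the image values assigned to each texel; unassigned texels stay black.

    samples: iterable of (part_index, tx, ty, (r, g, b)) with integer texel indices.
    Returns n_parts textures, each indexed as texture[tx][ty] -> (r, g, b).
    """
    sums = [[[[0.0, 0.0, 0.0] for _ in range(w)] for _ in range(w)]
            for _ in range(n_parts)]
    counts = [[[0] * w for _ in range(w)] for _ in range(n_parts)]
    for k, tx, ty, rgb in samples:
        for c in range(3):
            sums[k][tx][ty][c] += rgb[c]
        counts[k][tx][ty] += 1

    textures = []
    for k in range(n_parts):
        texture = []
        for tx in range(w):
            row = []
            for ty in range(w):
                n = counts[k][tx][ty]
                row.append(tuple(v / n for v in sums[k][tx][ty]) if n
                           else (0.0, 0.0, 0.0))
            texture.append(row)
        textures.append(texture)
    return textures
```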

Losses and architectures.

A perceptual (VGG) loss [51, 26] is used in (4) to measure the difference between the generated image and the ground truth image when training the textured neural avatar. Any other standard loss for measuring such differences can be used instead.

Transfer learning.

Once the textured neural avatar has been trained for a particular person on a large amount of data, it can be retrained for another person using much less data (so-called transfer learning). During retraining, a new stack of texture maps is re-estimated using the initialization procedure described above. Training then proceeds in the usual way, but uses the previously trained set of parameters $\phi$ for initialization.

One embodiment of the method for synthesizing a two-dimensional image of a person will now be described in more detail with reference to Fig. 3. The method includes steps 101, 102, 103 and 104.

At step 101, three-dimensional coordinates of the body joint positions of the person, defined in the camera coordinate system, are received. The three-dimensional coordinates of the body joint positions define the pose of the person and the viewpoint of the two-dimensional image to be synthesized.

At step 102, a trained machine learning predictor predicts a stack of body part assignment maps and a stack of body part coordinate maps based on the received three-dimensional coordinates of the body joint positions. The stack of body part coordinate maps defines the texture coordinates of the pixels of the body parts of the person. The stack of body part assignment maps defines weights, each weight indicating the probability that a particular pixel belongs to a particular body part of the person.

At step 103, a previously initialized stack of texture maps for the body parts of the person is retrieved from memory. This stack of texture maps contains the pixel values of the body parts of the person.

At step 104, the two-dimensional image of the person is reconstructed as a weighted combination of pixel values using the stack of body part assignment maps and the stack of body part coordinate maps predicted at step 102 and the stack of texture maps retrieved at step 103.

The machine learning predictor is one of: a deep neural network, a deep convolutional neural network, a deep fully convolutional neural network, a deep neural network trained with a perceptual loss function, or a deep neural network trained in a generative adversarial framework.

The process of obtaining the trained machine learning predictor and the stack of texture maps for the body parts of the person will now be described in more detail with reference to Fig. 4. This process comprises steps 201, 202, 203, 204, 205, 206, 207 and 208. Steps 203 and 204 may be performed simultaneously or sequentially in any order.

At step 201, a plurality of images of the person in different poses and from different viewpoints is received.

At step 202, three-dimensional coordinates of the body joint positions of the person, defined in the camera coordinate system, are obtained for each image of the received plurality of images. The three-dimensional coordinates can be obtained using any suitable method. Such methods are known in the art.

At step 203, the machine learning predictor is initialized based on the three-dimensional coordinates of the body joint positions and the received plurality of images, to obtain parameters for predicting the stack of body part assignment maps and the stack of body part coordinate maps.

At step 204, the stack of texture maps is initialized based on the three-dimensional coordinates of the body joint positions and the received plurality of images, and the stack of texture maps is stored in memory.

At step 205, the stack of body part assignment maps and the stack of body part coordinate maps are predicted using the current state of the machine learning predictor, based on the three-dimensional coordinates of the body joint positions.

At step 206, a two-dimensional image of the person is reconstructed as a weighted combination of pixel values using the stack of body part assignment maps, the stack of body part coordinate maps and the stack of texture maps stored in memory.

At step 207, the reconstructed two-dimensional image is compared with a ground truth two-dimensional image to determine the reconstruction error of the two-dimensional image. The ground truth two-dimensional image corresponds to the reconstructed two-dimensional image in terms of the pose of the person and the viewpoint. The ground truth two-dimensional image is selected from the received plurality of images.

At step 208, the parameters of the trained machine learning predictor and the pixel values in the stack of texture maps are updated based on the result of the comparison.

Steps 205 to 208 are then repeated to reconstruct different two-dimensional images of the person until a predetermined condition is met. The predetermined condition may be at least one of: a predetermined number of repetitions having been performed, a predetermined time having elapsed, or the reconstruction error of the two-dimensional image of the person no longer decreasing.
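The three stopping conditions can be combined in a simple training driver, sketched below. The callable `step_fn` stands for one pass of steps 205 to 208 and is assumed to return the current reconstruction error; the `patience` rule used to decide that the error is no longer decreasing, the default limits and the name `train` are illustrative assumptions.

```python
import time

def train(step_fn, max_iters=1000, max_seconds=60.0, patience=10):
    """Repeat step_fn until an iteration limit, a time limit or a stalled error.

    Returns (iterations_performed, reason_for_stopping).
    """
    start = time.monotonic()
    best = float("inf")
    stale = 0
    for it in range(1, max_iters + 1):
        err = step_fn()                  # one pass of steps 205-208
        if err < best:
            best, stale = err, 0
        else:
            stale += 1                   # error did not decrease this iteration
        if stale >= patience:
            return it, "no_improvement"
        if time.monotonic() - start >= max_seconds:
            return it, "time_limit"
    return max_iters, "max_iterations"
```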

Training of deep networks is well known in the art, so the specific training steps are not described in detail.

In another embodiment, the prediction (S102, S205) of the stack of body part assignment maps and the stack of body part coordinate maps may be based on a stack of rasterized segment maps. The stack of rasterized segment maps is generated from the three-dimensional coordinates of the body joint positions. Each map in the stack of rasterized segment maps contains a rasterized segment representing a part of the human body.
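One possible way to build such a stack is sketched below. The pinhole projection with intrinsics `focal`, `cx`, `cy`, the list of joint index pairs `bones`, the binary (0/1) encoding of each segment, and the function names are assumptions made for illustration; the text only states that the maps are derived from the three-dimensional joint coordinates.

```python
import numpy as np

def project(joints_3d, focal, cx, cy):
    """Project camera-space joints (J, 3) to pixel coordinates (J, 2) with a pinhole model."""
    x, y, z = joints_3d[:, 0], joints_3d[:, 1], joints_3d[:, 2]
    return np.stack([focal * x / z + cx, focal * y / z + cy], axis=1)

def rasterize_segments(joints_3d, bones, height, width, focal, cx, cy):
    """Return a stack (len(bones), H, W) with one rasterized line segment per map."""
    pts = project(np.asarray(joints_3d, dtype=float), focal, cx, cy)
    maps = np.zeros((len(bones), height, width), dtype=np.float32)
    for k, (a, b) in enumerate(bones):
        length = np.linalg.norm(pts[b] - pts[a])
        n_samples = 2 * int(np.ceil(length)) + 2      # oversample to avoid gaps
        t = np.linspace(0.0, 1.0, n_samples)[:, None]
        line = pts[a] * (1 - t) + pts[b] * t          # points along the segment
        cols = np.rint(line[:, 0]).astype(int)
        rows = np.rint(line[:, 1]).astype(int)
        inside = (rows >= 0) & (rows < height) & (cols >= 0) & (cols < width)
        maps[k, rows[inside], cols[inside]] = 1.0
    return maps
```

Keeping each segment in its own channel lets a downstream network tell body parts apart even when their projections overlap in the image.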

In yet another embodiment, the trained machine learning predictor may be retrained for another person based on a plurality of images of that other person.

All of the operations described above may be performed by a system for synthesizing a two-dimensional image of a person. The system comprises a processor and a memory. The memory stores instructions that cause the processor to implement the method for synthesizing a two-dimensional image of a person.

Since embodiments of the invention have been described as being implemented, at least in part, by a software-controlled data processing device, it should be understood that a non-transitory machine-readable medium carrying such software, such as an optical disk, a magnetic disk, a semiconductor memory device or the like, is also to be considered an embodiment of the present disclosure.

It will be understood that embodiments of the system for synthesizing a two-dimensional image of a person may be implemented as different functional blocks, circuits and/or processors. However, any suitable distribution of functions between different functional blocks, circuits and/or processors may be used without detracting from the embodiments.

These embodiments may be implemented in any suitable form, including hardware, software, firmware, or any combination thereof. The embodiments may be implemented, at least in part, as computer software running on one or more data processors and/or digital signal processors. The elements and components of any embodiment may be physically, functionally and logically implemented in any suitable way. Indeed, the functions may be implemented in a single block, in a plurality of blocks, or as part of other functional blocks. As such, the proposed embodiments may be implemented in a single block or may be physically and functionally distributed between different blocks, circuits and/or processors.

The above description of embodiments of the invention is illustrative, and modifications in configuration and implementation fall within the scope of the present description. For example, although embodiments of the invention are described generally with reference to FIGS. 1-4, these descriptions are exemplary. Although the subject matter has been described in language specific to structural features or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. In addition, the invention is not limited to the illustrated order of the method steps, and this order may be changed by a person skilled in the art without inventive effort. Some or all of the method steps may be performed sequentially or simultaneously. Accordingly, the scope of the invention is limited only by the following claims.

Claims

1. A method for synthesizing a two-dimensional image of a person, comprising:

receiving (S101) three-dimensional coordinates of positions of joints of the human body specified in a camera coordinate system, wherein the three-dimensional coordinates of the positions of the body joints define a pose of the person and a viewpoint of the two-dimensional image;

predicting (S102), using a trained machine learning predictor, a stack of body part assignment maps and a stack of body part coordinate maps based on the three-dimensional coordinates of the positions of the body joints, wherein the stack of body part coordinate maps defines texture pixel coordinates of the body parts, and the stack of body part assignment maps defines weights, each weight indicating the probability that a particular pixel belongs to a particular part of the human body, wherein the trained machine learning predictor is trained for a plurality of different poses of the person and different viewpoints on the person;

retrieving (S103) from a memory a previously initialized stack of texture maps for the parts of the human body, wherein the stack of texture maps contains pixel values of the parts of the human body; and

reconstructing (S104) the two-dimensional image of the person as a weighted combination of pixel values using said stack of body part assignment maps, said stack of body part coordinate maps and said stack of texture maps.

2. The method according to claim 1, wherein, to obtain the trained machine learning predictor and the previously initialized stack of texture maps for the parts of the human body:

receiving (S201) a plurality of images of the person in different poses and from different viewpoints;

obtaining (S202) three-dimensional coordinates of the positions of the joints of the human body specified in the camera coordinate system for each image of the received plurality of images;

initializing (S203) the machine learning predictor based on the three-dimensional coordinates of the positions of the body joints and the received plurality of images to obtain parameters for predicting the stack of body part assignment maps and the stack of body part coordinate maps;

initializing (S204) the stack of texture maps based on the three-dimensional coordinates of the positions of the body joints and the received plurality of images, and storing the stack of texture maps in the memory;

predicting (S205), using the current state of the machine learning predictor, the stack of body part assignment maps and the stack of body part coordinate maps based on the three-dimensional coordinates of the positions of the body joints;

reconstructing (S206) a two-dimensional image of the person as a weighted combination of pixel values using the stack of body part assignment maps, the stack of body part coordinate maps and the stack of texture maps stored in the memory;

comparing (S207) the reconstructed two-dimensional image with the corresponding ground-truth two-dimensional image from the received plurality of images to determine a reconstruction error of the two-dimensional image;

updating (S208) the parameters of the trained machine learning predictor and the pixel values in the stack of texture maps based on the result of the comparison; and

repeating steps S205-S208 to reconstruct different two-dimensional images of the person until a predetermined condition is met, wherein the predetermined condition is at least one of: a predetermined number of repetitions having been performed, a predetermined time having elapsed, or the reconstruction error of the two-dimensional image of the person no longer decreasing.

3. The method according to claim 1 or 2, wherein the machine learning predictor is one of: a deep neural network, a deep convolutional neural network, a deep fully convolutional neural network, a deep neural network trained with a perceptual loss function, or a deep neural network trained in a generative adversarial manner.

4. The method according to claim 1 or 2, further comprising:

generating a stack of rasterized segment maps based on the three-dimensional coordinates of the positions of the body joints, wherein each map of the stack of rasterized segment maps contains a rasterized segment representing a part of the human body,

wherein the prediction of the stack of body part assignment maps and the stack of body part coordinate maps is based on the stack of rasterized segment maps.

5. The method according to claim 2, wherein the trained machine learning predictor is retrained for another person based on a plurality of images of the other person.

6. A system for synthesizing a two-dimensional image of a person, comprising:

a processor; and

a memory containing instructions that cause the processor to perform the method for synthesizing a two-dimensional image of a person according to any one of claims 1-5.