1 Introduction
With recent developments in VR headset technology, wearable computing is increasingly becoming a reality, with great potential for applications like immersive and social telepresence that no longer rely on complicated, stationary, and spatially constrained external capture setups, e.g. multi-camera rigs. However, this comes with the challenge of faithfully reproducing, animating, and re-rendering the real human in the virtual world from only limited body-mounted sensor data. More precisely, this requires 1) turning a real human into an animatable, full-body, and photoreal digital avatar and 2) faithfully tracking the person's motion from the limited sensor data of the headset to drive the avatar.
Both of these points pose a significant research challenge, and many recent works have proposed promising solutions. Creating animatable and photoreal digital twins from real-world measurements of a human, e.g. multi-view video [Bagautdinov et al. 2021; Habermann et al. 2023; 2021; Kwon et al. 2023; Liu et al. 2021; Luvizon et al. 2024; Pang et al. 2023; Xiang et al. 2022; 2021] or single-view video [Jiang et al. 2022; Weng et al. 2022], has been extensively researched recently, achieving high-quality results while also enabling real-time performance [Zhu et al. 2023]. Follow-up works [Remelli et al. 2022; Shetty et al. 2023; Sun et al. 2024; Xiang et al. 2023] have shown how to drive such avatars from sparse signals like a few stationary and external cameras. Nevertheless, such a hardware setup constrains the person to remain within a fixed and very limited capture volume observed by the external cameras. In contrast, other works have focused on egocentric motion capture [Akada et al. 2022; Cha et al. 2018; Kang et al. 2023; Rhodin et al. 2016; Wang et al. 2023a; 2022; 2021; 2023c] using head-mounted cameras while ignoring rendering functionality. However, none of the above works tackles the problem of driving a photoreal full-body avatar from a single egocentric video. A few works [Elgharib et al. 2020; Wei et al. 2019] have focused on driving a head avatar from egocentric video. However, they cannot drive the entire human body, which is significantly more challenging due to body articulation leading to large pose variation, diversity in clothing, and drastically different appearance properties, i.e. clothing material vs. skin reflectance. The most closely related work is EgoRenderer [Hu et al. 2021], which is the only approach attempting to solve this problem, but its results are far from photoreal and suffer from severe temporal jitter. In summary, none of the prior works can drive a photorealistic full-body avatar from a single egocentric RGB camera feed.
To address this, we present EgoAvatar, the first approach for driving and rendering a photoreal full-body avatar solely from a single egocentric RGB camera. Our proposed system enables operation in unconstrained environments while maintaining high visual fidelity, thus paving the way for immersive telepresence without requiring complicated (multi-view) camera setups during inference (see Fig. 1), freeing the user from spatial constraints. Here, the term full-body refers to modeling the entire clothed human body, while we do not account for hand gestures and facial expressions.
More precisely, given a multi-view video of the person, we first learn a drivable photorealistic avatar, i.e. a function that takes the skeletal motion in the form of joint angles as input and predicts the avatar's motion- and view-dependent geometry and appearance. First, a geometry module, MotionDeformer, predicts the surface deformation with respect to a static template mesh given the skeletal motion. Subsequently, our appearance module, GaussianPredictor, takes the respective posed and deformed template geometry and learns 3D Gaussian parameters in the texture space parameterized by the template's UV atlas, where each texel encodes a 3D Gaussian. During training, we splat the Gaussians into image space and supervise them with the multi-view video.
We then capture a second multi-view video of the subject, this time wearing the headset, from which we train a personalized egocentric view-driven skeleton keypoint detector, dubbed EgoPoseDetector, to extract the subject's pose in the form of 3D joint predictions. Our inverse kinematics (IK) module, called IKSolver, completes and refines those noisy and incomplete joint predictions to produce temporally stable joint angles of the virtual character. From those, the MotionDeformer provides an initial estimate of the deformed surface, which is further refined with the proposed EgoDeformer to ensure a faithful reprojection into the egocentric view. Lastly, the optimized surface is used to predict the final appearance using our GaussianPredictor. In summary, our contributions are threefold:
• We propose EgoAvatar, the first full-body photoreal and egocentrically driven avatar approach, which, at inference, simply takes as input a monocular video stream from a head-mounted, down-facing camera and faithfully re-creates realistic full-body appearance that can be re-rendered as a free-viewpoint video.
• We further introduce a carefully designed avatar representation, egocentric tracking pipeline, and egocentric geometry refinement, which in conjunction outperform all design alternatives.
• We propose the first dataset that provides paired 120-camera multi-view and monocular egocentric videos capturing the full human body at 4K resolution.
Our experiments demonstrate, both quantitatively and qualitatively, a clear improvement over the state of the art, while for the first time demonstrating a unified approach for egocentrically driven, photoreal, full-body, and free-view human rendering.
2 Related Work
Motion-driven Full-body Avatars. Motion-driven full-body clothed avatars [Bagautdinov et al. 2021; Habermann et al. 2021] represent the human with explicit meshes and learn the dynamic human geometry and appearance from multi-view video. Some works [Liu et al. 2021; Peng et al. 2021; Weng et al. 2022; Zheng et al. 2023] propose SMPL-based [Loper et al. 2015] human priors combined with neural radiance fields [Mildenhall et al. 2020] for human avatar synthesis. However, it is difficult to perform faithful avatar animation and rendering without an explicit modeling of clothing geometry. In consequence, these works typically fail to recover high-frequency cloth details or loose types of apparel. To better capture clothing dynamics, later works [Xiang et al. 2022; 2021] model clothes as separate mesh layers, but they require a sophisticated clothing registration and segmentation pipeline. In this work, we closely follow DDC [Habermann et al. 2021] for our skeletal motion-dependent geometry representation of the avatar, which consists of a single mesh whose deformations are modeled in a coarse-to-fine manner. However, similar to some recent works [Pang et al. 2023], we represent the appearance as Gaussian splats [Kerbl et al. 2023] in contrast to DDC's dynamic texture maps, which drastically improves the rendering quality. Importantly, all these works solely concentrate on creating an animatable avatar, but they do not target driving it from sparse egocentric observations, which is the focus of our work.
Sparse View-driven Full-body Avatars. Motion-driven avatars can provide visually plausible geometry and rendering quality, but they typically fall short in faithfully reproducing the actual geometry and appearance present in the images, which is a critical requirement for virtual telepresence applications. This is primarily caused by the one-to-many mapping [Liu et al. 2021], i.e. one skeletal pose can result in different surface and appearance configurations, as these are typically influenced by more than the skeletal pose, e.g. external forces and initial clothing states. Therefore, researchers seek affordable sensors as additional input signals to generate authentic avatars, e.g. sparse-view RGB [Kwon et al. 2021; 2022; Remelli et al. 2022; Shetty et al. 2023] and RGBD [Xiang et al. 2023] driving signals. Without explicit modeling of pose and body geometry, LookinGood [Martin-Brualla et al. 2018] can also re-render humans captured from monocular/sparse RGBD sensors. However, such methods rely on stationary multi-camera rigs, which are complicated to set up and heavily constrain the physical capture space. A few works [Elgharib et al. 2020; Jourabloo et al. 2022] have focused on driving avatars from egocentric camera setups, but they solely reconstruct and render the face or head rather than the full human body. Egocentric full-body avatar synthesis is a significantly more challenging task due to the highly non-rigid surface deformation of clothing, the severe self-occlusion, and the complex material patterns, i.e. cloth vs. skin reflectance. EgoRenderer [Hu et al. 2021] is the only work that has attempted to solve this task. However, its results are far from photoreal and contain an observable amount of temporal jitter due to its SMPL-based character representation and inaccurate skeletal pose prediction. In contrast, our hybrid character representation, i.e. a deformable mesh combined with Gaussian splatting, achieves significantly higher rendering quality, while our personalized pose detector and inverse kinematics solver also demonstrate drastically improved motion tracking quality.
Egocentric Human Pose and Shape Estimation. Motivated by the current advances in AR/VR, we can divide the egocentric setup into two categories: front-facing and down-facing camera setups. Front-facing cameras [Li et al. 2023; Luo et al. 2021; Yuan and Kitani 2019] have limited visibility of the human body movement, which contradicts our goal of high-fidelity body and clothing capture. Instead, we follow the down-facing camera setup [Tome et al. 2019; Xu et al. 2019] to recover the full-body egocentric 3D human pose. Despite the direct observation of the human body, conventional pose trackers [Martinez et al. 2017] suffer from severe self-occlusion, unstable camera motion, and fisheye camera distortions. To cope with the aforementioned challenges, researchers [Wang et al. 2021; 2023c] try to localize the world-space pose with the help of SLAM and scene depth estimation as constraints. A recently proposed fisheye vision transformer model [Wang et al. 2023a] with a diffusion-guided pose detector greatly improves the generalization ability and the performance of egocentric pose estimation by leveraging large-scale synthetic data. For improved accuracy, we finetune the model of Wang et al. [2023a] on person-specific egocentric data. We highlight that this model predicts keypoints while not considering free-view rendering and photorealism at all. In contrast, we are interested in driving our photoreal character and propose a dedicated inverse kinematics solver, which recovers joint angles from keypoint estimates.
3 Character Model
We employ a parametric model following DDC [Habermann et al. 2021] to represent the explicit body and clothing shape. For each character, we obtain a template mesh \(\boldsymbol{T}_0\in\mathbb{R}^{4890\times 3}\), a skeleton S, and skinning weights \(\mathcal{W}\) from a body scanner. The skeleton S is controlled by 54 degrees of freedom (DoFs) \(\boldsymbol{\theta}\in\mathbb{R}^{54}\), including joint rotations, global rotation, and translation. The canonical shape T is parameterized as a non-rigid deformation T(·) of the 4890-vertex template shape T0 using a 489-node embedded graph [Sumner et al. 2007]. The embedded deformation is parameterized by per-node rotations \(\boldsymbol{\alpha}\in\mathbb{R}^{489\times 3\times 3}\) and translations \(\boldsymbol{t}\in\mathbb{R}^{489\times 3}\), as well as per-vertex offsets \(\boldsymbol{d}\in\mathbb{R}^{4890\times 3}\). Driven by motion parameters θ, we first transform the skeleton via Forward Kinematics (FK) \(\mathcal{J}(\boldsymbol{\theta})\) and then animate the canonical character model T into the posed space M via the Linear Blend Skinning (LBS) [Magnenat et al. 1988] function W(·). Formally, our character model composes the embedded deformation, the per-vertex displacements, and the skinning (Eq. 1), which allows us to effectively represent surface deformations from coarse to fine as well as character skinning.
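A minimal sketch of this composition, written in the notation introduced above (the exact form of Eq. 1 may differ), is
\[
\boldsymbol{M}(\boldsymbol{\theta},\boldsymbol{\alpha},\boldsymbol{t},\boldsymbol{d}) \;=\; W\!\big(T(\boldsymbol{T}_0,\boldsymbol{\alpha},\boldsymbol{t},\boldsymbol{d}),\, \mathcal{J}(\boldsymbol{\theta}),\, \mathcal{W}\big),
\]
i.e. the template is first non-rigidly deformed by the embedded graph and the per-vertex offsets, and then posed via skinning.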
4 Motion-driven Avatar Geometry Recovery
This section introduces our pipeline (see also Fig. 2) that recovers the explicit full-body avatar model M from an egocentric RGB video of the real human. To this end, we first predict 3D joint positions (Sec. 4.1) using our personalized EgoPoseDetector. Then, we introduce an inverse kinematics solver (Sec. 4.2), dubbed IKSolver, which recovers the DoFs of the skeleton from the 3D joint predictions. Next, we introduce a data-driven MotionDeformer (Sec. 4.3), which maps the skeletal pose to motion-dependent surface deformations, effectively capturing details beyond pure skinning-based deformation and, thus, serving as a strong prior. Due to the ambiguity of fine-grained cloth deformation, which is not solely dependent on the skeletal motion, an optimization is derived at test time to further align the avatar's geometry with egocentric image silhouettes (Sec. 4.4).
4.1 EgoPoseDetector for Pose Detection
Inspired by the current success of learning-based 3D human pose estimation, we adapt the vision transformer-based egocentric pose detector by Wang et al. [2023a], which is pretrained on a large synthetic dataset. Given an image I of frame τ, the neural network localizes 25 body joints \(\boldsymbol{J}\in\mathbb{R}^{25\times 3}\). To further boost the accuracy of the pose detections, we customize a person-specific detector by fine-tuning the generic detection network on subject-specific data, i.e. egocentric images and respective keypoints, using our paired egocentric and multi-view data (see also Sec. 6.1). Notably, the 3D keypoint detections are predicted with respect to the egocentric camera coordinate system, and we transform them to the global coordinate system, denoted as Jτ, assuming the head-mounted camera can be tracked in 3D space. We highlight that this is a reasonable assumption as recent head-mounted displays offer highly accurate head tracking.
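As a sketch, assuming the tracked head-mounted camera pose at frame τ is given by a rotation \(\boldsymbol{R}^{\tau}_{\mathrm{cam}}\) and translation \(\boldsymbol{t}^{\tau}_{\mathrm{cam}}\) (our notation for illustration), the camera-to-world transformation of the detected joints reads
\[
\boldsymbol{J}^{\tau}_{j} \;=\; \boldsymbol{R}^{\tau}_{\mathrm{cam}}\,\boldsymbol{J}_{j} \;+\; \boldsymbol{t}^{\tau}_{\mathrm{cam}}, \qquad j = 1,\ldots,25,
\]
where \(\boldsymbol{J}_{j}\) denotes the j-th joint predicted in the egocentric camera frame.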
4.2 IKSolver for Skeletal Motion Estimation
With the estimated body joint positions Jτ, our inverse kinematics solver optimizes the skeletal pose parameters θτ of frame τ. Here, we perform a coarse-to-fine optimization, where we first solve for the global rotation and translation, and then jointly refine these parameters together with the joint angles. Each stage iteratively minimizes an energy consisting of a data term and several regularizers. Concretely, our data term aligns each skeleton joint j with its prediction. Further, we introduce regularizers that produce temporally smooth motions and plausible joint angles, accounting for 1) noisy joint predictions, especially for self-occluded parts, and 2) indeterminate DoFs that are not directly supervised through joint positions (i.e. the upper arm rotation). Here, θmax and θmin are anatomically inspired joint angle limits that were empirically determined. Lastly, we introduce a simple, yet effective regularization ensuring that the optimized pose is close to the mean pose \(\bar{\boldsymbol{\theta}}\) of the training motions. This avoids implausible angle twists, i.e. poses that may coincide with the keypoint detections but have implausible angle configurations.
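A minimal sketch of such an energy, with generic weights \(\lambda\) standing in for the unspecified weights of the individual terms, is
\[
E(\boldsymbol{\theta}^{\tau}) \;=\; \sum_{j} \big\lVert \mathcal{J}_{j}(\boldsymbol{\theta}^{\tau}) - \boldsymbol{J}^{\tau}_{j} \big\rVert_2^2
\;+\; \lambda_{\mathrm{temp}} \big\lVert \boldsymbol{\theta}^{\tau} - \boldsymbol{\theta}^{\tau-1} \big\rVert_2^2
\;+\; \lambda_{\mathrm{limit}} \big\lVert \max(\boldsymbol{\theta}^{\tau} - \boldsymbol{\theta}_{\max}, 0) + \max(\boldsymbol{\theta}_{\min} - \boldsymbol{\theta}^{\tau}, 0) \big\rVert_2^2
\;+\; \lambda_{\mathrm{mean}} \big\lVert \boldsymbol{\theta}^{\tau} - \bar{\boldsymbol{\theta}} \big\rVert_2^2,
\]
where \(\mathcal{J}_{j}(\boldsymbol{\theta}^{\tau})\) is the position of joint j after forward kinematics; the exact regularizer formulations in our implementation may differ.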
4.3 MotionDeformer for Clothed Avatar Animation
The aim of this stage is to produce the character surface M of the clothed human avatar from the optimized motion θτ. Simply posing the template T0 using LBS is not sufficient to recover dynamic clothing details, as skinning mostly models piece-wise rigid deformations. Hence, we aim at modeling the fine-grained clothing deformations conditioned on the normalized motion input \(\hat{\boldsymbol{\theta}}^\tau=\lbrace\boldsymbol{\theta}^i: i\in\lbrace\tau-2,\ldots,\tau\rbrace\rbrace\). In particular, we exploit two structure-aware graph neural networks [Habermann et al. 2021] to sequentially recover the low-frequency embedded graph parameters α, t and the high-frequency vertex offsets d. The networks are supervised with rendering, mask, and Chamfer losses against multi-view images, foreground segmentations, and multi-view stereo reconstructions. Importantly, we leverage the unpaired training sequence depicting the subject without the head-mounted camera. With the predicted α, t, d, we recover the surface geometry of our avatar Mτ for frame τ using Eq. 1. Thus, we obtain a strong pose-dependent surface deformation prior, which models surface deformations beyond typical skinning and which we leverage in the next stage.
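Schematically, with \(f_{\mathrm{EG}}\) and \(f_{\Delta}\) denoting the two graph neural networks (hypothetical names used only for illustration), the sequential prediction can be summarized as
\[
(\boldsymbol{\alpha}, \boldsymbol{t}) = f_{\mathrm{EG}}\big(\hat{\boldsymbol{\theta}}^{\tau}\big), \qquad
\boldsymbol{d} = f_{\Delta}\big(\hat{\boldsymbol{\theta}}^{\tau}\big), \qquad
\boldsymbol{M}^{\tau} = W\!\big(T(\boldsymbol{T}_0,\boldsymbol{\alpha},\boldsymbol{t},\boldsymbol{d}),\, \mathcal{J}(\boldsymbol{\theta}^{\tau}),\, \mathcal{W}\big).
\]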
4.4 EgoDeformer for Mesh Refinement
While the previous stage produces a plausible surface shape M that preserves the major geometric details, it misses the stochastic, motion-independent cloth movements and contains minor motion prediction errors, which results in a misalignment between the mesh projection Π(M) and the egocentric image I. We, thus, employ a test-time optimization to enhance the geometric accuracy. Specifically, we optimize a coordinate-based Multi-Layer Perceptron (MLP) \(\mathcal{F}_{\mathrm{MLP},\boldsymbol{\Psi}}\) to represent the smooth non-rigid clothing deformation, i.e. the final vertex positions of the deformed mesh M′ are \(\boldsymbol{v}+\mathcal{F}_{\mathrm{MLP},\boldsymbol{\Psi}}(\boldsymbol{v})\), where \(\boldsymbol{v}\in\mathbb{R}^3\) is a vertex coordinate of M.
We optimize the MLP weights Ψ using a silhouette-based objective: a silhouette loss minimizes the disparity between the projection Π(·) of the deformed mesh M′ and the egocentric silhouette images ISil segmented using [Kirillov et al. 2023]. However, recovering pixel-perfect geometry from a single view is highly ill-posed, especially due to rich wrinkles. Therefore, we do not explicitly reconstruct wrinkles with a photometric energy at test time, but rather learn to generate plausible wrinkles as dynamic Gaussian splats. We further introduce a Laplacian smoothness term ELap [Desbrun et al. 1999] and a part-based as-rigid-as-possible term EArap [Sorkine and Alexa 2007] to regularize the deformation. To further increase temporal consistency, we apply a low-pass filter as a post-processing step. For further details, we refer to the supplemental material.
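A minimal sketch of this test-time objective, again with generic weights \(\lambda\), is
\[
E(\boldsymbol{\Psi}) \;=\; E_{\mathrm{Sil}}\big(\Pi(\boldsymbol{M}^{\prime}), \boldsymbol{I}_{\mathrm{Sil}}\big) \;+\; \lambda_{\mathrm{Lap}}\, E_{\mathrm{Lap}} \;+\; \lambda_{\mathrm{Arap}}\, E_{\mathrm{Arap}},
\]
where the silhouette term penalizes the mismatch between the rendered and the segmented silhouette masks.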
5 Gaussian-based dynamic appearance
So far, we were only concerned with recovering the geometry of the human. In this section, we introduce our approach to render the high-resolution dynamic appearance of the 3D avatar by fusing pre-trained body textures with egocentric observations. Given a tracked mesh, deep textures [Habermann et al. 2021; Lombardi et al. 2018] successfully model dynamic textures and shading by predicting a motion- and view-dependent UV texture. However, mesh-based rendering considers only the first intersection of each ray with the triangulated surface, therefore leading to artifacts in thin structures, e.g. hair, and in regions where mesh tracking is often inaccurate, e.g. hands. Instead, mesh-driven volume rendering [Lombardi et al. 2021] provides extra flexibility in compensating for mesh tracking errors and modeling complex surface deformations. Motivated by recent advances in 3D Gaussian Splatting (3DGS) [Kerbl et al. 2023], we leverage 3D Gaussian spheres on top of the mesh surface as primitives, potentially enabling high-resolution rendering.
3DGS represents a 3D object as a collection of Gaussian spheres and renders them in a fully differentiable manner given a virtual camera view. Specifically, each Gaussian sphere is parameterized by a centroid \(\boldsymbol{x}\in\mathbb{R}^3\), a color \(\boldsymbol{c}\in\mathbb{R}^{48}\) in the form of third-order Spherical Harmonics coefficients, a quaternion rotation \(\boldsymbol{\phi}\in\mathbb{R}^4\), a scaling \(\boldsymbol{s}\in\mathbb{R}^3\), and an opacity \(\boldsymbol{o}\in\mathbb{R}\). We formulate the final mesh-driven 3D Gaussians accordingly, where the covariance matrix Σ is parameterized by the predicted scaling s and rotation ϕ. The 3D Gaussian splats are then rendered into a camera view via alpha blending: each pixel C blends N rasterized Gaussian projections, where the color and opacity of each Gaussian at pixel location \(\boldsymbol{x}^{\prime}\) are \(\boldsymbol{c}_i^{\prime}=G(\boldsymbol{x}^{\prime};\boldsymbol{x}_i,\boldsymbol{\Sigma}_i)\,\boldsymbol{c}_i\) and \(\boldsymbol{o}_i^{\prime}=G(\boldsymbol{x}^{\prime};\boldsymbol{x}_i,\boldsymbol{\Sigma}_i)\,\boldsymbol{o}_i\).
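For reference, the standard 3DGS formulation composites the depth-ordered Gaussians along each pixel ray via alpha blending,
\[
\boldsymbol{C} \;=\; \sum_{i=1}^{N} \boldsymbol{c}_i^{\prime}\, \boldsymbol{o}_i^{\prime} \prod_{j=1}^{i-1}\big(1-\boldsymbol{o}_j^{\prime}\big),
\]
which is what Eqs. 12 and 13 are assumed to follow here.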
Dynamic Appearance Modeling. With the recovered motion-driven avatar shape, we uniformly sample 3D Gaussians in the UV space of the underlying character mesh, i.e. each texel that is covered by a triangle in the UV map represents a Gaussian. In particular, we first obtain the 3D positions, rotations, and normals of the deformed mesh M at training time and M′ at test time. Then, for each texel i, we use barycentric interpolation to fetch the initial position \(\hat{\boldsymbol{x}}_{i}\), rotation \(\hat{\boldsymbol{\phi}}_{i}\), and normal \(\hat{\boldsymbol{n}}_i\), respectively. Inspired by the dynamic texture module in DDC, we leverage a UNet [Ronneberger et al. 2015] to predict Gaussian parameters and offsets in UV space, taking as input the global encoding of the mesh geometry (namely, the normal map \(\hat{\boldsymbol{n}}_i\)) and the skeleton root position x0. Given the differentiable rendering equation R as described in Eq. 12 and 13, we produce the final rendering Ir from the predicted per-texel Gaussians, where < ·, · > denotes the quaternion multiplication operation.
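A plausible sketch of this assembly, treating the UNet outputs (per-texel position offsets \(\Delta\boldsymbol{x}_i\), rotation corrections \(\Delta\boldsymbol{\phi}_i\), scales \(\boldsymbol{s}_i\), opacities \(\boldsymbol{o}_i\), and colors \(\boldsymbol{c}_i\)) as an assumption about the exact parameterization, is
\[
\boldsymbol{x}_i = \hat{\boldsymbol{x}}_i + \Delta\boldsymbol{x}_i, \qquad
\boldsymbol{\phi}_i = \langle \hat{\boldsymbol{\phi}}_i, \Delta\boldsymbol{\phi}_i \rangle, \qquad
\boldsymbol{I}_r = R\big(\lbrace \boldsymbol{x}_i, \boldsymbol{\phi}_i, \boldsymbol{s}_i, \boldsymbol{o}_i, \boldsymbol{c}_i \rbrace_{i}\big),
\]
where \(\langle \cdot, \cdot \rangle\) is the quaternion multiplication mentioned above.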
Training Losses. We supervise our model with multi-view 4K images using L1 and SSIM [Wang et al. 2004] losses. Following Xiang et al. [2023], we also leverage the ID-MRF [Wang et al. 2018] loss to ensure perceptually realistic renderings. Our total loss is a weighted combination of these terms.
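As a sketch, with weighting factors \(\lambda\) whose values are not specified here,
\[
\mathcal{L} \;=\; \lambda_{1}\,\mathcal{L}_{1} \;+\; \lambda_{\mathrm{SSIM}}\,\mathcal{L}_{\mathrm{SSIM}} \;+\; \lambda_{\mathrm{MRF}}\,\mathcal{L}_{\mathrm{MRF}}.
\]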
Training Scheme. We separate Gaussians for the head and body region. All Gaussians are jointly trained on the sequence without the head-mounted device. Then, the body Gaussians are fine-tuned on the additional sequence with the head-mounted device. At inference, we assemble the two parts of Gaussian splats for full-body rendering.