Abstract
We introduce the first dense neural non-rigid structure from motion (N-NRSfM) approach, which can be trained end-to-end in an unsupervised manner from 2D point tracks. In contrast to competing methods, our combination of loss functions is fully differentiable and can be readily integrated into deep-learning systems. We formulate the deformation model by an auto-decoder and impose subspace constraints on the recovered latent space function in the frequency domain. Thanks to the state recurrence cue, we classify the reconstructed non-rigid surfaces based on their similarity and recover the period of the input sequence. Our N-NRSfM approach achieves competitive accuracy on widely-used benchmark sequences and high visual quality on various real videos. Apart from being a standalone technique, our method enables multiple applications including shape compression, completion and interpolation, among others. Combined with an encoder trained directly on 2D images, we perform scenario-specific monocular 3D shape reconstruction at interactive frame rates. To facilitate the reproducibility of the results and boost the new research direction, we open-source our code and provide trained models for research purposes (http://gvv.mpi-inf.mpg.de/projects/Neural_NRSfM/).
Keywords
- Neural Non-Rigid Structure from Motion
- Sequence period detection
- Latent space constraints
- Deformation auto-decoder
1 Introduction
Non-Rigid Structure from Motion (NRSfM) reconstructs non-rigid surfaces and camera poses from monocular image sequences using multi-frame 2D correspondences calculated across the input views. It relies on motion and deformation cues as well as weak prior assumptions, and is object-class-independent in contrast to monocular 3D reconstruction methods which make use of parametric models [59]. Dense NRSfM has achieved remarkable progress during the last several years [1, 8, 19, 37, 51]. While the accuracy of dense NRSfM has been recently only marginally improved, learning-based direct methods for monocular rigid and non-rigid 3D reconstruction have become an active research area in computer vision [13, 33, 47, 54, 66].
Motivated by these advances, we make the first step towards learning-based dense NRSfM, as shown in Fig. 1. At the same time, we remain in the classical NRSfM setting without strong priors (which would restrict us to object-specific scenarios) and without assuming the availability of training data with 3D geometry. We find that among several algorithmic design choices, replacing an explicit deformation model by an implicit one, i.e., a neural network with latent variables for each shape, brings multiple advantages over previous work and enables new applications such as temporal state segmentation, shape completion, interpolation and direct monocular non-rigid 3D reconstruction (see Fig. 1-b for some examples).
By varying the number of parameters in our neural component, we can express our assumption on the complexity of the observed deformations. We observe that most real-world deformations evince state recurrence, which can serve as an additional reconstruction constraint. By imposing constraints on the latent space, we can thus detect the period of the sequence, denoted by \(\tau \), i.e., the duration in frames after which the underlying non-rigid 3D states repeat, and classify the recovered 3D states based on their similarity. Next, by attaching an image encoder to the learnt neural deformation model (deformation auto-decoder), we can perform direct monocular non-rigid 3D reconstruction at test time at interactive frame rates. Moreover, owing to its compactness, the auto-decoder represents the non-rigid states in a compressed form.
Note that the vast majority of the energy functions proposed in the literature so far are not fully differentiable or cannot easily be used in learning-based systems due to computational or memory requirements [1, 8, 19, 37]. We combine a data loss, constraints in the metric and trajectory spaces, a temporal smoothness loss and latent space constraints into a single energy—with the non-rigid shape parametrised by an auto-decoder—and optimise it with the back-propagation algorithm [49]. The experimental evaluation indicates that the proposed N-NRSfM approach not only obtains competitive 3D reconstructions, outperforming competing methods on several sequences, but also represents a useful tool for non-rigid shape analysis and processing.
Contributions. In summary, the primary contributions of this work are:
- The first, to the best of our knowledge, fully differentiable dense neural NRSfM approach with a novel auto-decoder-based deformation model (Sects. 3, 4);
- Subspace constraints on the latent space imposed in the Fourier domain, which enhance the reconstruction accuracy and enable temporal classification of the recovered non-rigid 3D states with period detection (Sect. 4.2);
- Several applications of the deformation model including shape compression, interpolation and completion, as well as fast direct non-rigid 3D reconstruction from monocular image sequences (Sect. 4.4);
- An extensive experimental evaluation of the core N-NRSfM technique and its applications with state-of-the-art results (Sect. 5).
2 Related Work
Recovering a non-rigid 3D shape from a single monocular camera has been an active research area in the past two decades. In the literature, two main classes of approaches have proved most effective so far: template-based formulations and NRSfM. On the one hand, template-based approaches rely on establishing correspondences with a reference image in which the 3D shape is known in advance [42, 53]. To avoid ambiguities, additional constraints are included in the optimisation, such as inextensibility [42, 65] or as-rigid-as-possible priors [68], providing very robust solutions but limiting the applicability to almost inelastic surfaces. While the results provided by template-based approaches are promising, knowing a 3D template in advance can be a hard requirement. NRSfM approaches relax this requirement, which broadens their applicability. In this context, NRSfM has been addressed in the literature by means of model-based approaches and, more recently, by deep-learning-based methods. We next review the work most related to ours from both perspectives.
Non-Rigid Structure from Motion. NRSfM solves the problem from 2D tracking data in a monocular video (in the literature, the 2D trajectories are collected in a measurement matrix). The most standard way to address the inherent ambiguity of the NRSfM problem is to assume that the underlying 3D shape is low-rank. To estimate such a low-rank model, both factorisation- [11] and optimisation-based approaches [43, 61] have been proposed, considering single low-dimensional shape spaces [16, 19], or a union of temporal [69] or spatio-temporal subspaces [3]. Low-rank models were also extended to other domains, exploiting a pre-defined trajectory basis [7], the combination of shape-trajectory vectors [28, 29], and the force space that induces the deformations [5]. On top of these models, additional spatial [38] or temporal [2, 10, 39] smoothness constraints, as well as shape priors [12, 21, 35], have also been considered. However, in contrast to their rigid counterparts, NRSfM methods are typically sparse, limiting their application to a small set of salient points. Whereas several methods are adaptations of sparse techniques to dense data [22, 51], other techniques were explicitly designed for the dense setting [1, 19, 37], relying on sophisticated optimisation strategies.
Neural Monocular Non-Rigid 3D Reconstruction. Another possibility to perform monocular non-rigid 3D reconstruction is to use learning-based approaches. Recently, many works have been presented for rigid [13, 18, 30, 40, 66] and non-rigid [27, 47, 54, 62] shape reconstruction. These methods exploit large annotated datasets to learn the solution space, limiting their applicability to the types of shapes observed in the dataset. Unfortunately, such supervision is hard to obtain in real applications, where acquiring 3D data to train a neural network is not trivial.
While there has been work at the intersection of NRSfM and deep learning, these methods require large training datasets [34, 41, 52] and address only the sparse case [34, 41]. C3DPO [41] learns basis shapes from 2D observations and does not require 3D supervision, similar to our approach. Neural methods for monocular non-rigid reconstruction have to be trained for every new object class or shape configuration within the class. In contrast to the latter methods—and similar to classical NRSfM—we rely solely on motion and deformation cues. Our approach is unsupervised and requires only dense 2D point tracks for the recovery of non-rigid shapes. Thus, we combine the best of both worlds, i.e., the expressivity of neural representations for deformation models and improvements upon the weak prior assumptions elaborated in previous works on dense NRSfM. We leverage the latter by designing an energy function which is fully differentiable and can be optimised with modern machine-learning tools.
3 Revisiting NRSfM
We next review the NRSfM formulation that will be used later to describe our neural approach. Let us consider a set of P points densely tracked across T frames. Let \(\mathbf{s}_t^p=[x_t^p,y_t^p,z_t^p]^\top \) be the 3D coordinates of the p-th point in image t, and \(\hat{\mathbf{w}}_t^p=[u_t^p,v_t^p]^\top \) its 2D position according to an orthographic projection. To simplify the subsequent formulation, the camera translation \(\mathbf {t}_t=\sum _p\hat{\mathbf{w}}_t^p/P\) can be subtracted from the 2D projections, yielding centred measurements \(\mathbf{w}_t^p=\hat{\mathbf{w}}_t^p-\mathbf {t}_t\). We can then build a linear system mapping the 3D to the 2D point coordinates:

\(\mathbf{W} = \mathbf{R}\,\mathbf{S}, \qquad (1)\)

where \(\mathbf{W}\) is a \(2T\times P\) measurement matrix with the 2D measurements arranged in columns, \(\mathbf {R}\) is a \(2T\times 3T\) block-diagonal matrix made of T truncated \(2\times 3\) camera rotations \(\mathbf {R}_t\equiv \mathbf{\Pi }\mathbf {G}_t\), with the full rotation matrix \(\mathbf {G}_t\) and \(\mathbf{\Pi }= \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0 \end{bmatrix}\), and \(\mathbf{S}\) is a \(3T\times P\) matrix with the non-rigid 3D shapes. Every \(\mathbf {G}_t\) lies in the SO(3) group, which we enforce using an axis-angle representation encoding the rotation by a vector \(\varvec{\alpha }_t = (\alpha ^x_t, \alpha ^y_t, \alpha ^z_t)\) related to \(\mathbf {G}_t\) by Rodrigues' rotation formula. Overall, the problem consists in estimating the time-varying 3D shapes \(\mathbf{S}_t\) as well as the camera motion \(\mathbf {G}_t\), \(t=\{1,\ldots ,T\}\), from the 2D trajectories \(\mathbf{W}\).
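For concreteness, a minimal sketch of this measurement model in code; the helper names are ours, and the per-frame rotation blocks are built from the axis-angle vectors via Rodrigues' formula:

```python
import numpy as np

def rodrigues(alpha):
    """Full 3x3 rotation G_t from an axis-angle vector alpha (Rodrigues' formula)."""
    theta = np.linalg.norm(alpha)
    if theta < 1e-12:
        return np.eye(3)
    k = alpha / theta
    K = np.array([[0., -k[2], k[1]], [k[2], 0., -k[0]], [-k[1], k[0], 0.]])
    return np.eye(3) + np.sin(theta) * K + (1. - np.cos(theta)) * (K @ K)

def center_measurements(W_hat):
    """Subtract the per-frame translation t_t (the 2D centroid) from stacked (2T, P) tracks."""
    return W_hat - W_hat.mean(axis=1, keepdims=True)

Pi = np.array([[1., 0., 0.], [0., 1., 0.]])   # truncated identity: R_t = Pi @ G_t
```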
4 Deformation Model with Shape Auto-Decoder
In the case of dynamic objects, the 3D shape changes as a function of time. Usually, this function is unknown, and many efforts have been made to model it. The type of deformation model largely determines which observed non-rigid states can be accurately reconstructed, i.e., the goal is to find a simple model with high expressiveness. In this context, perhaps the most used model in the literature consists in enforcing the deforming shape to lie in a linear subspace [11]. While this model has proved effective, the manner in which the shape bases are estimated can be decisive. For example, it is well known that some constraints cannot be effectively imposed in factorisation methods [11, 67], motivating more sophisticated optimisation approaches [3, 16, 69]. In this paper, we depart from the traditional formulations based on linear subspace models and embrace a different formulation that regresses the deformation modes in an unsupervised manner during neural network training; see Fig. 2 for a method overview. By controlling the architecture and composition of the layers, we can express our assumptions about the complexity and type of the observed deformations. We use the name Neural Non-Rigid Structure from Motion (N-NRSfM) to denote our approach.
4.1 Modelling Deformation with Neural Networks
We propose to implement our non-rigid model network as a deformation auto-decoder \(f_{\varvec{\theta }}\), as was done for rigid shape categories [44], where \(\varvec{\theta }\) denotes the learned network parameters. Specifically, we construct \(f_{\varvec{\theta }}\) as a series of nine fully-connected layers with small hidden dimensions \((2,8,8,8,16,32,32,B,|\mathbf{S}_t|)\) and exponential linear unit (ELU) activations [14] (except after the penultimate and final layers). B—set to 32 by default—can be interpreted as an analogue of the number of basis shapes in linear subspace models. \(f_{\varvec{\theta }}\) is a function of the latent variable \(\mathbf{z}_t\), which is related to the shape \(\mathbf{S}_t\) by means of:

\(\mathbf{S}_t = \bar{\mathbf{S}} + f_{\varvec{\theta }}(\mathbf{z}_t), \qquad (2)\)

where \(\bar{\mathbf{S}}\) is a \(3 \times P\) mean shape matrix. We can also obtain the time-varying shape \(\mathbf{S}\) in Eq. (1) by \(\mathbf{S} =(\mathbf {1}_{T}\otimes \bar{\mathbf{S}})+ f_{\varvec{\theta }}(\mathbf{z})\), with \(\mathbf {1}_T\) a T-dimensional vector of ones and \(\otimes \) the Kronecker product. The fully-connected layers of \(f_{\varvec{\theta }}\) are initialised using He initialisation [31], and the bias of the last layer is set to a rigid shape estimate \(\bar{\mathbf{S}}\), which is kept fixed during optimisation. Both \(\bar{\mathbf{S}}\) and \(\mathbf {R}_t\), \(t=\{1,\ldots ,T\}\), are initialised by rigid factorisation [60] from \(\mathbf{W}\). Note that we estimate displacements (coded by \(f_{\varvec{\theta }}(\mathbf{z}_t)\)) from \(\bar{\mathbf{S}}\) instead of absolute point positions. Consequently, the weight matrix of the final fully-connected layer of \(f_{\varvec{\theta }}\) can be interpreted as a low-rank linear subspace in which every vector denotes a 3D displacement from the mean shape. This contributes to the compactness of the recovered space and serves as an additional constraint, similar to common practice in principal component analysis [46].
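As an illustration, the layer specification above can be sketched in PyTorch as follows; this is a simplified reading (the class and variable names are ours, and details such as He initialisation and the fixed last-layer bias are omitted):

```python
import torch
import torch.nn as nn

class DeformationAutoDecoder(nn.Module):
    """f_theta: maps a per-frame latent code z_t to 3D displacements from the mean shape."""
    def __init__(self, num_points, B=32):
        super().__init__()
        dims = (2, 8, 8, 8, 16, 32, 32, B, 3 * num_points)
        layers = []
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 3:  # ELU after all layers except the penultimate and final
                layers.append(nn.ELU())
        self.net = nn.Sequential(*layers)

    def forward(self, z):          # z: (T, 2), one latent code per frame
        disp = self.net(z)         # (T, 3 * P) stacked per-point displacements
        return disp.view(z.shape[0], 3, -1)  # (T, 3, P); S_t = S_bar + f_theta(z_t)
```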
To learn \(\varvec{\theta }\) and update it during training, we require gradients with respect to the full energy \(\mathbf {E}\) that we propose below, such that:

\(\frac{\partial \mathbf{E}}{\partial \varvec{\theta }} = \frac{\partial \mathbf{E}}{\partial f_{\varvec{\theta }}(\mathbf{z})}\,\frac{\partial f_{\varvec{\theta }}(\mathbf{z})}{\partial \varvec{\theta }}, \qquad (3)\)

connecting \(f_{\varvec{\theta }}\) into a fully differentiable loss function, in which \(\mathbf{S}_t\), \(t=\{1,\ldots ,T\}\), are optimised as free variables via gradients. We next describe our novel energy function \(\mathbf {E}\), which is compatible with \(f_{\varvec{\theta }}\) and supports gradient back-propagation.
4.2 Differentiable Energy Function
To solve the NRSfM problem as defined in Sect. 3, we propose to minimise a differentiable energy function with respect to the motion parameters \(\mathbf {R}\) and the shape parameters (coded by \(\varvec{\theta }\) and \(\mathbf{z}\)):

\(\mathbf{E} = \mathbf{E}_{\text {data}} + \beta \,\mathbf{E}_{\text {temp}} + \gamma \,\mathbf{E}_{\text {spat}} + \eta \,\mathbf{E}_{\text {traj}} + \omega \,\mathbf{E}_{\text {latent}}, \qquad (4)\)

where \(\mathbf {E}_{\text {data}}\) is a data term, and \(\{\mathbf {E}_{\text {temp}},\mathbf {E}_{\text {spat}},\mathbf {E}_{\text {traj}},\mathbf {E}_{\text {latent}}\}\) encode the priors that we consider. \(\beta \), \(\gamma \), \(\eta \) and \(\omega \) are weight coefficients balancing the influence of every term. We now describe each of these terms in detail.
The data term \(\mathbf {E}_{\text {data}}\) is derived from the projection equation (1) and penalises the image re-projection errors:

\(\mathbf{E}_{\text {data}} = \left\Vert \mathbf{W} - \mathbf{R}\,\mathbf{S} \right\Vert _\epsilon , \qquad (5)\)

where \(\left\Vert \cdot \right\Vert _\epsilon \) denotes the Huber loss of a matrix.
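In code, this term is a one-liner (a sketch; PyTorch's `huber_loss` with sum reduction stands in for the matrix Huber loss \(\left\Vert \cdot \right\Vert _\epsilon \)):

```python
import torch.nn.functional as F

def data_term(W, R, S):
    """Huber penalty on the dense re-projection residual W - R S.
    W: (2T, P) centred tracks; R: (2T, 3T) block-diagonal cameras; S: (3T, P) shapes."""
    return F.huber_loss(R @ S, W, reduction='sum')
```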
The temporal smoothness term \(\mathbf {E}_{\text {temp}}\) enforces temporally smooth regularisation of the 3D shape via its latent space:

\(\mathbf{E}_{\text {temp}} = \sum _{t=1}^{T-1} \left\Vert f_{\varvec{\theta }}(\mathbf{z}_{t+1}) - f_{\varvec{\theta }}(\mathbf{z}_t) \right\Vert _\epsilon . \qquad (6)\)

Thanks to this soft-constraint prior, our algorithm generates clean surfaces, which also stabilises the camera motion estimation.
The spatial smoothness term \(\mathbf {E}_{\text {spat}}\) imposes spatially smooth regularisation over a neighbourhood. This is especially relevant for dense observations, where most of the points in a local neighbourhood follow a similar motion pattern. To define this constraint, let \(\mathcal {N}(\mathbf{p})\) be the 1-ring neighbourhood of \(\mathbf {p}\in \mathbf{S}_t\), which we use to define a Laplacian term (widely used in computer graphics [55]). For robustness, we complement the spatial smoothness with a depth control term. Combining both ideas, we define this term as:

\(\mathbf{E}_{\text {spat}} = \sum _{t} \Big ( \sum _{\mathbf{p} \in \mathbf{S}_t} \Big \Vert \mathbf{p} - \frac{1}{|\mathcal {N}(\mathbf{p})|} \sum _{\mathbf{q} \in \mathcal {N}(\mathbf{p})} \mathbf{q} \Big \Vert _1 - \lambda \left\Vert \mathcal {P}_z(\mathbf{S}_t) \right\Vert _2 \Big ), \qquad (7)\)

where \(\mathcal {P}_z\) denotes an operator extracting the z-coordinates, \(\left\Vert \cdot \right\Vert _1\) and \(\left\Vert \cdot \right\Vert _2\) are the \(l_1\)- and \(l_2\)-norm, respectively, and \(\lambda >0\) is a weight coefficient. Thanks to the depth term, our N-NRSfM approach automatically gains more supervision over the z-coordinates of the 3D shapes, since it encourages an increase of the shape extent along the z-axis and thereby counteracts the flattening bias of the orthographic setting.
The point trajectory term \(\mathbf {E}_{\text {traj}}\) imposes a subspace constraint on the point trajectories throughout the whole sequence, as exploited in [6, 7]. To this end, the 3D point trajectories are coded as a linear combination of K fixed trajectory vectors, collected in a \(T \times K\) matrix \(\mathbf{\Phi }\), together with a \(3K \times P\) matrix \(\mathbf {A}\) of unknown coefficients. The penalty term can then be written as:

\(\mathbf{E}_{\text {traj}} = \left\Vert \mathbf{S} - (\mathbf{\Phi } \otimes \mathbf{I}_3)\,\mathbf {A} \right\Vert _\epsilon , \qquad (8)\)

where \(\phi _{t,k} = \frac{\sigma _k}{\sqrt{2}} \cos \big ( \frac{\pi }{2T}(2t - 1)(k - 1)\big )\) are the entries of \(\mathbf{\Phi }\), with \(\sigma _k = 1\) for \(k = 1\) and \(\sigma _k = \sqrt{2}\) otherwise, and \(\mathbf{I}_3\) is the \(3 \times 3\) identity matrix. We experimentally find that this term is not redundant with the remaining terms and provides a soft regularisation of \(f_{\varvec{\theta }}\).
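The fixed basis \(\mathbf{\Phi }\) can be precomputed directly from this formula, e.g.:

```python
import numpy as np

def dct_trajectory_basis(T, K):
    """T x K trajectory basis with phi_{t,k} = sigma_k/sqrt(2) * cos(pi/(2T)*(2t-1)*(k-1))."""
    t = np.arange(1, T + 1)[:, None]    # frame index t = 1..T
    k = np.arange(1, K + 1)[None, :]    # basis index k = 1..K
    sigma = np.where(k == 1, 1.0, np.sqrt(2.0))
    return sigma / np.sqrt(2.0) * np.cos(np.pi / (2 * T) * (2 * t - 1) * (k - 1))
```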
Finally, the latent term \(\mathbf {E}_{\text {latent}}\) imposes sparsity constraints on the latent vector z. This type of regularisation is enabled by the new form of the deformation model, i.e., the auto-decoder \(f_{\varvec{\theta }}\), and can be expressed as:

\(\mathbf{E}_{\text {latent}} = \left\Vert \mathcal {F}(\mathbf{z}) \right\Vert _1, \qquad (9)\)

where \(\mathcal {F}(\cdot )\) denotes the Fourier transform (FT) operator. Thanks to this penalty term, we can impose several effects which were previously not possible. First, \(\mathbf{E}_{\text {latent}}\) imposes structure on the latent space by encouraging the sparsity of the Fourier series and removing less relevant frequency components. In other words, it can be interpreted as a subspace constraint on the trajectory of the latent variable, where the basis trajectories are periodic functions. Second, by analysing the structured latent space, we can extract the period of a periodic sequence and temporally segment the shapes according to their similarity. Our motivation for \(\mathbf{E}_{\text {latent}}\) partially comes from the observation that many real-world scenes evince recurrence, i.e., they repeat their non-rigid states in either a periodic or non-periodic manner.
Period Detection and Sequence Segmentation. The period of the sequence can be recovered from the estimated \(\mathcal {F}(\mathbf{z})\) by extracting the frequency that dominates the spectrum in terms of energy. If a dominant frequency \(\omega _{d}\) is identified, the period can be directly computed as \(\tau = \frac{T}{\omega _d}\). Unfortunately, in some real scenarios, the obtained frequency spectrum may not be unimodal (two or more relevant peaks can be observed), in which case we set \(\tau = T\). Irrespective of whether a sequence is periodic or not, the latent space is temporally segmented so that similar values are decoded into similar shapes. This enables applications such as shape interpolation, completion and denoising.
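A sketch of the corresponding computations, assuming an \(l_1\)-penalty on the FT magnitudes and, for simplicity, taking the single strongest non-DC frequency as \(\omega _d\) (a full implementation would additionally test whether the spectrum is unimodal):

```python
import torch

def latent_term(z):
    """E_latent: l1-sparsity of the Fourier transform of the latent trajectory z (T, d)."""
    return torch.fft.rfft(z, dim=0).abs().sum()

def detect_period(z):
    """Recover tau = T / omega_d from the dominant non-DC frequency of the spectrum."""
    T = z.shape[0]
    spectrum = torch.fft.rfft(z - z.mean(dim=0), dim=0).abs().sum(dim=1)
    omega_d = int(torch.argmax(spectrum[1:])) + 1   # skip the DC bin
    return T / omega_d
```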
4.3 Implementation Details
The proposed energy in Eq. (4) and the deformation auto-decoder \(f_{\varvec{\theta }}\) are fully differentiable by construction, and therefore the gradients that flow into \(\mathbf{S}_t\) can be further back-propagated into \(\varvec{\theta }\). Our deformation model is trained to simultaneously recover the motion parameters \(\mathbf {R}\), the latent space z encoding the shape deformations, and the model parameters \(\varvec{\theta }\). Additionally, the trajectory coefficients in \(\mathbf {A}\) are learned in the same manner (see Eq. (8)). For initialisation, we use rigid factorisation to obtain \(\mathbf {R}\) and \(\bar{\mathbf{S}}\), random values in the interval \([-1, 1]\) for z, and a null matrix for \(\mathbf {A}\). The weights \(\beta , \gamma , \eta , \omega , \lambda \) are determined empirically and selected from the following ranges in most experiments described in Sect. 5, unless mentioned otherwise: we weight \(\mathbf {E}_{\text {data}}\) by \(10^2\) and set \(\beta =1\), \(\gamma \in [10^{-6}, 10^{-4}]\), \(\eta \in [1, 10]\), \(\omega =1\), \(\lambda \in [0, 10^{-3}]\) and \(B = 32\) in \(f_{\varvec{\theta }}\). In addition, we use \(K=7\) as the default value for our low-rank trajectory model in Eq. (8).
Our N-NRSfM approach is implemented in PyTorch [45]. As all the training data are available at the same time, we use the RProp optimiser [48] with a learning rate of 0.0001 and train for 60,000 epochs. All experiments are performed on NVIDIA Tesla V100 and K80 GPUs under a Debian 9 operating system. Depending on the size of the dataset, training takes between three (e.g., the back sequence [50]) and twelve (the barn-owl sequence [26]) hours on our hardware.
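To illustrate the batch optimisation setup, the following is a toy, runnable skeleton with randomly generated data; it uses only the data term of Eq. (5) and optimises the shapes directly as free variables, whereas the full method decodes them from \(\mathbf{z}\) via \(f_{\varvec{\theta }}\) and sums all terms of Eq. (4):

```python
import torch
import torch.nn.functional as F

T, P = 100, 500                                  # toy sequence: T frames, P points
W = torch.randn(2 * T, P)                        # placeholder centred 2D tracks
S = torch.randn(3 * T, P, requires_grad=True)    # shapes as free variables

# Fixed block-diagonal camera matrix R (identity rotations, for brevity).
R = torch.zeros(2 * T, 3 * T)
Pi = torch.tensor([[1., 0., 0.], [0., 1., 0.]])
for t in range(T):
    R[2 * t:2 * t + 2, 3 * t:3 * t + 3] = Pi

opt = torch.optim.Rprop([S], lr=1e-4)            # batch setting: all data available at once
for epoch in range(1000):
    opt.zero_grad()
    E = F.huber_loss(R @ S, W, reduction='sum')  # E_data of Eq. (5)
    E.backward()
    opt.step()
```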
4.4 Applications of the Deformation Auto-Decoder \(f_\theta \)
Our deformation auto-decoder \(f_{\varvec{\theta }}\) can be used for several applications which were not easily possible in the context of NRSfM before, including shape denoising, shape completion and interpolation as well as correspondence-free monocular 3D reconstruction of non-rigid surfaces with reoccurring deformations.
Shape Compression, Interpolation, Denoising and Completion. The trained \(f_{\varvec{\theta }}\) combined with \(\bar{\mathbf{S}}\) represents a compressed version of a 4D reconstruction and requires much less memory than the uncompressed shapes in the explicit representation \(\mathbf{S}_t\), \(t=\{1,\ldots ,T\}\). The number of parameters required to capture all 3D deformations accurately depends on the complexity of the observed deformations and not on the length of the sequence. Thus, the longer a sequence with repetitive states is, the higher the compression ratio c. Next, suppose we are given a partial and noisy shape \(\tilde{\mathbf{S}}\), and we would like to obtain a complete and smooth version \(\mathbf{S}_{\varvec{\theta }}\) of it based on the learned deformation model prior. We use our pre-trained auto-decoder and optimise for the latent code z, using the per-vertex error as the loss. In the case of a partial shape, the unknown vertices are assigned dummy values. Moreover, since the learned latent space is smooth and statistically assigns similar variables to similar shapes (displacements), we can interpolate the latent variables, which results in a smooth interpolation of the shapes (displacements).
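As an example of completion, one can freeze the trained decoder and fit only a latent code to the observed vertices. A minimal sketch (all names are ours; unlike the dummy-value strategy above, unknown vertices are simply excluded from the loss here, and Adam stands in for an unspecified optimiser):

```python
import torch

def complete_shape(decoder, S_bar, S_partial, observed, iters=500, lr=1e-2):
    """Fit a latent code to a partial/noisy shape under the frozen decoder prior.
    S_bar, S_partial: (3, P) tensors; observed: boolean (P,) mask of known vertices."""
    z = torch.zeros(1, 2, requires_grad=True)   # latent dimensionality 2, as in f_theta
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        S_hat = S_bar + decoder(z)[0]           # decoded shape, (3, P)
        loss = ((S_hat - S_partial)[:, observed] ** 2).mean()  # per-vertex error
        loss.backward()
        opt.step()
    return (S_bar + decoder(z)[0]).detach()     # completed, smooth shape
```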
Direct Monocular Non-rigid 3D Reconstruction with Occlusion Handling. The pre-trained \(f_{\varvec{\theta }}\) can also be combined with other machine-learning components. We are interested in direct monocular non-rigid 3D reconstruction for endoscopic scenarios (though N-NRSfM is not restricted to those). Therefore, we train an image encoder which relates images to the latent space of shapes resulting from the N-NRSfM training. Such an image-to-mesh encoder-decoder is also robust against moderate partial scene occlusions—which frequently occur in endoscopic scenarios—as the deformation model \(f_{\varvec{\theta }}\) can also rely on partial observations. We build the image encoder on ResNet-50 [32] pre-trained on the ImageNet [17] dataset.
At test time, we can reconstruct a surface from a single image, assuming state recurrence. Since the latent space is structured, we also model in-between states obtained by interpolation of the observed surfaces. This contrasts with the DSPR method [25], which de facto allows only state re-identification. Moreover, with gradual degradation of the views, the accuracy of our image-to-surface reconstructor degrades gracefully. We can feed images with occlusions or a constant camera pose bias—such as those observed when changing from the left to the right camera in stereo recordings—and still expect accurate reconstructions.
5 Experiments
In this section, we describe the experimental results. We first compare our N-NRSfM approach to competing approaches on several widely-used benchmarks and real datasets, following the established evaluation methodology for NRSfM (Sect. 5.1). We then evaluate how accurately our method detects periods and how well it segments sequences with non-periodic state recurrence (Sect. 5.2). For the sequences with 3D ground truth geometry \(\mathbf{S}^{\text {GT}}\), we report the 3D error \(e_{3D}\)—after shape-wise orthogonal Procrustes alignment—defined as \(e_{3D} = \frac{1}{T} \sum _t \frac{ \left\Vert \mathbf{S}_t^{\text {GT}} - \mathbf{S}_t \right\Vert _\mathcal {F} }{ \left\Vert \mathbf{S}_t^{\text {GT}}\right\Vert _\mathcal {F} }\), where \(\left\Vert \cdot \right\Vert _\mathcal {F}\) denotes the Frobenius norm. Note that \(e_{3D}\) also implicitly evaluates the accuracy of \(\mathbf {R}_t\) because of the mutual dependence between \(\mathbf {R}_t\) and \(\mathbf{S}_t\). Finally, for periodic sequences, we compare the estimated period \(\tau \) with the known one \(\tau ^{\text {GT}}\).
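For reference, a sketch of this metric with per-frame orthogonal Procrustes (Kabsch) alignment, with shapes given as \(3\times P\) arrays:

```python
import numpy as np

def e_3d(S_gt, S_est):
    """Mean relative Frobenius error after per-frame orthogonal Procrustes alignment."""
    errs = []
    for gt, est in zip(S_gt, S_est):            # per frame, each (3, P)
        U, _, Vt = np.linalg.svd(gt @ est.T)
        D = np.diag([1., 1., np.sign(np.linalg.det(U @ Vt))])  # enforce det(R) = +1
        R = U @ D @ Vt                          # rotation best aligning est to gt
        errs.append(np.linalg.norm(gt - R @ est) / np.linalg.norm(gt))
    return float(np.mean(errs))
```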
5.1 Quantitative Comparisons
We use three benchmark datasets in the quantitative comparison: synthetic faces (two sequences with 99 frames and two different camera trajectories, denoted by traj. A and traj. B, with 28,000 points per frame) [19], expressions (384 frames with 997 points per frame) [4], and the Kinect t-shirt (313 frames with 77,000 points) and paper (193 frames with 58,000 points) sequences taken from [64]. If 3D ground truth shapes are available, ground truth dense point tracks are obtained by a virtual orthographic camera. Otherwise, dense correspondences are calculated by multi-frame optical flow [20, 57].
Synthetic Faces. The \(e_{3D}\) for the synthetic faces are reported in Table 1. We compare our N-NRSfM to Metric Projections (MP) [43], the Trajectory Basis (TB) approach [7], the Variational Approach (VA) [19], the Dense Spatio-Temporal Approach (DSTA) [15], Coherent Depth Fields (CDF) [23], Consolidating Monocular Dynamic Reconstruction (CMDR) [24, 25], Grassmannian Manifold (GM) [37], Jumping Manifolds (JM) [36], Scalable Monocular Surface Reconstruction (SMSR) [8], the Expectation-Maximisation Finite Element Method (EM-FEM) [1] and the Probabilistic Point Trajectory Approach (PPTA) [6]. Our N-NRSfM comes close to the most accurate methods on traj. A and places in the middle of all methods on traj. B. Note that GM and JM use Procrustes alignment with scaling, which results in slightly differing metrics; we still include these methods for completeness. Traj. B is reportedly more challenging than traj. A for all tested methods, which we also confirm in our runs. We observed that without the depth control term in Eq. (7), the \(e_{3D}\) on traj. B was higher by \({\sim }30\%\). Figure 4-(a) displays the effect of Eq. (7) on the 3D reconstructions from real images, where the dense point tracks and initialisations can be noisy.
Expressions. The expressions sequence allows us to compare N-NRSfM to even more methods from the literature, including the Expectation-Maximisation Linear Dynamical System (EM-LDS) [61], Column Space Fitting, version 2 (CSF2) [29], the Kernel Shape Trajectory Approach (KSTA) [28] and the Global Model with Local Interpretation (GMLI) [4]. The results are summarised in Table 2. We achieve \(e_{3D} = 0.026\), on par with GMLI, i.e., currently the best method on this sequence. The complexity of the facial deformations in expressions is similar to that of the synthetic faces [19]. This experiment shows that our novel neural model for NRSfM with constraints in metric and trajectory space is superior to multiple older NRSfM methods.
Kinect Sequences. For a fair evaluation, we pre-process the Kinect t-shirt and paper sequences along with their respective reference depth measurements as described in Kumar et al. [37]. As suggested there, we run multi-frame optical flow [20] with default parameters to obtain dense correspondences. The \(e_{3D}\) for the Kinect sequences are listed in Table 3. Visualisations of selected reconstructions of the Kinect sequences can be found in Fig. 6-(top row). On the Kinect paper and t-shirt sequences, we outperform all competing methods, including the current state of the art, by margins of \(1\%\) and \(20\%\), respectively. These sequences evince larger deformations than the face sequences but a simpler camera trajectory.
5.2 Period Detection and Sequence Segmentation
We evaluate the capability of our N-NRSfM method in period detection and sequence segmentation on the actor mocap sequence (100 frames with \(3.5 \cdot 10^4\) points per frame) [25, 63]. It has highly deformed facial expressions with ground truth shapes, ground truth dense flow fields and rendered images under orthographic projection. We duplicate the sequence and run N-NRSfM on the resulting point tracks. Our approach reconstructs the entire sequence and returns a dominant frequency of 2, as can be seen in the Fourier spectrum. Given 200 input frames, this implies a period of 100. The latent space function for this experiment and its evolution are shown in Fig. 3. Note that for the same shapes, the resulting latent variables are also similar. This confirms that our N-NRSfM segments the sequence based on shape similarity.
Next, we confirm that the period detection works well on a real heart bypass surgery sequence [56] with 201 frames and 68,000 points per frame (see Fig. 6-(bottom right) for exemplary frames and our reconstructions). This sequence evinces natural periodicity, and the flow fields are computed individually for every frame without duplication. We emphasise that the images do not repeat, as—even though the states are recurrent—they are observed under varying illumination and different occlusions. We recover a dominant frequency of 7.035, whereas the observed number of heartbeats amounts to \({\sim }7.2\). Knowing that the video was recorded at 24 frames per second, we obtain a pulse of \(7.035\,\text {beats} \cdot \frac{24\,\text {fps}}{201\,\text {frames}} = 0.84\) beats per second, or \({\sim }50\) beats per minute—which is in the expected pulse range of a human during bypass surgery.
5.3 Qualitative Results and Applications
The actor mocap sequence allows us to qualitatively compare N-NRSfM to a state-of-the-art method for monocular 3D face reconstruction. Thus, we run the Face Model Learning (FML) approach of Tewari et al. [58] on it and show qualitative results in Fig. 4-(b). We observe that it is difficult to recognise the person in the FML 3D estimates (\(e_{3D} = 0.092\) after Procrustes alignment of the ground truth shapes and the FML reconstructions, with re-scaling of the latter). Since FML runs per-frame, its 3D shapes evince variation going beyond changing facial expressions, i.e., the identity changes. In contrast, N-NRSfM produces recognisable and consistent shapes at the cost of requiring accurate dense correspondences across an image batch (\(e_{3D} = 0.0181\), \({\sim }5\) times lower than the \(e_{3D} = 0.092\) of FML).
Our auto-decoder \(f_{\varvec{\theta }}\) is a flexible building block which can be used in multiple applications which were not easily possible with classical NRSfM methods. Those include shape completion, denoising, compression and interpolation, fast direct monocular non-rigid 3D reconstruction as well as sequence segmentation.
Shape Interpolation and Completion. To obtain shape interpolations, we can linearly interpolate the latent variables; see Fig. 5-(top row) for an example with the actor mocap reconstructions. Note that the interpolation result depends on the shape order in the latent space. For shapes with significantly differing latent variables, the resulting interpolations may deviate from linear interpolations between the shapes and include non-linear point trajectories. Results of shape denoising and completion are shown in Fig. 5-(bottom rows). We feed point clouds with missing areas (the mouth and the upper head area) and obtain surfaces completed upon our learned \(f_{\varvec{\theta }}\) prior.
Direct Monocular Non-rigid 3D Reconstruction. We attach an image encoder to \(f_{\varvec{\theta }}\)—as described in Sect. 4.4—and test it in the endoscopic scenario with the heart sequence. Our reconstructions follow the cardiac cycle outside of the image sub-sequence, which has been used for the training. Please, see our supplemental material for extra visualisations.
Real Image Sequences. Finally, we reconstruct several real image sequences, i.e., barn owl [26], back [50] (see Fig. 6) and real face [19] (see Fig. 4-(a), which also highlights the influence of the spatial smoothness term). All our reconstructions are of high visual quality and match the state of the art. Please see our supplementary video for time-varying visualisations.
6 Concluding Remarks
This paper introduces the first end-to-end trainable neural dense NRSfM method with a deformation model auto-decoder and a learnable latent space function. Our approach operates on dense 2D point tracks without 3D supervision. Structuring the latent space to detect and exploit periodicity is a promising first step towards new regularisation techniques for NRSfM. Period detection and temporal segmentation of the reconstructed sequences, an automatically learned deformation model, shape compression, completion and interpolation—all this is obtained with a single neural component in our formulation. Experiments have shown that the new model results in smooth and accurate surfaces while achieving low 3D reconstruction errors in a variety of scenarios. One limitation of N-NRSfM is its sensitivity to inaccurate point tracks and its dependence on the mean shape obtained by rigid initialisation. We also found that our method does not cope well with large and sudden changes, even when the mean shape is plausible. Another limitation is the handling of articulated motions.
We believe that our work opens a new perspective on dense NRSfM. In future research, more sophisticated neural components for deformation models can be tested to support stronger non-linear deformations and composite scenes. Moreover, we plan to generalise our model to sequential NRSfM scenarios.
References
Agudo, A., Montiel, J.M.M., Agapito, L., Calvo, B.: Online dense non-rigid 3D shape and camera motion recovery. In: British Machine Vision Conference (BMVC) (2014)
Agudo, A., Montiel, J.M.M., Calvo, B., Moreno-Noguer, F.: Mode-shape interpretation: re-thinking modal space for recovering deformable shapes. In: Winter Conference on Applications of Computer Vision (WACV) (2016)
Agudo, A., Moreno-Noguer, F.: DUST: dual union of spatio-temporal subspaces for monocular multiple object 3D reconstruction. In: Computer Vision and Pattern Recognition (CVPR) (2017)
Agudo, A., Moreno-Noguer, F.: Global model with local interpretation for dynamic shape reconstruction. In: Winter Conference on Applications of Computer Vision (WACV) (2017)
Agudo, A., Moreno-Noguer, F.: Force-based representation for non-rigid shape and elastic model estimation. Trans. Pattern Anal. Mach. Intell. (TPAMI) 40(9), 2137–2150 (2018)
Agudo, A., Moreno-Noguer, F.: A scalable, efficient, and accurate solution to non-rigid structure from motion. Comput. Vis. Image Underst. (CVIU) 167, 121–133 (2018)
Akhter, I., Sheikh, Y., Khan, S., Kanade, T.: Trajectory space: a dual representation for nonrigid structure from motion. Trans. Pattern Anal. Mach. Intell. (TPAMI) 33(7), 1442–1456 (2011)
Ansari, M., Golyanik, V., Stricker, D.: Scalable dense monocular surface reconstruction. In: International Conference on 3D Vision (3DV) (2017)
Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. Int. J. Comput. Vis. (IJCV) 92(1), 1–31 (2011)
Bartoli, A., Gay-Bellile, V., Castellani, U., Peyras, J., Olsen, S., Sayd, P.: Coarse-to-fine low-rank structure-from-motion. In: Computer Vision and Pattern Recognition (CVPR) (2008)
Bregler, C., Hertzmann, A., Biermann, H.: Recovering non-rigid 3D shape from image streams. In: Computer Vision and Pattern Recognition (CVPR) (2000)
Bue, A.D.: A factorization approach to structure from motion with shape priors. In: Computer Vision and Pattern Recognition (CVPR) (2008)
Choy, C.B., Xu, D., Gwak, J.Y., Chen, K., Savarese, S.: 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 628–644. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_38
Clevert, D., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). In: International Conference on Learning Representations (ICLR) (2016)
Dai, Y., Deng, H., He, M.: Dense non-rigid structure-from-motion made easy - a spatial-temporal smoothness based solution. In: International Conference on Image Processing (ICIP), pp. 4532–4536 (2017)
Dai, Y., Li, H., He, M.: Simple prior-free method for non-rigid structure-from-motion factorization. Int. J. Comput. Vis. (IJCV) 107, 101–122 (2014)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: Computer Vision and Pattern Recognition (CVPR) (2009)
Fan, H., Su, H., Guibas, L.J.: A point set generation network for 3D object reconstruction from a single image. In: Computer Vision and Pattern Recognition (CVPR) (2017)
Garg, R., Roussos, A., Agapito, L.: Dense variational reconstruction of non-rigid surfaces from monocular video. In: Computer Vision and Pattern Recognition (CVPR) (2013)
Garg, R., Roussos, A., Agapito, L.: A variational approach to video registration with subspace constraints. Int. J. Comput. Vis. (IJCV) 104(3), 286–314 (2013)
Golyanik, V., Fetzer, T., Stricker, D.: Accurate 3D reconstruction of dynamic scenes from monocular image sequences with severe occlusions. In: Winter Conference on Applications of Computer Vision (WACV), pp. 282–291 (2017)
Golyanik, V., Stricker, D.: Dense batch non-rigid structure from motion in a second. In: Winter Conference on Applications of Computer Vision (WACV), pp. 254–263 (2017)
Golyanik, V., Fetzer, T., Stricker, D.: Introduction to coherent depth fields for dense monocular surface recovery. In: British Machine Vision Conference (BMVC) (2017)
Golyanik, V., Jonas, A., Stricker, D.: Consolidating segmentwise non-rigid structure from motion. In: Machine Vision Applications (MVA) (2019)
Golyanik, V., Jonas, A., Stricker, D., Theobalt, C.: Intrinsic Dynamic Shape Prior for Fast, Sequential and Dense Non-Rigid Structure from Motion with Detection of Temporally-Disjoint Rigidity. arXiv e-prints (2019)
Golyanik, V., Mathur, A.S., Stricker, D.: NRSfm-Flow: recovering non-rigid scene flow from monocular image sequences. In: British Machine Vision Conference (BMVC) (2016)
Golyanik, V., Shimada, S., Varanasi, K., Stricker, D.: HDM-Net: monocular non-rigid 3D reconstruction with learned deformation model. In: Bourdot, P., Cobb, S., Interrante, V., Kato, H., Stricker, D. (eds.) EuroVR 2018. LNCS, vol. 11162, pp. 51–72. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01790-3_4
Gotardo, P.F.U., Martinez, A.M.: Kernel non-rigid structure from motion. In: International Conference on Computer Vision (ICCV), pp. 802–809 (2011)
Gotardo, P.F.U., Martinez, A.M.: Non-rigid structure from motion with complementary rank-3 spaces. In: Computer Vision and Pattern Recognition (CVPR), pp. 3065–3072 (2011)
Groueix, T., Fisher, M., Kim, V.G., Russell, B., Aubry, M.: AtlasNet: a Papier-Mâché approach to learning 3D surface generation. In: Computer Vision and Pattern Recognition (CVPR) (2018)
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: International Conference on Computer Vision (ICCV), pp. 1026–1034 (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
Kanazawa, A., Tulsiani, S., Efros, A.A., Malik, J.: Learning category-specific mesh reconstruction from image collections. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 386–402. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_23
Kong, C., Lucey, S.: Deep non-rigid structure from motion. In: International Conference on Computer Vision (ICCV) (2019)
Kovalenko, O., Golyanik, V., Malik, J., Elhayek, A., Stricker, D.: Structure from articulated motion: accurate and stable monocular 3D reconstruction without training data. Sensors 19(20), 4603 (2019)
Kumar, S.: Jumping manifolds: geometry aware dense non-rigid structure from motion. In: Computer Vision and Pattern Recognition (CVPR) (2019)
Kumar, S., Cherian, A., Dai, Y., Li, H.: Scalable dense non-rigid structure-from-motion: a Grassmannian perspective. In: Computer Vision and Pattern Recognition (CVPR) (2018)
Lee, M., Cho, J., Choi, C.H., Oh, S.: Procrustean normal distribution for non-rigid structure from motion. In: Computer Vision and Pattern Recognition (CVPR) (2013)
Lee, M., Choi, C.H., Oh, S.: A procrustean Markov process for non-rigid structure recovery. In: Computer Vision and Pattern Recognition (CVPR) (2014)
Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy networks: learning 3D reconstruction in function space. In: Computer Vision and Pattern Recognition (CVPR) (2019)
Novotny, D., Ravi, N., Graham, B., Neverova, N., Vedaldi, A.: C3DPO: canonical 3D pose networks for non-rigid structure from motion. In: International Conference on Computer Vision (ICCV) (2019)
Östlund, J., Varol, A., Ngo, D.T., Fua, P.: Laplacian meshes for monocular 3D shape recovery. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7574, pp. 412–425. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33712-3_30
Paladini, M., Del Bue, A., Xavier, J., Agapito, L., Stosić, M., Dodig, M.: Optimal metric projections for deformable and articulated structure-from-motion. Int. J. Comput. Vis. (IJCV) 96(2), 252–276 (2012)
Park, J.J., Florence, P., Straub, J., Newcombe, R., Lovegrove, S.: DeepSDF: learning continuous signed distance functions for shape representation. In: Computer Vision and Pattern Recognition (CVPR) (2019)
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)
Pearson, K.: On lines and planes of closest fit to systems of points in space. Philos. Mag. 2, 559–572 (1901)
Pumarola, A., Agudo, A., Porzi, L., Sanfeliu, A., Lepetit, V., Moreno-Noguer, F.: Geometry-aware network for non-rigid shape prediction from a single view. In: Computer Vision and Pattern Recognition (CVPR) (2018)
Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In: International Conference on Neural Networks (ICNN), pp. 586–591 (1993)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323, 533–536 (1986)
Russell, C., Fayad, J., Agapito, L.: Energy based multiple model fitting for non-rigid structure from motion. In: Computer Vision and Pattern Recognition (CVPR), pp. 3009–3016 (2011)
Russell, C., Fayad, J., Agapito, L.: Dense non-rigid structure from motion. In: 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization Transmission (3DIMPVT) (2012)
Sahasrabudhe, M., Shu, Z., Bartrum, E., Alp Güler, R., Samaras, D., Kokkinos, I.: Lifting autoencoders: unsupervised learning of a fully-disentangled 3D morphable model using deep non-rigid structure from motion. In: International Conference on Computer Vision Workshops (ICCVW) (2019)
Salzmann, M., Fua, P.: Reconstructing sharply folding surfaces: a convex formulation. In: Computer Vision and Pattern Recognition (CVPR), pp. 1054–1061 (2009)
Shimada, S., Golyanik, V., Theobalt, C., Stricker, D.: IsMo-GAN: adversarial learning for monocular non-rigid 3D reconstruction. In: Computer Vision and Pattern Recognition Workshops (CVPRW) (2019)
Sorkine, O.: Laplacian mesh processing. In: Annual Conference of the European Association for Computer Graphics (Eurographics) (2005)
Stoyanov, D.: Stereoscopic scene flow for robotic assisted minimally invasive surgery. In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012. LNCS, vol. 7510, pp. 479–486. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33415-3_59
Taetz, B., Bleser, G., Golyanik, V., Stricker, D.: Occlusion-aware video registration for highly non-rigid objects. In: Winter Conference on Applications of Computer Vision (WACV) (2016)
Tewari, A., et al.: FML: face model learning from videos. In: Computer Vision and Pattern Recognition (CVPR) (2019)
Tewari, A., et al.: MoFA: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In: International Conference on Computer Vision (ICCV) (2017)
Tomasi, C., Kanade, T.: Shape and motion from image streams under orthography: a factorization method. Int. J. Comput. Vis. (IJCV) 9(2), 137–154 (1992)
Torresani, L., Hertzmann, A., Bregler, C.: Nonrigid structure-from-motion: estimating shape and motion with hierarchical priors. Trans. Pattern Anal. Mach. Intell. (TPAMI) 30(5), 878–892 (2008)
Tsoli, A., Argyros, A.A.: Patch-based reconstruction of a textureless deformable 3D surface from a single RGB image. In: International Conference on Computer Vision Workshops (ICCVW) (2019)
Valgaerts, L., Wu, C., Bruhn, A., Seidel, H.P., Theobalt, C.: Lightweight binocular facial performance capture under uncontrolled lighting. ACM Trans. Graph. (TOG) 31(6), 187:1–187:11 (2012)
Varol, A., Salzmann, M., Fua, P., Urtasun, R.: A constrained latent variable model. In: Computer Vision and Pattern Recognition (CVPR) (2012)
Vicente, S., Agapito, L.: Soft inextensibility constraints for template-free non-rigid reconstruction. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7574, pp. 426–440. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33712-3_31
Wang, N., Zhang, Y., Li, Z., Fu, Y., Liu, W., Jiang, Y.G.: Pixel2Mesh: generating 3D mesh models from single RGB images. In: European Conference on Computer Vision (ECCV) (2018)
Xiao, J., Chai, J., Kanade, T.: A closed-form solution to non-rigid shape and motion recovery. In: European Conference on Computer Vision (ECCV) (2004)
Yu, R., Russell, C., Campbell, N.D.F., Agapito, L.: Direct, dense, and deformable: template-based non-rigid 3D reconstruction from RGB video. In: International Conference on Computer Vision (ICCV) (2015)
Zhu, Y., Huang, D., Torre, F.D.L., Lucey, S.: Complex non-rigid motion 3D reconstruction by union of subspaces. In: Computer Vision and Pattern Recognition (CVPR), pp. 1542–1549 (2014)
Acknowledgement
This work was supported by the ERC Consolidator Grant 4DReply (770784) and the Spanish Ministry of Science and Innovation under project HuMoUR TIN2017-90086-R. The authors thank Mallikarjun B R for help with running the FML method [58] on our data.