We introduce an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion, and depth in a monocular camera setup without geometric supervision. Our technical contributions are three-fold. First, we highlight the fundamental difference between inverse and forward projection while modeling the individual motion of each rigid object, and propose a geometrically correct projection pipeline using a neural forward projection module. Second, we propose two types of residual motion learning frameworks to explicitly disentangle camera and object motions in dynamic driving scenes with different levels of semantic prior knowledge: video instance segmentation as a strong prior, and object detection as a weak prior. Third, we design a unified photometric and geometric consistency loss that holistically imposes self-supervisory signals for every background and object region. Lastly, we present an unsupervised method of 3D motion field regularization for semantically plausible object motion representation. Our proposed elements are validated in a detailed ablation study. Through extensive experiments conducted on the KITTI, Cityscapes, and Waymo Open datasets, our framework is shown to outperform the state-of-the-art depth and motion estimation methods. Our code, dataset, and models are publicly available.
This is different from the reversed optical flow leveraged in Liu et al. (2019); Wang et al. (2019); Luo et al. (2019). Since flow-based warping techniques do not consider geometric structure, serious distortions appear where multiple source pixels are warped to the same target location, e.g., at object boundaries, as shown in Fig. 2b. Our distinction between forward and inverse warping is not about temporal order, but about the coordinate frame in which the geometric transformation is conducted when warping from the reference to the target view. Hereafter, we refer to forward projection as forward warping for consistency with inverse warping.
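The collision problem described above can be illustrated on a toy 1-D scanline. The following is a minimal sketch, not the neural forward projection module of this work: the function names `forward_warp` and `inverse_warp`, the hand-written index maps, and the depth-ordered (z-buffer) collision rule are all assumptions made for illustration. Forward warping moves each reference pixel to its target location and must resolve collisions and holes, whereas inverse warping looks up a source location for every target pixel.

```python
def forward_warp(values, depths, target_idx, width):
    """Splat each source pixel to its target index; on collision, the nearest depth wins."""
    out = [None] * width              # None marks holes that no source pixel reached
    zbuf = [float("inf")] * width     # smallest depth seen so far at each target pixel
    for v, z, t in zip(values, depths, target_idx):
        if 0 <= t < width and z < zbuf[t]:
            zbuf[t] = z
            out[t] = v
    return out


def inverse_warp(values, source_idx):
    """For each target pixel, fetch the source pixel it maps back to."""
    n = len(values)
    return [values[s] if 0 <= s < n else None for s in source_idx]


# Toy scanline (hypothetical data): background "b*" at depth 10, foreground "F2" at depth 2.
values = ["b0", "b1", "F2", "b3", "b4"]
depths = [10.0, 10.0, 2.0, 10.0, 10.0]

# Forward map, defined per source pixel: the object moves one pixel right,
# so source pixels 2 (object) and 3 (background) both land on target pixel 3.
target_idx = [0, 1, 3, 3, 4]
forward_result = forward_warp(values, depths, target_idx, width=5)

# Inverse map, defined per target pixel: target pixel 3 reads from source pixel 2,
# and the static target pixel 2 also reads from source pixel 2.
source_idx = [0, 1, 2, 2, 4]
inverse_result = inverse_warp(values, source_idx)

print(forward_result)
print(inverse_result)
```

In this toy case the forward result keeps the foreground pixel at the collision and leaves a hole where the object used to be, while the inverse result fills every target pixel but duplicates the object at the disoccluded location.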
Previous works (Gordon et al., 2019; Li et al., 2020) have alleviated this issue by applying a motion smoothness term. This is reasonable, but it regularizes only neighboring motion vectors. In contrast, our regularization method operates on the distribution of motion vectors. Considering the rigidity of the moving objects, e.g., mostly vehicles on traffic roads, we postulate that encouraging consistency over the whole set of motion vectors for each object is more helpful for learning a semantically plausible object motion field.
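The contrast between a local smoothness term and a set-level consistency term can be made concrete with a small sketch. This is not the regularizer proposed in this work: `smoothness_penalty` and `consensus_penalty` are hypothetical helpers, and the per-component median is an assumed stand-in for whatever consensus value the actual method converges to. The example shows a slowly drifting motion field that a neighbor-only penalty barely notices, while a penalty over the whole set of vectors flags it as non-rigid.

```python
from statistics import median


def smoothness_penalty(vectors):
    """Mean L1 difference between consecutive (neighboring) motion vectors."""
    diffs = [
        sum(abs(a - b) for a, b in zip(u, v))
        for u, v in zip(vectors, vectors[1:])
    ]
    return sum(diffs) / len(diffs)


def consensus_penalty(vectors):
    """Mean L1 distance of every vector to the per-component median of the set."""
    center = [median(component) for component in zip(*vectors)]
    dists = [sum(abs(a - c) for a, c in zip(v, center)) for v in vectors]
    return sum(dists) / len(dists)


# Hypothetical 3D motion vectors for the pixels of one object.
rigid = [(1.0, 0.0, 0.5)] * 10                      # identical vectors: a rigid translation
drifting = [(0.1 * i, 0.0, 0.5) for i in range(10)]  # small steps that add up to a large spread

smooth_rigid = smoothness_penalty(rigid)
consensus_rigid = consensus_penalty(rigid)
smooth_drift = smoothness_penalty(drifting)
consensus_drift = consensus_penalty(drifting)

print(smooth_drift, consensus_drift)
```

For the rigid field both penalties vanish. For the drifting field, each neighboring pair differs only slightly, so the smoothness penalty stays small, whereas the set-level penalty is larger because it measures the spread of the whole distribution.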
In our previous work (Lee et al., 2021), we proposed contrastive sample consensus (CSAC). While CSAC focuses on the motion boundary between the object and background (modulating two distributions), HSAC takes a more general perspective: it finds and converges to a target value by observing the internal distribution of the motion vectors, without any supervision.
This is why we postulate Assumption 1. We use this only for the initial object mask. Since it is not accurate, we calculate the regularization loss (\(\textsc{CalcPenalty}()\) in line 16) by excluding query vectors that deviate significantly (\(s_k < 0.01\)).
This work was supported by the KENTECH Research Grant (KRG2022-01-003), the DGIST R&D Program of the Ministry of Science and ICT (20-CoE-IT-01), and the International Research and Development Program of the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT under Grant NRF-2021K1A3A1A21040016.
Lee, S., Rameau, F., Im, S. et al. Self-Supervised Monocular Depth and Motion Learning in Dynamic Scenes: Semantic Prior to Rescue. Int J Comput Vis 130, 2265–2285 (2022). https://doi.org/10.1007/s11263-022-01641-5
