Abstract
We present an unsupervised approach for learning a layered representation of a scene from a video for motion segmentation. Our method is applicable to any video containing piecewise parametric motion. The learnt model is a composition of layers, each of which consists of one or more segments. The shape of each segment is represented using a binary matte, and its appearance is given by the RGB value for each point belonging to the matte. The model includes the effects of image projection, lighting, and motion blur. Furthermore, spatial continuity is explicitly modeled, resulting in contiguous segments. Unlike previous approaches, our method does not use reference frame(s) for initialization. The two main contributions of our method are: (i) a novel algorithm for obtaining the initial estimate of the model by dividing the scene into rigidly moving components using efficient loopy belief propagation; and (ii) refinement of the initial estimate using the αβ-swap and α-expansion algorithms, which guarantee a strong local minimum. Results are presented on several classes of objects with different types of camera motion, e.g. videos of a human walking shot with static or translating cameras. We compare our method with the state of the art and demonstrate significant improvements.
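As a rough illustration of the layered representation described in the abstract, the sketch below composes a frame from layers, each given by a binary matte and per-pixel RGB appearance. The function name `composite`, the back-to-front painting order, and the toy data are assumptions made for illustration only. The sketch omits the projection, lighting, and motion-blur effects that the model includes, and it is not the learning algorithm itself.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
Matte = List[List[int]]        # 1 where the segment is present, 0 elsewhere
Appearance = List[List[RGB]]   # RGB value for every pixel position


def composite(layers: List[Tuple[Matte, Appearance]],
              background: RGB, height: int, width: int) -> List[List[RGB]]:
    """Paint layers back to front; later layers occlude earlier ones."""
    frame = [[background for _ in range(width)] for _ in range(height)]
    for matte, appearance in layers:
        for y in range(height):
            for x in range(width):
                if matte[y][x]:  # pixel belongs to this segment
                    frame[y][x] = appearance[y][x]
    return frame


# Toy 2x3 example (hypothetical data): a red segment behind a blue segment.
RED, BLUE, GREY = (255, 0, 0), (0, 0, 255), (128, 128, 128)
red_layer = ([[1, 1, 0], [0, 0, 0]], [[RED] * 3 for _ in range(2)])
blue_layer = ([[0, 1, 0], [0, 1, 0]], [[BLUE] * 3 for _ in range(2)])

frame = composite([red_layer, blue_layer], GREY, height=2, width=3)
```

In this toy example the blue segment is listed last, so it occludes the red segment wherever their mattes overlap.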
Cite this article
Pawan Kumar, M., Torr, P.H.S. & Zisserman, A. Learning Layered Motion Segmentations of Video. Int J Comput Vis 76, 301–319 (2008). https://doi.org/10.1007/s11263-007-0064-x
DOI: https://doi.org/10.1007/s11263-007-0064-x