Unsupervised Segmentation in Real-World Images via Spelke Object Inference

Honglin Chen^12,
Rahul Venkatesh^12,
Yoni Friedman^15,
Jiajun Wu^12,
Joshua B. Tenenbaum^15,
Daniel L. K. Yamins^12,13,14 &
Daniel M. Bear^13,14

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13689)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Self-supervised, category-agnostic segmentation of real-world images is a challenging open problem in computer vision. Here, we show how to learn static grouping priors from motion self-supervision by building on the cognitive science concept of a Spelke Object: a set of physical stuff that moves together. We introduce the Excitatory-Inhibitory Segment Extraction Network (EISEN), which learns to extract pairwise affinity graphs for static scenes from motion-based training signals. EISEN then produces segments from affinities using a novel graph propagation and competition network. During training, objects that undergo correlated motion (such as robot arms and the objects they move) are decoupled by a bootstrapping process: EISEN explains away the motion of objects it has already learned to segment. We show that EISEN achieves a substantial improvement in the state of the art for self-supervised image segmentation on challenging synthetic and real-world robotics datasets.
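
To make the idea of turning pairwise affinities into segments concrete, here is a minimal sketch. It is not EISEN's learned propagation and competition network: it is a plain minimum-label propagation over a thresholded affinity matrix, which amounts to finding connected components. The function name `propagate_labels`, the `threshold` parameter, and the toy affinity values are all assumptions made for illustration.

```python
def propagate_labels(affinity, threshold=0.5, max_iters=100):
    """Group nodes by spreading the smallest label across strong edges.

    `affinity` is a symmetric n x n list of lists with values in [0, 1].
    Two nodes are treated as linked when their affinity is >= threshold.
    """
    n = len(affinity)
    labels = list(range(n))  # every node starts as its own segment
    for _ in range(max_iters):
        changed = False
        for i in range(n):
            for j in range(n):
                linked = i != j and affinity[i][j] >= threshold
                if linked and labels[j] < labels[i]:
                    labels[i] = labels[j]  # adopt the neighbour's smaller label
                    changed = True
        if not changed:  # labels are stable, so propagation has converged
            break
    return labels


# Toy graph: nodes 0 and 1 are strongly linked, as are nodes 2 and 3.
toy_affinity = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
segments = propagate_labels(toy_affinity)
print(segments)  # -> [0, 0, 2, 2]
```

A hard threshold is the simplest possible grouping rule; the abstract describes a learned network in its place, so this sketch only illustrates the input and output of that stage.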

D. L. K. Yamins and D. M. Bear—Equal contribution.

Notes

1. More formally, two pieces of stuff are considered to be in the same Spelke object if and only if, under the application of any sequence of actions that causes sustained motion of one of the pieces, the magnitude of the motion that the other piece experiences relative to the first is approximately zero compared to the magnitude of overall motion. Natural action groups arise from the set of all force applications exertable by a specific physical actuator, such as a pair of human hands or a robotic gripper.
2. If scenes are assumed to have at most one independent motion source, these are simply the pairs with \(\mathcal{I}(a) = \mathcal{I}(b) = 1\). This often holds in robotics scenes (and is perhaps the norm in a baby’s early visual experience) but not in many standard datasets (e.g., busy street scenes). We therefore handle the more general case.
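
The criterion in note 1 can be illustrated with a small sketch. It checks a single observed motion rather than "any sequence of actions", so it tests a necessary condition only. The helper names `displacement` and `moves_together`, the tolerance `tol`, and the example tracks are hypothetical and do not come from the paper.

```python
import math


def displacement(track):
    """Net 2D displacement between the first and last point of a track."""
    (x0, y0), (x1, y1) = track[0], track[-1]
    return (x1 - x0, y1 - y0)


def moves_together(track_a, track_b, tol=0.1):
    """Return True if b's motion relative to a is small compared to a's motion.

    Assumes track_a is the piece that was set in sustained motion.
    """
    da = displacement(track_a)
    db = displacement(track_b)
    overall = math.hypot(da[0], da[1])
    if overall == 0.0:
        raise ValueError("track_a must undergo sustained motion")
    relative = math.hypot(db[0] - da[0], db[1] - da[1])
    return relative / overall <= tol


# A pushed point, a point that travels with it, and a point that stays put.
pushed = [(0.0, 0.0), (10.0, 0.0)]
same_object = [(5.0, 5.0), (15.0, 5.5)]
background = [(3.0, 3.0), (3.0, 3.0)]

print(moves_together(pushed, same_object))  # -> True
print(moves_together(pushed, background))   # -> False
```

Normalising by the overall motion makes the test scale-free: a half-unit drift counts as "approximately zero" for a ten-unit push but would not for a one-unit push.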

Acknowledgements

J.B.T. is supported by NSF Science Technology Center Award CCF-1231216. D.L.K.Y. is supported by the NSF (RI 1703161 and CAREER Award 1844724) and hardware donations from the NVIDIA Corporation. J.B.T. and D.L.K.Y. are supported by the DARPA Machine Common Sense program. J.W. is in part supported by Stanford HAI, Samsung, ADI, Salesforce, Bosch, and Meta. D.M.B. is supported by a Wu Tsai Interdisciplinary Scholarship and is a Biogen Fellow of the Life Sciences Research Foundation. We thank Chaofei Fan and Drew Linsley for early discussions about EISEN.

Author information

Authors and Affiliations

Department of Computer Science, Stanford, USA
Honglin Chen, Rahul Venkatesh, Jiajun Wu & Daniel L. K. Yamins
Department of Psychology, Stanford, USA
Daniel L. K. Yamins & Daniel M. Bear
Wu Tsai Neurosciences Institute, Stanford, USA
Daniel L. K. Yamins & Daniel M. Bear
Department of Brain and Cognitive Sciences, CBMM, MIT, Cambridge, USA
Yoni Friedman & Joshua B. Tenenbaum

Corresponding author

Correspondence to Honglin Chen.

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

Cite this paper

Chen, H. et al. (2022). Unsupervised Segmentation in Real-World Images via Spelke Object Inference. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13689. Springer, Cham. https://doi.org/10.1007/978-3-031-19818-2_41

DOI: https://doi.org/10.1007/978-3-031-19818-2_41
Published: 22 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19817-5
Online ISBN: 978-3-031-19818-2
eBook Packages: Computer Science, Computer Science (R0)
