Abstract
Recently, self-supervised methods based on self-supervised transformer features have demonstrated promising results in unsupervised object localization. However, obtaining exceptional semantic results remains a formidable challenge. The current approaches heavily rely on the similarity between patch-level features within an image, while lacking supervision from image-level information. Meanwhile, the Segment Anything Model (SAM) has demonstrated remarkable class-agnostic segmentation capabilities for arbitrary objects in images with sparse prompts like points. In this work, we propose SelfLoc, a simple yet effective self-supervised object localization method via integration with self-prompt SAM. Specifically, a self-prompt generator is designed to automatically generate sparse prompts based on an image’s self-attention map. Simultaneously, an image-wise integration module is developed to enhance the coarse mask obtained from self-supervised features by leveraging the fine-grained segmentation results of SAM. Extensive experimental results demonstrate that the proposed method not only achieves state-of-the-art performance in unsupervised saliency detection and object discovery tasks, but also sets a new benchmark in unsupervised camouflaged object segmentation. The source code will be publicly available at https://github.com/Rogersiy/SelfLoc.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the International Conference on Computer Vision (ICCV) (2021)
Chen, T., Mai, Z., Li, R., lun Chao, W.: Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation (2023)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
Fan, D.P., Cheng, M.M., Liu, Y., Li, T., Borji, A.: Structure-measure: a new way to evaluate foreground maps. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4548–4557 (2017)
Fan, D.P., Ji, G.P., Sun, G., Cheng, M.M., Shen, J., Shao, L.: Camouflaged object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2777–2787 (2020)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners (2021). arXiv:2111.06377
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2023)
Le, T.N., Nguyen, T.V., Nie, Z., Tran, M.T., Sugimoto, A.: Anabranch network for camouflaged object segmentation. Comput. Vis. Image Underst. 184, 45–56 (2019)
Li, X., Lin, C.C., Chen, Y., Liu, Z., Wang, J., Raj, B.: Paintseg: training-free segmentation via painting (2023). arXiv:2305.19406
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer (2014)
Lv, Y., Zhang, J., Barnes, N., Dai, Y.: Weakly-supervised contrastive learning for unsupervised object discovery (2023). arXiv:2307.03376
Lv, Y., Zhang, J., Dai, Y., Li, A., Liu, B., Barnes, N., Fan, D.P.: Simultaneously localize, segment and rank the camouflaged objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11591–11601 (2021)
Margolin, R., Zelnik-Manor, L., Tal, A.: How to evaluate foreground maps? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2014)
Melas-Kyriazi, L., Rupprecht, C., Laina, I., Vedaldi, A.: Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8364–8375 (2022)
Nguyen, T., Dax, M., Mummadi, C.K., Ngo, N., Nguyen, T.H.P., Lou, Z., Brox, T.: Deepusps: deep robust unsupervised saliency prediction via self-supervision. Adv. Neural Inf. Process. Syst. 32 (2019)
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252 (2015)
Seitzer, M., Horn, M., Zadaianchuk, A., Zietlow, D., Xiao, T., Simon-Gabriel, C.J., He, T., Zhang, Z., Schölkopf, B., Brox, T., Locatello, F.: Bridging the gap to real-world object-centric learning. In: ICLR 2023 (2023)
Shi, J., Yan, Q., Xu, L., Jia, J.: Hierarchical image saliency detection on extended CSSD. IEEE Trans. Pattern Anal. Mach. Intell. 38(4), 717–729 (2015)
Shin, G., Albanie, S., Xie, W.: Unsupervised salient object detection with spectral cluster voting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3971–3980 (2022)
Siméoni, O., Puy, G., Vo, H.V., Roburin, S., Gidaris, S., Bursuc, A., Pérez, P., Marlet, R., Ponce, J.: Localizing objects with self-supervised transformers and no labels (2021). arXiv:2109.14279
Siméoni, O., Sekkat, C., Puy, G., Vobecky, A., Zablocki, ., Pérez, P.: Unsupervised object localization: Observing the background to discover objects. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR (2023)
Skurowski, P., Abdulameer, H., Błaszczyk, J., Depta, T., Kornacki, A., Kozieł, P.: Animal camouflage analysis: Chameleon database (2018), unpublished manuscript
Uijlings, J.R., Van De Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. Int. J. Comput. Vision 104, 154–171 (2013)
Vo, H.V., Pérez, P., Ponce, J.: Toward unsupervised, multi-object discovery in large-scale image collections. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16, pp. 779–795. Springer (2020)
Vo, V.H., Sizikova, E., Schmid, C., Pérez, P., Ponce, J.: Large-scale unsupervised object discovery. Adv. Neural. Inf. Process. Syst. 34, 16764–16778 (2021)
Wang, L., Lu, H., Wang, Y., Feng, M., Wang, D., Yin, B., Ruan, X.: Learning to detect salient objects with image-level supervision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 136–145 (2017)
Wang, X., Yu, Z., De Mello, S., Kautz, J., Anandkumar, A., Shen, C., Alvarez, J.M.: FreeSOLO: learning to segment objects without annotations (2022). arXiv:2202.12181
Wang, Y., Shen, X., Hu, S.X., Yuan, Y., Crowley, J.L., Vaufreydaz, D.: Self-supervised transformers for unsupervised object discovery using normalized cut. In: Conference on Computer Vision and Pattern Recognition (2022)
Yan, Q., Xu, L., Shi, J., Jia, J.: Hierarchical saliency detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1155–1162 (2013)
Yang, C., Zhang, L., Lu, H., Ruan, X., Yang, M.H.: Saliency detection via graph-based manifold ranking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3166–3173 (2013)
Zhang, Y., Wu, C.: Unsupervised camouflaged object segmentation as domain adaptation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4334–4344 (2023)
Zhu, W., Liang, S., Wei, Y., Sun, J.: Saliency optimization from robust background detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2814–2821 (2014)
Acknowledgments
This work is supported in part by the Science and Technology Program of Qingdao (24-1-8-cspz-22-nsh).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, J., Wang, X., Li, C., Chen, L., Wang, S. (2025). SelfLoc: High Quality Unsupervised Object Localization with Self-Prompt SAM. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15042. Springer, Singapore. https://doi.org/10.1007/978-981-97-8858-3_35
Download citation
DOI: https://doi.org/10.1007/978-981-97-8858-3_35
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8857-6
Online ISBN: 978-981-97-8858-3
eBook Packages: Computer ScienceComputer Science (R0)