
CNN-Based RGB-D Salient Object Detection: Learn, Select, and Fuse

Published: 01 July 2021

Abstract

The goal of this work is to present a systematic solution for RGB-D salient object detection that addresses three aspects within a unified framework: modal-specific representation learning, complementary cue selection, and cross-modal complement fusion. To learn discriminative modal-specific features, we propose a hierarchical cross-modal distillation scheme in which the progressive predictions from the well-learned source modality supervise the learning of feature hierarchies and inference in the new modality. To better select complementary cues, we formulate a residual function that adaptively incorporates complements from the paired modality. Furthermore, a top-down fusion structure is constructed for sufficient cross-modal, cross-level interaction. The experimental results demonstrate the effectiveness of the proposed cross-modal distillation scheme in learning from a new modality, the advantages of the proposed multi-modal fusion pattern in selecting and fusing cross-modal complements, and the generalization of the proposed designs to different tasks.
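The two core ideas in the abstract — level-wise distillation from a well-learned source modality, and residual selection of cross-modal complements — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's actual architecture: the function names, the sigmoid side outputs, and the ReLU projection are assumptions introduced here for clarity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def distillation_loss(teacher_logits, student_logits):
    """Cross-entropy between the teacher's (source-modality) side-output
    prediction and the student's (new-modality) prediction at one level.
    The teacher's soft prediction acts as the pseudo-label."""
    t = sigmoid(teacher_logits)
    s = sigmoid(student_logits)
    eps = 1e-7  # numerical stability for log
    return float(-np.mean(t * np.log(s + eps) + (1 - t) * np.log(1 - s + eps)))

def hierarchical_distillation(teacher_levels, student_levels):
    """Hierarchical scheme: sum the per-level distillation losses so that
    every stage of the student's feature hierarchy is supervised."""
    return sum(distillation_loss(t, s)
               for t, s in zip(teacher_levels, student_levels))

def residual_complement_fusion(rgb_feat, depth_feat, w):
    """Residual selection: the paired modality contributes only a learned
    residual on top of the reference modality's features, so the fusion
    adaptively picks up complements rather than overwriting the base cue.
    The ReLU projection `w` is a hypothetical stand-in for the learned map."""
    residual = np.maximum(0.0, depth_feat @ w)
    return rgb_feat + residual
```

In a real model the logits would come from convolutional side outputs at each decoder level, and `w` would be a learned convolution; the sketch only shows how the losses and the residual fusion compose.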






Published In

International Journal of Computer Vision  Volume 129, Issue 7
Jul 2021
279 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 July 2021
Accepted: 28 February 2021
Received: 17 January 2020

Author Tags

  1. RGB-D
  2. Salient object detection
  3. Convolutional neural network
  4. Cross-modal distillation

Qualifiers

  • Research-article

Funding Sources

  • Research Grants Council of Hong Kong



Cited By

  • (2024) Disentangled Cross-Modal Transformer for RGB-D Salient Object Detection and Beyond. IEEE Transactions on Image Processing, 33, 1699–1709. https://doi.org/10.1109/TIP.2024.3364022
  • (2024) MutualFormer: Multi-modal Representation Learning via Cross-Diffusion Attention. International Journal of Computer Vision, 132(9), 3867–3888. https://doi.org/10.1007/s11263-024-02067-x
  • (2023) LeNo. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2537–2545. https://doi.org/10.1609/aaai.v37i2.25351
  • (2022) MoADNet: Mobile Asymmetric Dual-Stream Networks for Real-Time and Lightweight RGB-D Salient Object Detection. IEEE Transactions on Circuits and Systems for Video Technology, 32(11), 7632–7645. https://doi.org/10.1109/TCSVT.2022.3180274
