Abstract
Visual affordance detection aims to understand the functional attributes of objects, which is crucial for robots performing interactive tasks. Most existing affordance detection methods rely mainly on global image features and do not fully exploit the features of locally relevant objects in the image, which often leads to suboptimal detection accuracy under the interference of cluttered backgrounds and neighbouring objects. Numerous studies have shown that the accuracy of affordance detection largely depends on the quality of the extracted image features. In this paper, we propose a novel affordance detection network with object shape mask guided feature encoders. The masks act as an attention mechanism that forces the network to focus on the shape regions of target objects in the image, which helps it obtain high-quality features. Specifically, we first propose a shape mask guided encoder, which uses masks to effectively locate all target objects and thus extract more expressive features. Building on this encoder, we then propose a dual enhance feature aggregation module consisting of two branches: the first encodes the global features of the original image, while the second locates each locally relevant object and encodes its precise features. Aggregating these features enhances the representation of each object, further improving feature quality and suppressing interference. Quantitative and qualitative evaluations against state-of-the-art methods demonstrate that the proposed method achieves superior performance on two commonly used affordance detection datasets.
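The two ideas sketched in the abstract — a shape mask used as multiplicative attention over a feature map, and the aggregation of a global branch with mask-gated local-object branches — can be illustrated with a minimal numpy sketch. This is not the authors' implementation; the function names, the elementwise gating, and the additive aggregation are simplifying assumptions made for illustration only.

```python
import numpy as np

def shape_mask_gate(features: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Gate a C x H x W feature map with a binary H x W object-shape mask.

    The mask plays the role of spatial attention: activations outside the
    object's shape region are suppressed to zero. (Illustrative only.)
    """
    return features * mask[None, :, :]  # broadcast mask over channels

def dual_branch_aggregate(global_feat: np.ndarray,
                          object_masks: list) -> np.ndarray:
    """Combine a global branch with per-object mask-gated local branches.

    Each local branch re-encodes the responses inside one object's shape
    region; summing them onto the global features strengthens object
    regions relative to background clutter. (Illustrative only.)
    """
    local = sum(shape_mask_gate(global_feat, m) for m in object_masks)
    return global_feat + local

# Toy demo: a 4-channel 8x8 feature map and one object occupying
# the top-left quadrant of the image.
feat = np.ones((4, 8, 8))
mask = np.zeros((8, 8))
mask[:4, :4] = 1.0
agg = dual_branch_aggregate(feat, [mask])
# Responses inside the object region are amplified; background is unchanged.
```

Under this toy setup the aggregated response inside the masked region is twice the background response, which mirrors (in a crude additive way) how the second branch is meant to enhance each object's representation while leaving interference suppressed.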
Data availability
The processed data required to reproduce these findings cannot be shared at this time as the data also forms part of an ongoing study.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 62172022 and U21B2038, and in part by the Beijing Outstanding Young Scientists Project under Grant BJJWZYJH01201910005018.
Ethics declarations
Conflicts of interest
The authors declare that they have no competing financial interests in the subject matter or materials discussed in this paper.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chen, D., Kong, D., Li, J. et al. ADOSMNet: a novel visual affordance detection network with object shape mask guided feature encoders. Multimed Tools Appl 83, 31629–31653 (2024). https://doi.org/10.1007/s11042-023-16898-2