Abstract
Explainability and interpretability of deep neural networks have become of crucial importance over the years in Computer Vision, concurrently with the need to understand increasingly complex models. This necessity has fostered research on approaches that facilitate human comprehension of neural methods. In this work, we propose an explainable setting for visual navigation, in which an autonomous agent needs to explore an unseen indoor environment while portraying and explaining interesting scenes with natural language descriptions. We combine recent advances in ongoing research fields, employing an explainability method on images generated through agent-environment interaction. Our approach uses explainable maps to visualize model predictions and highlight the correlation between the observed entities and the generated words, to focus on prominent objects encountered during the environment exploration. The experimental section demonstrates that our approach can identify the regions of the images that the agent concentrates on to describe its point of view, improving explainability.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
References
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR (2018)
Anjomshoae, S., Najjar, A., Calvaresi, D., Främling, K.: Explainable agents and robots: results from a systematic literature review. In: AAMAS (2019)
Bigazzi, R., Cornia, M., Cascianelli, S., Baraldi, L., Cucchiara, R.: Embodied agents for efficient exploration and smart scene description. In: ICRA (2023)
Bigazzi, R., Landi, F., Cascianelli, S., Baraldi, L., Cornia, M., Cucchiara, R.: Focus on impact: indoor exploration with intrinsic motivation. RA-L 7(2), 2985–2992 (2022)
Bigazzi, R., Landi, F., Cornia, M., Cascianelli, S., Baraldi, L., Cucchiara, R.: Explore and explain: self-supervised navigation and recounting. In: ICPR (2020)
Bigazzi, R., Landi, F., Cornia, M., Cascianelli, S., Baraldi, L., Cucchiara, R.: Out of the box: embodied navigation in the real world. In: CAIP (2021)
Bolelli, F., Baraldi, L., Pollastri, F., Grana, C.: A hierarchical quasi-recurrent approach to video captioning. In: IPAS (2018)
Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., Efros, A.A.: Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355 (2018)
Chang, A., et al.: Matterport3D: learning from RGB-D data in indoor environments. In: 3DV (2017)
Chaplot, D.S., Gandhi, D., Gupta, S., Gupta, A., Salakhutdinov, R.: Learning to explore using active neural SLAM. In: ICLR (2019)
Chaplot, D.S., Gandhi, D.P., Gupta, A., Salakhutdinov, R.R.: Object goal navigation using goal-oriented semantic exploration. In: NeurIPS (2020)
Cornia, M., Baraldi, L., Cucchiara, R.: SMArT: training shallow memory-aware transformers for robotic explainability. In: ICRA (2020)
Cornia, M., Baraldi, L., Cucchiara, R.: Explaining transformer-based image captioning models: an empirical analysis. AI Commun. 35(2), 111–129 (2022)
Cornia, M., Baraldi, L., Fiameni, G., Cucchiara, R.: Universal captioner: inducing content-style separation in vision-and-language model training. arXiv preprint arXiv:2111.12727 (2022)
Cornia, M., Stefanini, M., Baraldi, L., Cucchiara, R.: Meshed-memory transformer for image captioning. In: CVPR (2020)
Hendricks, L.A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B., Darrell, T.: Generating visual explanations. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 3–19. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_1
Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: CLIPScore: a reference-free evaluation metric for image captioning. In: EMNLP (2021)
Huang, L., Wang, W., Chen, J., Wei, X.Y.: Attention on attention for image captioning. In: ICCV (2019)
Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015)
Kim, S.S., Meister, N., Ramaswamy, V.V., Fong, R., Russakovsky, O.: HIVE: evaluating the human interpretability of visual explanations. In: Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2016. LNCS, vol. 13672, pp. 280–298. Springer, Heidelberg (2022). https://doi.org/10.1007/978-3-031-19775-8_17
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Lovino, M., Bontempo, G., Cirrincione, G., Ficarra, E.: Multi-omics classification on kidney samples exploiting uncertainty-aware models. In: ICIC (2020)
Pathak, D., Agrawal, P., Efros, A.A., Darrell, T.: Curiosity-driven exploration by self-supervised prediction. In: ICML (2017)
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP (2014)
Poppi, S., Cornia, M., Baraldi, L., Cucchiara, R.: Revisiting the evaluation of class activation mapping for explainability: a novel metric and experimental analysis. In: CVPR Workshops (2021)
Poppi, S., Sarto, S., Cornia, M., Baraldi, L., Cucchiara, R.: Multi-class explainable unlearning for image classification via weight filtering. arXiv preprint arXiv:2304.02049 (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Ramakrishnan, S.K., Al-Halah, Z., Grauman, K.: Occupancy anticipation for efficient exploration and navigation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 400–418. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58558-7_24
Ramakrishnan, S.K., Jayaraman, D., Grauman, K.: An exploration of embodied visual exploration. IJCV 129, 1616–1649 (2021)
Sarto, S., Barraco, M., Cornia, M., Baraldi, L., Cucchiara, R.: Positive-augmented contrastive learning for image and video captioning evaluation. In: CVPR (2023)
Sarto, S., Cornia, M., Baraldi, L., Cucchiara, R.: Retrieval-augmented transformer for image captioning. In: CBMI (2022)
Savva, M., et al.: Habitat: a platform for embodied AI research. In: ICCV (2019)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: ICCV (2017)
Sennrich, R., Haddow, B., Birch, A.: neural machine translation of rare words with subword units. In: ACL (2016)
Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806 (2014)
Stefanini, M., Cornia, M., Baraldi, L., Cascianelli, S., Fiameni, G., Cucchiara, R.: From show to tell: a survey on deep learning-based image captioning. IEEE Trans. PAMI 45(1), 539–559 (2022)
Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: ICML (2017)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
Wang, H., et al.: Score-CAM: score-weighted visual explanations for convolutional neural networks. In: CVPR Workshops (2020)
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: ICML (2015)
You, Y., et al.: Large batch optimization for deep learning: training bert in 76 minutes. In: ICLR (2019)
Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_53
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: CVPR (2016)
Acknowledgements
This work has been supported by the “Fit for Medical Robotics” (Fit4MedRob) project, funded by the Italian Ministry of University and Research, and by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 955778 for project “Personalized Robotics as Service Oriented Applications” (PERSEO).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Poppi, S. et al. (2023). Towards Explainable Navigation and Recounting. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14233. Springer, Cham. https://doi.org/10.1007/978-3-031-43148-7_15
Download citation
DOI: https://doi.org/10.1007/978-3-031-43148-7_15
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43147-0
Online ISBN: 978-3-031-43148-7
eBook Packages: Computer ScienceComputer Science (R0)