Research article · DOI: 10.1145/3591106.3592272

Knowledge-Aware Causal Inference Network for Visual Dialog

Published: 12 June 2023

Abstract

Effective knowledge and interaction across modalities are key to Visual Dialog. Classic graph-based frameworks, which keep a direct connection between the dialog history and the answer, often fail to give the right answer because of the spurious guidance and strong bias induced by the history. Recent causal inference frameworks remove this direct connection and improve generalization, but at the cost of accuracy. In this work, we propose a novel Knowledge-Aware Causal Inference framework (KACI-Net), which introduces commonsense knowledge into the causal inference framework to achieve both high accuracy and good generalization. Specifically, commonsense knowledge is first generated from the entities extracted from the question and then fused with the language and visual features via co-attention to produce the final answer. Comparisons with a knowledge-unaware framework and a graph-based knowledge-aware framework on the VisDial v1.0 dataset show the superiority of the proposed framework and verify the effectiveness of commonsense knowledge for reasoning in Visual Dialog. High scores on both NDCG and MRR indicate a good trade-off between accuracy and generalization.
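As a rough illustration of the pipeline the abstract describes (entity extraction from the question, commonsense lookup, and co-attention fusion of knowledge, language, and visual features), the following Python sketch shows one plausible shape of such a model. It is not the authors' implementation: the toy entity extractor, the lookup_commonsense knowledge source, the CoAttentionFusion module, and all feature dimensions are hypothetical placeholders.

```python
# Minimal sketch (PyTorch >= 1.9, Python >= 3.9) of a knowledge-aware
# fusion step: question tokens attend to commonsense facts and to image
# regions, and the fused vector is used to rank answer candidates.
import torch
import torch.nn as nn


def extract_entities(question: str) -> list[str]:
    # Placeholder entity extraction; a real system would use a POS tagger
    # or NER model. Here we simply drop a few function words.
    stopwords = {"is", "the", "a", "an", "what", "are", "on", "in", "with"}
    return [w.strip("?") for w in question.lower().split() if w not in stopwords]


def lookup_commonsense(entities: list[str]) -> list[str]:
    # Hypothetical knowledge lookup (e.g. against a resource like ConceptNet);
    # returns one commonsense fact per recognized entity as plain text.
    kb = {
        "dog": "a dog is a domesticated animal",
        "ball": "a ball is used for play",
    }
    return [kb[e] for e in entities if e in kb]


class CoAttentionFusion(nn.Module):
    """Fuse question, knowledge, and visual features via cross-attention."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.q_to_k = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q_to_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, q_feat, k_feat, v_feat):
        # Question tokens attend to knowledge facts and to region features.
        attended_k, _ = self.q_to_k(q_feat, k_feat, k_feat)
        attended_v, _ = self.q_to_v(q_feat, v_feat, v_feat)
        fused = torch.cat([attended_k, attended_v], dim=-1)
        # Pool over the question length to get one vector for answer ranking.
        return self.out(fused).mean(dim=1)


if __name__ == "__main__":
    entities = extract_entities("What is the dog playing with?")
    facts = lookup_commonsense(entities)  # e.g. ["a dog is a domesticated animal"]
    # Stand-in encodings: real systems would encode text with GloVe/LSTM or
    # BERT and the image with Faster R-CNN region features.
    q_feat = torch.randn(1, 12, 512)                # question tokens
    k_feat = torch.randn(1, max(len(facts), 1), 512)  # knowledge facts
    v_feat = torch.randn(1, 36, 512)                # image regions
    answer_vec = CoAttentionFusion()(q_feat, k_feat, v_feat)
    print(answer_vec.shape)  # torch.Size([1, 512]), scored against candidates
```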


Cited By

  • (2024) Caption-Aware Multimodal Relation Extraction with Mutual Information Maximization. Proceedings of the 32nd ACM International Conference on Multimedia, 1148-1157. https://doi.org/10.1145/3664647.3681219. Online publication date: 28-Oct-2024.
  • (2024) Joint-Motion Mutual Learning for Pose Estimation in Video. Proceedings of the 32nd ACM International Conference on Multimedia, 8962-8971. https://doi.org/10.1145/3664647.3681179. Online publication date: 28-Oct-2024.
  • (2024) Infer unseen from seen. Journal of Visual Communication and Image Representation 97, C. https://doi.org/10.1016/j.jvcir.2023.103961. Online publication date: 27-Feb-2024.


    Information

    Published In

    ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval
    June 2023
    694 pages
    ISBN:9798400701788
    DOI:10.1145/3591106

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 12 June 2023


    Author Tags

    1. Causality
    2. Commonsense Knowledge
    3. Visual Dialog

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICMR '23

    Acceptance Rates

    Overall Acceptance Rate 254 of 830 submissions, 31%


    Article Metrics

    • Downloads (last 12 months): 75
    • Downloads (last 6 weeks): 5
    Reflects downloads up to 01 Jan 2025.

