DOI: 10.1145/3331184.3331226

User Attention-guided Multimodal Dialog Systems

Published: 18 July 2019

Abstract

As an intelligent way to interact with computers, dialog systems have attracted increasing attention. However, most research efforts focus only on text-based dialog systems, ignoring the rich semantics conveyed by visual cues. In fact, the demand for multimodal task-oriented dialog systems is growing rapidly in domains such as online retailing and travel. Moreover, little work explicitly considers the hierarchical product taxonomy or the users' attention to products, even though users tend to express their attention to semantic attributes of products, such as color and style, as the dialog goes on. Towards this end, we present a hierarchical User attention-guided Multimodal Dialog system, UMD for short. At the high level, UMD leverages a bidirectional Recurrent Neural Network to model the ongoing dialog between the user and the chatbot; at the low level, a multimodal encoder and a multimodal decoder encode multimodal utterances and generate multimodal responses, respectively. The multimodal encoder learns visual representations of images with the help of a taxonomy-attribute combined tree, and the visual features then interact with textual features through an attention mechanism, whereas the multimodal decoder selects the required images and generates textual responses according to the dialog history. To evaluate the proposed model, we conduct extensive experiments on a public multimodal dialog dataset in the retailing domain. Experimental results demonstrate that our model outperforms existing state-of-the-art methods by integrating multimodal utterances and encoding visual features based on the users' attribute-level attention.
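
To make the low-level encoder described above more concrete, here is a minimal, hypothetical sketch of the step in which a textual utterance summary attends over attribute-level visual features. It assumes a PyTorch implementation; the class name UtteranceEncoder, all dimensions, and the choice of additive (Bahdanau-style) attention are illustrative assumptions, not the authors' code. The taxonomy-attribute combined tree, the high-level context RNN, and the multimodal decoder are omitted.

# Minimal sketch (assumptions, not the authors' implementation): one multimodal
# utterance is encoded by fusing a text summary with attended visual attribute
# features, mirroring the attribute-level attention described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UtteranceEncoder(nn.Module):
    """Encodes one multimodal utterance: token ids plus per-attribute image
    features, fused via additive attention (all sizes are illustrative)."""

    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256, visual_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.text_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # project visual attribute features and the text summary into a shared space
        self.vis_proj = nn.Linear(visual_dim, hidden_dim)
        self.txt_proj = nn.Linear(2 * hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.out = nn.Linear(2 * hidden_dim + visual_dim, hidden_dim)

    def forward(self, tokens, visual_feats):
        # tokens: (batch, seq_len); visual_feats: (batch, n_attributes, visual_dim)
        txt_states, _ = self.text_rnn(self.embed(tokens))        # (B, T, 2H)
        txt_summary = txt_states.mean(dim=1)                     # (B, 2H)
        # additive attention of the text summary over visual attribute features
        query = self.txt_proj(txt_summary).unsqueeze(1)          # (B, 1, H)
        keys = self.vis_proj(visual_feats)                       # (B, A, H)
        scores = self.att_score(torch.tanh(query + keys)).squeeze(-1)  # (B, A)
        weights = F.softmax(scores, dim=-1)                      # attribute-level attention
        attended_vis = (weights.unsqueeze(-1) * visual_feats).sum(dim=1)  # (B, visual_dim)
        return torch.tanh(self.out(torch.cat([txt_summary, attended_vis], dim=-1)))


# Usage: a context-level bidirectional GRU would then run over the per-utterance
# vectors produced here, mirroring the high-level dialog modeling in the abstract.
if __name__ == "__main__":
    enc = UtteranceEncoder()
    utt_vec = enc(torch.randint(0, 5000, (2, 12)), torch.randn(2, 6, 256))
    print(utt_vec.shape)  # torch.Size([2, 256])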

Supplementary Material

MP4 File (cite2-13h50-d2.mp4)

Information

Published In

SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2019
1512 pages
ISBN:9781450361729
DOI:10.1145/3331184
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 July 2019

Author Tags

  1. multimodal dialog systems
  2. multimodal response generation
  3. multimodal utterance encoder
  4. taxonomy-attribute combined tree

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • Project of Thousand Youth Talents 2016
  • Tencent AI Lab Rhino-Bird Joint Research Program

Conference

SIGIR '19

Acceptance Rates

SIGIR '19 Paper Acceptance Rate: 84 of 426 submissions, 20%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%

Article Metrics

  • Downloads (last 12 months): 65
  • Downloads (last 6 weeks): 7
Reflects downloads up to 11 Dec 2024

Cited By

  • (2024) Sample Efficiency Matters: Training Multimodal Conversational Recommendation Systems in a Small Data Setting. Proceedings of the 32nd ACM International Conference on Multimedia, 2223-2232. DOI: 10.1145/3664647.3681217 (28-Oct-2024)
  • (2024) Engaging Live Video Comments Generation. Proceedings of the 32nd ACM International Conference on Multimedia, 8034-8042. DOI: 10.1145/3664647.3681195 (28-Oct-2024)
  • (2024) Multimodal Dialogue Systems via Capturing Context-aware Dependencies and Ordinal Information of Semantic Elements. ACM Transactions on Intelligent Systems and Technology, 15(3), 1-25. DOI: 10.1145/3645099 (12-Mar-2024)
  • (2024) MulmQA: Multimodal Question Answering for Database Alarm. 2024 5th Information Communication Technologies Conference (ICTC), 291-296. DOI: 10.1109/ICTC61510.2024.10602092 (10-May-2024)
  • (2023) Intelligent Computing: The Latest Advances, Challenges, and Future. Intelligent Computing, 2. DOI: 10.34133/icomputing.0006 (30-Jan-2023)
  • (2023) Multimodal Dialog Systems with Dual Knowledge-enhanced Generative Pretrained Language Model. ACM Transactions on Information Systems, 42(2), 1-25. DOI: 10.1145/3606368 (6-Oct-2023)
  • (2023) Enhancing Product Representation with Multi-form Interactions for Multimodal Conversational Recommendation. Proceedings of the 31st ACM International Conference on Multimedia, 6491-6500. DOI: 10.1145/3581783.3613755 (26-Oct-2023)
  • (2023) MaTCR: Modality-Aligned Thought Chain Reasoning for Multimodal Task-Oriented Dialogue Generation. Proceedings of the 31st ACM International Conference on Multimedia, 5776-5785. DOI: 10.1145/3581783.3612268 (26-Oct-2023)
  • (2023) Dual Semantic Knowledge Composed Multimodal Dialog Systems. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1518-1527. DOI: 10.1145/3539618.3591673 (19-Jul-2023)
  • (2023) End-to-End Dialogue Generation Using a Single Encoder and a Decoder Cascade With a Multidimension Attention Mechanism. IEEE Transactions on Neural Networks and Learning Systems, 34(11), 8482-8492. DOI: 10.1109/TNNLS.2022.3151347 (Nov-2023)