
Overview of the seventh Dialog System Technology Challenge: DSTC7

Published: 01 July 2020

Highlights

DSTC7: a dialog challenge to build more robust and accurate end-to-end dialog systems.
Track 1, Sentence selection for multiple domains, including variants with a large number of candidate options and with candidate sets containing zero, one, or multiple correct options.
Track 2, Beyond Chitchat: Generation of informational responses grounded in external knowledge.
Track 3, Audio visual scene-aware dialog systems to allow dynamic conversations about objects and events around users.

Abstract

This paper provides detailed information about the seventh Dialog System Technology Challenge (DSTC7) and its three tracks, which are aimed at exploring the problem of building robust and accurate end-to-end dialog systems. In more detail, DSTC7 focuses on developing and exploring end-to-end technologies for the following three pragmatic challenges: (1) sentence selection for multiple domains, (2) generation of informational responses grounded in external knowledge, and (3) audio visual scene-aware dialog to allow conversations with users about objects and events around them.
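To make the Track 1 setting concrete, below is a minimal sketch, in PyTorch, of the kind of candidate scorer sentence selection calls for: the dialog context and each candidate sentence are encoded separately, and candidates are ranked by their similarity to the context. This is an illustrative assumption, not any participant's system; all class names, dimensions, and the dot-product scoring choice are hypothetical.

```python
import torch
import torch.nn as nn

class LSTMSelector(nn.Module):
    """Score each candidate response against the dialog context (sketch)."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.context_enc = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.candidate_enc = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, context, candidates):
        # context: (B, Tc) token ids; candidates: (B, N, Tr) token ids
        _, (ctx_h, _) = self.context_enc(self.embed(context))        # (1, B, H)
        B, N, Tr = candidates.shape
        _, (cand_h, _) = self.candidate_enc(
            self.embed(candidates.view(B * N, Tr)))                  # (1, B*N, H)
        cand_vecs = cand_h.view(B, N, -1)                            # (B, N, H)
        # Dot-product similarity between context and each candidate;
        # the zero-correct-answer variant can be handled by thresholding
        # the top score (an assumption, not the tracks' prescribed method).
        return torch.bmm(cand_vecs, ctx_h.squeeze(0).unsqueeze(2)).squeeze(2)

model = LSTMSelector(vocab_size=10_000)
scores = model(torch.randint(0, 10_000, (2, 30)),
               torch.randint(0, 10_000, (2, 100, 15)))
print(scores.shape)  # torch.Size([2, 100]) -- one score per candidate
```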
This paper summarizes the overall setup and results of DSTC7, including detailed descriptions of the different tracks, the provided datasets and annotations, and an overview of the submitted systems and their final results. For Track 1, LSTM-based models performed best across both datasets, effectively handling task variants in which no correct answer was present or multiple paraphrases were included. For Track 2, the best results were obtained by RNN-based architectures augmented to incorporate facts through two types of encoders, a dialog encoder and a fact encoder, combined with attention mechanisms and a pointer-generator approach. Finally, for Track 3, the best model used hierarchical attention mechanisms to combine the text and vision information, obtaining a human rating score 22% better than that of the baseline LSTM system.
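As a rough illustration of the Track 2 recipe described above, the following sketch pairs a dialog encoder with a fact encoder and lets the decoder attend over the encoded facts at each generation step. The pointer-generator copy distribution used by the best systems is omitted for brevity, and every module name and dimension here is an assumption rather than a description of any submitted architecture.

```python
import torch
import torch.nn as nn

class KnowledgeGroundedSeq2Seq(nn.Module):
    """Dialog encoder + fact encoder with fact attention (hypothetical sketch)."""
    def __init__(self, vocab_size, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.dialog_enc = nn.GRU(emb_dim, hidden, batch_first=True)
        self.fact_enc = nn.GRU(emb_dim, hidden, batch_first=True)
        self.decoder = nn.GRUCell(emb_dim + hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, dialog, facts, targets):
        # dialog: (B, Td); facts: (B, Nf, Tf); targets: (B, Tt), teacher forcing
        _, d_h = self.dialog_enc(self.embed(dialog))            # (1, B, H)
        B, Nf, Tf = facts.shape
        _, f_h = self.fact_enc(self.embed(facts.view(B * Nf, Tf)))
        fact_vecs = f_h.view(B, Nf, -1)                         # (B, Nf, H)

        h = d_h.squeeze(0)  # initialize decoder from the dialog encoding
        logits = []
        for t in range(targets.size(1)):
            # attention over fact vectors, conditioned on the decoder state
            attn = torch.softmax((fact_vecs @ h.unsqueeze(2)).squeeze(2), dim=1)
            fact_ctx = (attn.unsqueeze(2) * fact_vecs).sum(1)   # (B, H)
            h = self.decoder(torch.cat([self.embed(targets[:, t]), fact_ctx], 1), h)
            logits.append(self.out(h))
        return torch.stack(logits, 1)                           # (B, Tt, vocab)
```

A pointer-generator extension would mix this vocabulary distribution with a copy distribution over fact tokens, letting the model reproduce rare entities verbatim from the external knowledge.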
More than 220 participants registered, and about 40 teams took part in the final challenge. Thirty-two scientific papers reporting the systems submitted to DSTC7, together with 3 general technical papers on dialog technologies, were presented during the one-day wrap-up workshop at AAAI-19. During the workshop, we reviewed the state-of-the-art systems, shared novel approaches to the DSTC7 tasks, and discussed future directions for the challenge (DSTC8).


Published In

Computer Speech and Language, Volume 62, Issue C (July 2020), 139 pages

Publisher

Academic Press Ltd., United Kingdom

Author Tags

  1. Dialog System Technology Challenge
  2. end-to-end dialog systems
  3. Sentence Selection
  4. Natural Language Generation
  5. Audio Visual Scene-Aware Dialog
