DOI: 10.1145/3331184.3331308
short-paper

Evaluating Variable-Length Multiple-Option Lists in Chatbots and Mobile Search

Published: 18 July 2019

Abstract

In recent years, the proliferation of smart mobile devices has led to the gradual integration of search functionality within mobile platforms. This has created an incentive to move away from the "ten blue links" metaphor, as mobile users are less likely to click on the links, expecting to get the answer directly from the snippets. In turn, this has revived interest in Question Answering. Then, along came chatbots, conversational systems, and messaging platforms, where the user's needs could be better served by the system asking follow-up questions in order to better understand the user's intent. While a user would typically expect a single response to any utterance, a system could also return multiple options for the user to select from, based on different system interpretations of the user's intent. However, this possibility should not be overused, as the practice could confuse and/or annoy the user. How to produce good variable-length lists, given the conflicting objectives of staying short while maximizing the likelihood of including a correct answer in the list, is an underexplored problem. It is also unclear how to evaluate a system that tries to do this. Here we aim to bridge this gap. In particular, we define some necessary and some optional properties that an evaluation measure fit for this purpose should have. We further show that existing evaluation measures from the IR tradition are not entirely suitable for this setup, and we propose novel evaluation measures that address it satisfactorily.
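To make the trade-off described above concrete, the following toy sketch (in Python) scores a variable-length list of system interpretations by rewarding coverage of a correct intent while discounting longer lists. This is purely illustrative: the function name, the inputs, and the length-based discount are assumptions made here, not the measures proposed in the paper.

    def toy_option_list_score(options, correct_intents, discount=0.8):
        """Toy measure (NOT the paper's proposal): reward covering a correct
        intent, discounted by how many options the user is shown."""
        if not options:
            return 0.0
        hit = any(o in correct_intents for o in options)
        # A list with a correct intent scores higher the shorter it is.
        return (1.0 if hit else 0.0) * discount ** (len(options) - 1)

    # A 2-option list that covers the correct intent beats a 4-option one:
    print(toy_option_list_score(["book_flight", "book_hotel"], {"book_flight"}))   # 0.8
    print(toy_option_list_score(["a", "b", "c", "book_flight"], {"book_flight"}))  # ~0.512

Under this kind of scoring, an empty list or a list that misses the user's intent gets no credit, while every extra option shown carries a cost; the paper's contribution is to formalize which such properties a proper measure must satisfy.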




Published In

SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2019
1512 pages
ISBN: 9781450361729
DOI: 10.1145/3331184
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. chatbots
  2. evaluation measures
  3. mobile search

Qualifiers

  • Short-paper

Funding Sources

  • SIGIR Student Travel Grant

Conference

SIGIR '19

Acceptance Rates

SIGIR '19 paper acceptance rate: 84 of 426 submissions (20%)
Overall acceptance rate: 792 of 3,983 submissions (20%)

