
Data-Driven Evaluation Metrics for Heterogeneous Search Engine Result Pages

Published: 14 March 2020. DOI: 10.1145/3343413.3377959

Abstract

Evaluation metrics for search typically assume items are homogeneous. However, in the context of web search, this assumption does not hold. Modern search engine result pages (SERPs) are composed of a variety of item types (e.g., news, web, entity, etc.), and their influence on browsing behavior is largely unknown.
In this paper, we perform a large-scale empirical analysis of popular web search queries and investigate how different item types influence how people interact on SERPs. We then infer a user browsing model from people's interactions with SERP items, creating a data-driven metric based on item type. We show that the proposed metric leads to more accurate estimates of: (1) total gain, (2) total time spent, and (3) stopping depth, without requiring extensive parameter tuning or a priori relevance information. These results suggest that item heterogeneity should be accounted for when developing metrics for SERPs. While many open questions remain concerning the applicability and generalizability of data-driven metrics, they serve as a formal mechanism to link observed user behaviors directly to how performance is measured. From this approach, we can draw new insights into the relationship between behavior and performance, and design metrics grounded in real user behavior rather than in a hypothesized model of user browsing.
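
To make the idea concrete, below is a minimal sketch of the style of metric the abstract describes: a C/W/L-style user model in which the probability that a user continues past a result depends on that result's item type, with type-specific continuation probabilities estimated from interaction logs. The type inventory, probabilities, and gains below are hypothetical placeholders, not the paper's estimated values, and the estimation step itself is omitted.

    # Sketch of a type-aware C/W/L-style metric. The per-type continuation
    # probabilities are hypothetical placeholders; in a data-driven setting
    # they would be estimated from logged user interactions.
    CONTINUATION = {"web": 0.85, "news": 0.70, "entity": 0.55, "ad": 0.40}

    def examine_weights(item_types):
        """W(i): probability the user examines rank i, where
        W(1) = 1 and W(i+1) = W(i) * C(type at rank i)."""
        w, weights = 1.0, []
        for t in item_types:
            weights.append(w)
            w *= CONTINUATION[t]
        return weights

    def expected_total_gain(item_types, gains):
        """Expected total gain: sum over ranks of W(i) * gain(i)."""
        return sum(w * g for w, g in zip(examine_weights(item_types), gains))

    def expected_stopping_depth(item_types):
        """Expected deepest rank examined: the user stops at rank i
        with probability W(i) * (1 - C(i)), and stops at the final
        rank with the remaining probability mass W(n)."""
        ws = examine_weights(item_types)
        cs = [CONTINUATION[t] for t in item_types]
        depth = 0.0
        for i, (w, c) in enumerate(zip(ws, cs), start=1):
            depth += i * (w * (1 - c) if i < len(ws) else w)
        return depth

    # Example: a five-item SERP mixing item types, with hypothetical gains.
    serp = ["news", "web", "web", "entity", "ad"]
    gains = [1.0, 0.5, 0.0, 1.0, 0.0]
    print(expected_total_gain(serp, gains))   # expected total gain
    print(expected_stopping_depth(serp))      # expected stopping depth

Under the C/W/L framework, a constant continuation probability at every rank recovers rank-biased precision; letting the probability vary with item type is what makes such a metric sensitive to SERP heterogeneity.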

Published In

CHIIR '20: Proceedings of the 2020 Conference on Human Information Interaction and Retrieval
March 2020
596 pages
ISBN: 9781450368926
DOI: 10.1145/3343413
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 March 2020

Author Tags

  1. evaluation
  2. information retrieval
  3. measures
  4. metrics

Qualifiers

  • Research-article

Conference

CHIIR '20

Acceptance Rates

Overall Acceptance Rate: 55 of 163 submissions, 34%

Cited By

  • (2024) Probabilistic graph model and neural network perspective of click models for web search. Knowledge and Information Systems, 66(10), 5829-5873. DOI: 10.1007/s10115-024-02145-z
  • (2023) Query sampler: generating query sets for analyzing search engines using keyword research tools. PeerJ Computer Science, 9, e1421. DOI: 10.7717/peerj-cs.1421
  • (2023) Desktop Search Engines. In Protecting User Privacy in Web Search Utilization, 63-96. DOI: 10.4018/978-1-6684-6914-9.ch004
  • (2023) Practice and Challenges in Building a Business-oriented Search Engine Quality Metric. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3295-3299. DOI: 10.1145/3539618.3591841
  • (2022) The Dark Side of Relevance: The Effect of Non-Relevant Results on Search Behavior. Proceedings of the 2022 Conference on Human Information Interaction and Retrieval, 1-11. DOI: 10.1145/3498366.3505770
  • (2022) A Flexible Framework for Offline Effectiveness Metrics. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 578-587. DOI: 10.1145/3477495.3531924
  • (2021) Who does SEO in Spain? A cybermetric methodology for the construction of company universes. El Profesional de la información. DOI: 10.3145/epi.2021.jul.19
  • (2021) ERR is not C/W/L: Exploring the Relationship Between Expected Reciprocal Rank and Other Metrics. Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, 231-237. DOI: 10.1145/3471158.3472239
  • (2021) User Models, Metrics and Measures of Search. Proceedings of the 2021 Conference on Human Information Interaction and Retrieval, 347-348. DOI: 10.1145/3406522.3446049
  • (2021) Modeling search and session effectiveness. Information Processing and Management, 58(4). DOI: 10.1016/j.ipm.2021.102601
