research-article

Fostering Coopetition While Plugging Leaks: The Design and Implementation of the MS MARCO Leaderboards

Authors:

Emine Yilmaz

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 2939 - 2948

https://doi.org/10.1145/3477495.3531725

Published: 07 July 2022

Abstract

We articulate the design and implementation of the MS MARCO document ranking and passage ranking leaderboards. In contrast to "standard" community-wide evaluations such as those at TREC, which can be characterized as simultaneous games, leaderboards represent sequential games, where every player move is immediately visible to the entire community. The fundamental challenge with this setup is that every leaderboard submission leaks information about the held-out evaluation set, which conflicts with the fundamental tenet in machine learning about separation of training and test data. These "leaks", accumulated over long periods of time, threaten the validity of the insights that can be derived from the leaderboards. In this paper, we share our experiences grappling with this issue over the past few years and how our considerations are operationalized into a coherent submission policy. Our work provides a useful guide to help the community understand the design choices made in the popular MS MARCO leaderboards and offers lessons for designers of future leaderboards.

Cited By

Zhang X, Thakur N, Ogundepo O, Kamalloo E, Alfonso-Hermelo D, Li X, Liu Q, Rezagholizadeh M, Lin J (2023). MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages. Transactions of the Association for Computational Linguistics 11, 1114-1131. Online publication date: 1-Sep-2023.
https://doi.org/10.1162/tacl_a_00595

Fröbe M, Reimer J, MacAvaney S, Deckers N, Reich S, Bevendorff J, Stein B, Hagen M, Potthast M, Chen H, Duh W, Huang H, Kato M, Mothe J, Poblete B (2023). The Information Retrieval Experiment Platform. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2826-2836. Online publication date: 19-Jul-2023.
https://dl.acm.org/doi/10.1145/3539618.3591888

Pourbahman Z, Momtazi S, Bagheri A (2023). Deep neural ranking model using distributed smoothing. Expert Systems with Applications: An International Journal 224:C. Online publication date: 15-Aug-2023.
https://dl.acm.org/doi/10.1016/j.eswa.2023.119913

Index Terms

Fostering Coopetition While Plugging Leaks: The Design and Implementation of the MS MARCO Leaderboards
1. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
      1. Retrieval effectiveness

Published In

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 2022

3569 pages

ISBN:9781450387323

DOI:10.1145/3477495

General Chairs: Enrique Amigo (UNED), Pablo Castells (UAM and Amazon), Julio Gonzalo (UNED)

Program Chairs: Ben Carterette (Spotify), J. Shane Culpepper (RMIT University), Gabriella Kazai (Waseda University)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGIR '22

Sponsor:

SIGIR

SIGIR '22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 11 - 15, 2022

Madrid, Spain

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics

Total Citations: 3
Total Downloads: 125
Downloads (Last 12 months): 17
Downloads (Last 6 weeks): 1

Reflects downloads up to 15 Jan 2025
