DOI: 10.1145/3340531.3412779
research-article

ORCAS: 18 Million Clicked Query-Document Pairs for Analyzing Search

Published: 19 October 2020

Abstract

Users of Web search engines reveal their information needs through queries and clicks, making click logs a useful asset for information retrieval. However, click logs have not been publicly released for academic use, because they can be too revealing of personally or commercially sensitive information. This paper describes a click data release related to the TREC Deep Learning Track document corpus. After aggregation and filtering, including a k-anonymity requirement, we find that 1.4 million of the TREC DL URLs have 18 million connections to 10 million distinct queries. Our dataset of these queries and connections to TREC documents is of similar size to proprietary datasets used in previous papers on query mining and ranking. We perform some preliminary experiments using the click data to augment the TREC DL training data; by comparison with that training data, the click data offers 28x more queries, with 49x more connections to 4.4x more URLs in the corpus. We present a description of the dataset's generation process, characteristics, use in ranking, and other potential uses.

Supplementary Material

MP4 File (3340531.3412779.mp4)
Description of the ORCAS dataset: Open Resource for Click Analysis in Search. This is based on search log data, with aggregation, to identify query-URL pairs that were clicked by many users. The data can be used to improve search or for web mining such as finding related queries.
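
The aggregation idea described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the log format (user, query, URL triples), the lowercase normalization, the threshold `k`, and the function names `aggregate_clicks` and `related_queries` are all assumptions made for illustration. The sketch keeps only query-URL pairs clicked by at least `k` distinct users, then groups queries that share a clicked URL as a simple form of related-query mining.

```python
from collections import defaultdict


def aggregate_clicks(click_log, k):
    """Return the (query, url) pairs clicked by at least k distinct users.

    click_log is an iterable of (user_id, query, url) triples; this format
    and the threshold rule are illustrative assumptions, not the paper's.
    """
    users_per_pair = defaultdict(set)
    for user_id, query, url in click_log:
        # Light normalization so trivially different spellings aggregate together.
        users_per_pair[(query.strip().lower(), url)].add(user_id)
    return {pair for pair, users in users_per_pair.items() if len(users) >= k}


def related_queries(pairs):
    """Map each query to the other queries that share a clicked URL with it."""
    queries_per_url = defaultdict(set)
    for query, url in pairs:
        queries_per_url[url].add(query)
    related = defaultdict(set)
    for queries in queries_per_url.values():
        for q in queries:
            related[q] |= queries - {q}
    return dict(related)


# Toy log with made-up users, queries, and URLs.
log = [
    ("u1", "orcas dataset", "https://example.org/a"),
    ("u2", "ORCAS dataset", "https://example.org/a"),
    ("u3", "click data trec", "https://example.org/a"),
    ("u4", "click data trec", "https://example.org/a"),
    ("u5", "rare private query", "https://example.org/b"),
]
pairs = aggregate_clicks(log, k=2)
related = related_queries(pairs)
```

In this toy log the pair seen from a single user is dropped, while the two queries that lead several users to the same URL end up listed as related to each other.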

Published In

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management
October 2020
3619 pages
ISBN:9781450368599
DOI:10.1145/3340531
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. deep learning
  2. user behavior data
  3. web search

Qualifiers

  • Research-article

Conference

CIKM '20

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

  • Downloads (Last 12 months): 36
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 15 Jan 2025.

Cited By

  • (2024) SimIIR 3: A Framework for the Simulation of Interactive and Conversational Information Retrieval. Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 197-202. DOI: 10.1145/3673791.3698427. Online publication date: 8-Dec-2024
  • (2024) Optimizing Novelty of Top-k Recommendations using Large Language Models and Reinforcement Learning. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 5669-5679. DOI: 10.1145/3637528.3671618. Online publication date: 25-Aug-2024
  • (2024) Retrogressive Document Manipulation of US Federal Environmental Websites. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 3762-3766. DOI: 10.1145/3627673.3679988. Online publication date: 21-Oct-2024
  • (2024) CWRCzech: 100M Query-Document Czech Click Dataset and Its Application to Web Relevance Ranking. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1221-1231. DOI: 10.1145/3626772.3657851. Online publication date: 10-Jul-2024
  • (2024) MS MARCO Web Search: A Large-scale Information-rich Web Dataset with Millions of Real Click Labels. Companion Proceedings of the ACM Web Conference 2024, 292-301. DOI: 10.1145/3589335.3648327. Online publication date: 13-May-2024
  • (2024) Cloak: Hiding Retrieval Information in Blockchain Systems via Distributed Query Requests. IEEE Transactions on Services Computing, 1-14. DOI: 10.1109/TSC.2024.3411450. Online publication date: 2024
  • (2024) A Heuristic Multimedia Verticals Aggregated Search Approach and User Behavioral Analysis. 2024 International Conference on Engineering & Computing Technologies (ICECT), 1-6. DOI: 10.1109/ICECT61618.2024.10581027. Online publication date: 23-May-2024
  • (2024) MCFC: A Momentum-Driven Clicked Feature Compressed Pre-trained Language Model for Information Retrieval. Natural Language Processing and Chinese Computing, 69-82. DOI: 10.1007/978-981-97-9431-7_6. Online publication date: 1-Nov-2024
  • (2023) Dense Text Retrieval Based on Pretrained Language Models: A Survey. ACM Transactions on Information Systems 42:4, 1-60. DOI: 10.1145/3637870. Online publication date: 18-Dec-2023
  • (2023) Contextualizing and Expanding Conversational Queries without Supervision. ACM Transactions on Information Systems 42:3, 1-30. DOI: 10.1145/3632622. Online publication date: 17-Nov-2023
