DOI:10.1145/3459637.3482009
research-article

Matches Made in Heaven: Toolkit and Large-Scale Datasets for Supervised Query Reformulation

Published: 30 October 2021

Abstract

Prior work has shown that systematically reformulating users' queries can improve retrieval effectiveness. Traditionally, most query reformulation techniques relied on unsupervised approaches such as query expansion through pseudo-relevance feedback. More recently, with the increasing effectiveness of neural sequence-to-sequence architectures, query reformulation has been studied as a supervised query translation problem, in which a model learns to rewrite a query into a more effective alternative. While quite effective in practice, such supervised query reformulation methods require a large number of training instances. In this paper, we present three large-scale query reformulation datasets, namely the Diamond, Platinum and Gold datasets, based on the queries in the MS MARCO dataset. The Diamond dataset consists of over 188,000 query pairs in which the original source query is matched with an alternative query that has perfect retrieval effectiveness (an average precision of 1). To the best of our knowledge, this is the first set of datasets for supervised query reformulation that offers perfect query reformulations for a large number of queries. The implementation of our fully automated tool, which is based on a transformer architecture, and our three datasets are made publicly available. We also establish baseline performance on our datasets by reporting results for strong neural query reformulation baselines. We believe that our datasets will significantly impact the future development of supervised query reformulation methods.
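
To make the "average precision of 1" criterion concrete, the following is a minimal sketch of the standard average precision measure and of a filter that keeps only candidate reformulations reaching that score. It is not the authors' toolkit: the function names `average_precision` and `perfect_pairs`, the passage identifiers, and the example queries are all hypothetical, and the rankings are assumed to come from whatever retriever is being evaluated.

```python
from typing import Dict, List, Sequence, Set, Tuple


def average_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Standard average precision of one ranked list against a set of relevant ids."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at this rank, counted only where a relevant item appears.
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def perfect_pairs(
    original_query: str,
    candidate_rankings: Dict[str, List[str]],
    relevant_ids: Set[str],
    tol: float = 1e-9,
) -> List[Tuple[str, str]]:
    """Keep (original, candidate) pairs whose candidate ranking has AP equal to 1.

    Hypothetical helper: candidate_rankings maps each candidate query string
    to the ranked list of ids retrieved for it.
    """
    return [
        (original_query, candidate)
        for candidate, ranking in candidate_rankings.items()
        if abs(average_precision(ranking, relevant_ids) - 1.0) <= tol
    ]


# Illustrative data only; these queries and ids are made up.
relevant = {"p7"}
candidates = {
    "what causes tides ocean": ["p7", "p2", "p9"],
    "tides": ["p2", "p7", "p9"],
}
pairs = perfect_pairs("why are there tides", candidates, relevant)
print(pairs)
```

An average precision of 1 requires every relevant item to be ranked ahead of every non-relevant one, which is why only the first candidate survives the filter here.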



Published In

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
October 2021
4966 pages
ISBN:9781450384469
DOI:10.1145/3459637
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. gold standard dataset
  2. query refinement
  3. query reformulation

Qualifiers

  • Research-article

Conference

CIKM '21
Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%


Cited By

  • (2025) A contrastive neural disentanglement approach for query performance prediction. Machine Learning 114:4. DOI:10.1007/s10994-025-06752-x. Online publication date: 25-Feb-2025
  • (2024) Enhanced Retrieval Effectiveness through Selective Query Generation. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 3792-3796. DOI:10.1145/3627673.3679912. Online publication date: 21-Oct-2024
  • (2024) No Query Left Behind: Query Refinement via Backtranslation. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 1961-1972. DOI:10.1145/3627673.3679729. Online publication date: 21-Oct-2024
  • (2024) Aligning Query Representation with Rewritten Query and Relevance Judgments in Conversational Search. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 1700-1710. DOI:10.1145/3627673.3679534. Online publication date: 21-Oct-2024
  • (2024) The Surprising Effectiveness of Rankers trained on Expanded Queries. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2652-2656. DOI:10.1145/3626772.3657938. Online publication date: 10-Jul-2024
  • (2024) MultiEM: Efficient and Effective Unsupervised Multi-Table Entity Matching. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 3421-3434. DOI:10.1109/ICDE60146.2024.00264. Online publication date: 13-May-2024
  • (2024) Enhancing RAG’s Retrieval via Query Backtranslations. Web Information Systems Engineering – WISE 2024, 270-285. DOI:10.1007/978-981-96-0579-8_20. Online publication date: 29-Nov-2024
  • (2024) RePair My Queries: Personalized Query Reformulation via Conditional Transformers. Web Information Systems Engineering – WISE 2024, 219-229. DOI:10.1007/978-981-96-0579-8_16. Online publication date: 29-Nov-2024
  • (2023) Neural Disentanglement of Query Difficulty and Semantics. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 4264-4268. DOI:10.1145/3583780.3615189. Online publication date: 21-Oct-2023
  • (2023) RePair: An Extensible Toolkit to Generate Large-Scale Datasets for Query Refinement via Transformers. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 5376-5380. DOI:10.1145/3583780.3615129. Online publication date: 21-Oct-2023
