research-article

Neural text generation for query expansion in information retrieval

Author:

Vincent Claveau

WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology

Pages 202 - 209

https://doi.org/10.1145/3486622.3493957

Published: 13 April 2022

Abstract

Expanding users’ queries is a well-known way to improve the performance of document retrieval systems. Several approaches have been proposed in the literature, and some of them are considered to yield state-of-the-art results in Information Retrieval (IR). In this paper, we explore the use of text generation to expand queries automatically. We rely on a well-known neural generative model, OpenAI’s GPT-2, which comes with pre-trained models for English but can also be fine-tuned on specific corpora. Through different experiments on several datasets, we show that text generation is a very effective way to improve the performance of an IR system by a large margin (+10% MAP gains), and that it outperforms strong baselines that also rely on query expansion (RM3). This conceptually simple approach can easily be implemented on any IR system thanks to the availability of GPT code and models.
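To make the idea in the abstract concrete, the following is a minimal sketch of query expansion with generated text. It is not the paper's implementation: the abstract does not specify how generated text is weighted or which retrieval model is used, so the term-overlap score, the `query_weight` parameter, and all function names here are illustrative assumptions. The `generate` argument is a hypothetical callable standing in for a text generator such as GPT-2; a fixed stub is used so that the example runs without any model.

```python
import re
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_query(query: str,
                 generate: Callable[[str], List[str]],
                 query_weight: int = 1) -> Counter:
    """Return a weighted bag of words: query terms plus generated terms.

    `generate` is a hypothetical callable (an assumption of this sketch):
    given the query as a prompt, it returns a list of generated texts.
    """
    terms: Counter = Counter()
    for token in tokenize(query):
        terms[token] += query_weight
    for text in generate(query):
        terms.update(tokenize(text))
    return terms


def score(weighted_terms: Counter, document: str) -> int:
    """Toy relevance score: weighted term overlap (stand-in for a real ranker)."""
    doc_tf = Counter(tokenize(document))
    return sum(weight * doc_tf[term] for term, weight in weighted_terms.items())


def stub_generator(prompt: str) -> List[str]:
    """Stand-in for a neural generator; returns one fixed continuation."""
    return ["search engines rank documents by relevance"]


if __name__ == "__main__":
    query = "document retrieval"
    document = "Search engines compute relevance scores"
    plain = Counter(tokenize(query))
    expanded = expand_query(query, stub_generator)
    print("score without expansion:", score(plain, document))
    print("score with expansion:", score(expanded, document))
```

In this toy setting, the document shares no term with the original query, so only the expanded query can match it. In a real system the stub would be replaced by an actual generator and the overlap score by the system's own ranking function.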

Published In

WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology

December 2021

698 pages

ISBN:9781450391153

DOI:10.1145/3486622

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGAI: ACM Special Interest Group on Artificial Intelligence

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WI-IAT '21

Sponsor:

SIGAI

WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence

December 14 - 17, 2021

Melbourne, VIC, Australia
