DOI: 10.1145/3486622.3493957
research-article

Neural text generation for query expansion in information retrieval

Published: 13 April 2022

Abstract

Expanding users’ queries is a well-known way to improve the performance of document retrieval systems. Several approaches have been proposed in the literature, and some of them are considered to yield state-of-the-art results in Information Retrieval. In this paper, we explore the use of text generation to automatically expand queries. We rely on a well-known neural generative model, OpenAI’s GPT-2, which comes with pre-trained models for English but can also be fine-tuned on specific corpora. Through different experiments on several datasets, we show that text generation is a very effective way to improve the performance of an IR system by a large margin (+10% MAP gains), and that it outperforms strong baselines that also rely on query expansion (RM3). This conceptually simple approach can easily be implemented on any IR system thanks to the availability of GPT code and models.
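The general idea of generation-based query expansion can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say how generated text is weighted or fed to the retrieval model, so the term-weighting scheme, the `original_weight` parameter, the overlap-based `score` function, and the `stub_generator` (a hypothetical stand-in for a neural generator such as GPT-2) are all assumptions made for the example.

```python
import re
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_query(
    query: str,
    generate: Callable[[str], List[str]],
    original_weight: int = 3,  # assumed boost for the user's own terms
) -> Counter:
    """Build a weighted bag of terms from the query and generated texts.

    `generate` takes the query as a prompt and returns a list of texts.
    Original query terms are boosted so that generated terms do not
    drown them out (an assumption of this sketch, not a published setting).
    """
    terms: Counter = Counter()
    for token in tokenize(query):
        terms[token] += original_weight
    for text in generate(query):
        terms.update(tokenize(text))
    return terms


def stub_generator(prompt: str) -> List[str]:
    """Hypothetical stand-in for a neural text generator.

    A real system would sample continuations of the prompt from a
    language model; fixed strings keep the example self-contained.
    """
    return [
        f"{prompt} is often managed with insulin and diet changes.",
        f"{prompt} aims to control blood sugar levels.",
    ]


def score(document: str, weighted_terms: Counter) -> int:
    """Toy relevance score: sum of weights of the distinct terms in the document."""
    return sum(weighted_terms[token] for token in set(tokenize(document)))


if __name__ == "__main__":
    query = "diabetes treatment"
    document = "Insulin therapy lowers blood sugar."

    unexpanded = expand_query(query, lambda q: [])
    expanded = expand_query(query, stub_generator)

    print("score without expansion:", score(document, unexpanded))
    print("score with expansion:", score(document, expanded))
```

The example document shares no word with the original query, so it only becomes retrievable once terms from the generated text are added. In a real system the toy `score` function would be replaced by the ranking function of the IR engine in use.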


Information

Published In

WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
December 2021
698 pages

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Data-augmentation
  2. GPT2
  3. Information Retrieval
  4. Neural language models
  5. Neural text generation
  6. Query expansion

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
December 14 - 17, 2021
Melbourne, VIC, Australia

Cited By

  • (2024) Deep Learning Classification of Traffic-Related Tweets: An Advanced Framework Using Deep Learning for Contextual Understanding and Traffic-Related Short Text Classification. Applied Sciences 14:23 (11009). DOI: 10.3390/app142311009. Online publication date: 27-Nov-2024
  • (2024) ChatGPT for Automated Qualitative Research: Content Analysis. Journal of Medical Internet Research 26 (e59050). DOI: 10.2196/59050. Online publication date: 25-Jul-2024
  • (2024) Towards Effective and Efficient Sparse Neural Information Retrieval. ACM Transactions on Information Systems 42:5 (1-46). DOI: 10.1145/3634912. Online publication date: 29-Apr-2024
  • (2024) A Case Study of Enhancing Sparse Retrieval using LLMs. Companion Proceedings of the ACM Web Conference 2024 (1609-1615). DOI: 10.1145/3589335.3651945. Online publication date: 13-May-2024
  • (2024) Maximizing Relation Extraction Potential: A Data-Centric Study to Unveil Challenges and Opportunities. IEEE Access 12 (167655-167682). DOI: 10.1109/ACCESS.2024.3494737. Online publication date: 2024
  • (2023) Analysis of Recent Query Expansion Techniques for Information Retrieval Systems. Proceedings of the International Conference on Intelligent Computing, Communication and Information Security (375-383). DOI: 10.1007/978-981-99-1373-2_29. Online publication date: 4-Jul-2023
  • (2022) A Hybrid Semantic Statistical Query Expansion for Arabic Information Retrieval Systems. 2022 5th International Symposium on Informatics and its Applications (ISIA) (1-6). DOI: 10.1109/ISIA55826.2022.9993572. Online publication date: 29-Nov-2022
