research-article

Neural Network Guided Fast and Efficient Query-Based Stemming by Predicting Term Co-occurrence Statistics

Published: 24 March 2022

Abstract

In information retrieval (IR) systems, stemming is used as a recall-enhancing strategy to address the vocabulary mismatch problem that arises from morphological variation. Query-based stemmers that rely on co-occurrence statistics have demonstrated remarkable improvements in retrieval effectiveness. These stemmers compute corpus co-occurrence statistics between pairs of words to discover morphological variants. Because this computation is performed online and is time-consuming, it has a negative impact on overall search efficiency. The central objective of this work is to develop a faster but approximate method for estimating the co-occurrence of two given words in a corpus, addressing the efficiency shortcomings of query-based stemmers. This work presents an empirical study of the effect of using predicted co-occurrence in query-based stemming. In particular, the proposed query expansion algorithm improves efficiency by predicting the pointwise mutual information (PMI) between a word pair using a ridge regression-based neural network. The neural network model takes as input the word embeddings of a pair of words and learns to predict the PMI value. The predicted PMI value is then used to assign relative importance to the obtained morphological variants to be included in the final query. A set of experiments is performed on three different TREC collections to validate the effectiveness and efficiency of the proposed algorithm. The results show that the proposed stemming approach leads to a remarkable efficiency improvement over existing query-based stemmers (93×105 times) without significantly impeding retrieval effectiveness.
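To make concrete the quantity that the proposed model learns to approximate, the following minimal sketch computes exact PMI from a toy corpus. It assumes document-level co-occurrence (two words co-occur if they appear in the same document) and a base-2 logarithm; the abstract does not state which estimator the authors use, so both choices, along with the helper name `build_pmi` and the toy documents, are illustrative assumptions rather than the paper's method.

```python
import math
from collections import Counter
from itertools import combinations


def build_pmi(docs):
    """Return a function pmi(w1, w2) based on document-level co-occurrence.

    Assumption: P(w) = df(w) / N and P(w1, w2) = df(w1, w2) / N, where df
    counts the documents containing the word (or both words) and N is the
    number of documents.
    """
    n_docs = len(docs)
    term_df = Counter()
    pair_df = Counter()
    for doc in docs:
        terms = sorted(set(doc))                 # unique terms, canonical order
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))   # every unordered pair once

    def pmi(w1, w2):
        a, b = sorted((w1, w2))
        joint = pair_df[(a, b)]
        if joint == 0:
            return float("-inf")                 # the pair never co-occurs
        # log2( P(a, b) / (P(a) * P(b)) ), simplified to document counts
        return math.log2(joint * n_docs / (term_df[a] * term_df[b]))

    return pmi


# Hypothetical toy corpus, for illustration only.
docs = [
    ["stem", "stemming", "retrieval"],
    ["stem", "stemming"],
    ["retrieval", "query"],
    ["stemming", "query"],
]
pmi = build_pmi(docs)
print(pmi("stem", "stemming"))    # positive: the pair co-occurs more than chance
print(pmi("retrieval", "query"))  # 0.0: exactly what independence predicts
```

Computing these counts over a full collection at query time is the cost that the abstract identifies as the efficiency bottleneck; the proposed approach replaces the lookup with a prediction made from the two word embeddings.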

Information & Contributors

Information

Published In

SN Computer Science, Volume 3, Issue 3
May 2022
1127 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 24 March 2022
Accepted: 01 March 2022
Received: 15 July 2021

Author Tags

  1. Query-based stemming
  2. Co-occurrence
  3. Pointwise mutual information
  4. Neural network
  5. Word embedding

Qualifiers

  • Research-article
