Analysis of the Paragraph Vector Model for Information Retrieval

Published: 12 September 2016
DOI: 10.1145/2970398.2970409

Abstract

Previous studies have shown that semantically meaningful representations of words and text can be acquired through neural embedding models. In particular, paragraph vector (PV) models have shown impressive performance in some natural language processing tasks by estimating a document-level (topic-level) language model. Integrating the PV models with traditional language model approaches to retrieval, however, produces unstable performance and limited improvements. In this paper, we formally discuss three intrinsic problems of the original PV model that restrict its performance in retrieval tasks. We also describe modifications to the model that make it more suitable for the IR task, and show their impact through experiments and case studies. The three issues we address are: (1) the unregulated training process of PV is vulnerable to over-fitting on short documents, which produces a length bias in the final retrieval model; (2) the corpus-based negative sampling of PV leads to a weighting scheme for words that overly suppresses the importance of frequent words; and (3) the lack of word-context information makes PV unable to capture word substitution relationships.
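To make "integrating the PV models with traditional language model approaches" concrete, the following is a minimal Python sketch of one common way to do it: linearly interpolating a Dirichlet-smoothed query likelihood estimate with a PV-style word distribution obtained from a softmax over word and document vectors. This is an illustration under stated assumptions, not the authors' implementation. The function names, the toy vectors, and the parameter values `lam` and `mu` are all hypothetical.

```python
import math

def pv_word_probs(doc_vec, word_vecs):
    """PV-style P(w|d): softmax over dot products of each word vector with the document vector."""
    scores = {w: sum(a * b for a, b in zip(v, doc_vec)) for w, v in word_vecs.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def dirichlet_ql(word, doc_counts, doc_len, corpus_probs, mu):
    """Dirichlet-smoothed query likelihood estimate of P(w|d)."""
    return (doc_counts.get(word, 0) + mu * corpus_probs[word]) / (doc_len + mu)

def interpolated_score(query, doc_counts, doc_vec, word_vecs, corpus_probs,
                       lam=0.1, mu=1000.0):
    """Log-likelihood of the query under (1 - lam) * P_QL + lam * P_PV.

    Assumes every query word appears in both word_vecs and corpus_probs.
    """
    doc_len = sum(doc_counts.values())
    p_pv = pv_word_probs(doc_vec, word_vecs)
    score = 0.0
    for w in query:
        p_ql = dirichlet_ql(w, doc_counts, doc_len, corpus_probs, mu)
        score += math.log((1.0 - lam) * p_ql + lam * p_pv[w])
    return score

# Toy data (hypothetical, for illustration only).
word_vecs = {"retrieval": [1.0, 0.0], "model": [0.0, 1.0], "cat": [-1.0, -1.0]}
corpus_probs = {"retrieval": 0.3, "model": 0.5, "cat": 0.2}
doc_counts = {"retrieval": 2, "model": 1}
doc_vec = [0.5, 0.5]

print(interpolated_score(["retrieval", "model"], doc_counts, doc_vec,
                         word_vecs, corpus_probs))
```

Setting `lam=0` recovers the plain query likelihood score, so the interpolation weight controls how much the embedding-based estimate can affect the ranking.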

      Published In

      ICTIR '16: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval
      September 2016
      318 pages
      ISBN:9781450344975
      DOI:10.1145/2970398
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. language model
      2. paragraph vector

      Qualifiers

      • Research-article

      Conference

      ICTIR '16
      Acceptance Rates

      ICTIR '16 Paper Acceptance Rate: 41 of 79 submissions, 52%
      Overall Acceptance Rate: 235 of 527 submissions, 45%

      Article Metrics

      • Downloads (Last 12 months): 145
      • Downloads (Last 6 weeks): 24
      Reflects downloads up to 11 Dec 2024

      Cited By

      • (2024) CatRevenge: towards effective revenge text detection in online social media with paragraph embedding and CATBoost. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-18791-y. Online publication date: 1-Apr-2024
      • (2023) ON THE EFFECTIVENESS OF PARAGRAPH VECTOR MODELS IN DOCUMENT SIMILARITY ESTIMATION FOR TURKISH NEWS CATEGORIZATION. Eskişehir Technical University Journal of Science and Technology A - Applied Sciences and Engineering, 24:1 (23-34). DOI: 10.18038/estubtda.1175001. Online publication date: 29-Mar-2023
      • (2023) Information Retrieval: Recent Advances and Beyond. IEEE Access, 11 (76581-76604). DOI: 10.1109/ACCESS.2023.3295776. Online publication date: 2023
      • (2022) Text Mining to Alleviate the Cold-Start Problem of Adaptive Comparative Judgments. Frontiers in Education, 7. DOI: 10.3389/feduc.2022.854378. Online publication date: 4-Jul-2022
      • (2022) Semantic Models for the First-Stage Retrieval: A Comprehensive Review. ACM Transactions on Information Systems, 40:4 (1-42). DOI: 10.1145/3486250. Online publication date: 24-Mar-2022
      • (2022) Sparse Topic Modeling: Computational Efficiency, Near-Optimal Algorithms, and Statistical Inference. Journal of the American Statistical Association, 118:543 (1849-1861). DOI: 10.1080/01621459.2021.2018329. Online publication date: 31-Jan-2022
      • (2021) Classification of Full Text Biomedical Documents: Sections Importance Assessment. Applied Sciences, 11:6 (2674). DOI: 10.3390/app11062674. Online publication date: 17-Mar-2021
      • (2020) Measuring the diffusion of innovations with paragraph vector topic models. PLOS ONE, 15:1 (e0226685). DOI: 10.1371/journal.pone.0226685. Online publication date: 22-Jan-2020
      • (2020) Salient context-based semantic matching for information retrieval. EURASIP Journal on Advances in Signal Processing, 2020:1. DOI: 10.1186/s13634-020-00688-1. Online publication date: 11-Jul-2020
      • (2020) A BERT-Based Semantic Matching Ranker for Open-Domain Question Answering. Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval (31-36). DOI: 10.1145/3443279.3443301. Online publication date: 18-Dec-2020
