research-article

Analysis of the Paragraph Vector Model for Information Retrieval

Authors:

W. Bruce Croft

ICTIR '16: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval

Pages 133 - 142

https://doi.org/10.1145/2970398.2970409

Published: 12 September 2016

Abstract

Previous studies have shown that semantically meaningful representations of words and text can be acquired through neural embedding models. In particular, paragraph vector (PV) models have shown impressive performance in some natural language processing tasks by estimating a document (topic) level language model. Integrating the PV models with traditional language model approaches to retrieval, however, produces unstable performance and limited improvements. In this paper, we formally discuss three intrinsic problems of the original PV model that restrict its performance in retrieval tasks. We also describe modifications to the model that make it more suitable for the IR task, and show their impact through experiments and case studies. The three issues we address are (1) the unregulated training process of PV is vulnerable to short document over-fitting that produces length bias in the final retrieval model; (2) the corpus-based negative sampling of PV leads to a weighting scheme for words that overly suppresses the importance of frequent words; and (3) the lack of word-context information makes PV unable to capture word substitution relationships.
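As a rough illustration of what integrating a PV-style document language model with a traditional language model can look like, the sketch below mixes a softmax over word-document dot products with a Dirichlet-smoothed unigram model. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the random toy vectors, the uniform corpus model, and the values of `mu` and `lam` are all hypothetical and chosen only to make the example run.

```python
import math
import random


def pv_language_model(doc_vec, word_vecs):
    """Softmax over word-document dot products, read as P_PV(w | d)."""
    scores = [sum(w * d for w, d in zip(vec, doc_vec)) for vec in word_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dirichlet_language_model(doc_counts, corpus_probs, mu=1000.0):
    """Dirichlet-smoothed unigram model P_LM(w | d); mu is an assumed value."""
    doc_len = sum(doc_counts)
    return [(c + mu * p) / (doc_len + mu) for c, p in zip(doc_counts, corpus_probs)]


def query_log_likelihood(query_ids, p_lm, p_pv, lam=0.1):
    """Query log-likelihood under the mixture (1 - lam) * P_LM + lam * P_PV."""
    return sum(math.log((1.0 - lam) * p_lm[i] + lam * p_pv[i]) for i in query_ids)


# Toy data (hypothetical): 6 vocabulary words, 4-dimensional embeddings.
random.seed(0)
vocab_size, dim = 6, 4
word_vecs = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]
doc_vec = [random.gauss(0.0, 1.0) for _ in range(dim)]
doc_counts = [3, 0, 1, 0, 2, 0]                   # term counts in one document
corpus_probs = [1.0 / vocab_size] * vocab_size    # uniform corpus model, for simplicity

p_pv = pv_language_model(doc_vec, word_vecs)
p_lm = dirichlet_language_model(doc_counts, corpus_probs)
score = query_log_likelihood([0, 4], p_lm, p_pv)  # query made of word ids 0 and 4
print(sum(p_pv), sum(p_lm), score)
```

Setting `lam` to 0 recovers plain Dirichlet-smoothed query likelihood, so the mixture weight controls how much the embedding-based estimate can shift the ranking score.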

Index Terms

Analysis of the Paragraph Vector Model for Information Retrieval
1. Information systems
  1. Information retrieval
    1. Document representation
    2. Retrieval models and ranking
      1. Language models

Published In

ICTIR '16: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval

September 2016

318 pages

ISBN:9781450344975

DOI:10.1145/2970398

General Chairs: Ben Carterette (University of Delaware, USA), Hui Fang (University of Delaware, USA)

Program Chairs: Mounia Lalmas (Yahoo! Labs, UK), Jian-Yun Nie (University of Montreal, Canada)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

ICTIR '16

Sponsor:

SIGIR

ICTIR '16: ACM SIGIR International Conference on the Theory of Information Retrieval

September 12 - 16, 2016

Newark, Delaware, USA

Acceptance Rates

ICTIR '16 Paper Acceptance Rate 41 of 79 submissions, 52%;

Overall Acceptance Rate 235 of 527 submissions, 45%
