research-article

Smoothing document language models with probabilistic term count propagation

Authors: Azadeh Shakery, ChengXiang Zhai

Information Retrieval, Volume 11, Issue 2

Pages 139 - 164

https://doi.org/10.1007/s10791-007-9041-9

Published: 01 April 2008

Abstract

Smoothing of document language models is critical in language modeling approaches to information retrieval. In this paper, we present a novel way of smoothing document language models based on propagating term counts probabilistically in a graph of documents. A key difference between our approach and previous approaches is that our smoothing algorithm can iteratively propagate counts and achieve smoothing with remotely related documents. Evaluation results on several TREC data sets show that the proposed method significantly outperforms the simple collection-based smoothing method. Compared with those other smoothing methods that also exploit local corpus structures, our method is especially effective in improving precision in top-ranked documents through “filling in” missing query terms in relevant documents, which is attractive since most users only pay attention to the top-ranked documents in search engine applications.
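
The abstract does not give the propagation formula, so the following is only a minimal sketch of the general idea of iteratively propagating term counts over a document graph; it is not the authors' algorithm. The update rule (a mix of a document's original counts and its neighbors' current counts), the interpolation weight `alpha`, the hand-written `transition` matrix, and the toy documents are all assumptions made for illustration. The final step uses standard Dirichlet prior smoothing against the collection model.

```python
from collections import Counter


def propagate_counts(counts, transition, alpha=0.5, iterations=2):
    """Iteratively propagate term counts over a document graph.

    counts:     list of {term: count} dicts, one per document.
    transition: row-stochastic matrix; transition[d][j] is the (assumed)
                probability that document d receives counts from document j.
    alpha:      hypothetical weight on propagated counts; (1 - alpha) keeps
                the document's own original counts.
    """
    current = [dict(doc) for doc in counts]
    for _ in range(iterations):
        updated = []
        for d, original in enumerate(counts):
            # Start from a discounted copy of the document's original counts.
            new = {w: (1 - alpha) * c for w, c in original.items()}
            # Add the neighbors' counts from the previous iteration.
            for j, p in enumerate(transition[d]):
                if p == 0.0:
                    continue
                for w, c in current[j].items():
                    new[w] = new.get(w, 0.0) + alpha * p * c
            updated.append(new)
        current = updated
    return current


def collection_model(counts):
    """Maximum-likelihood term distribution over the whole collection."""
    totals = Counter()
    for doc in counts:
        totals.update(doc)
    n = sum(totals.values())
    return {w: c / n for w, c in totals.items()}


def dirichlet_prob(term, doc_counts, background, mu=2000.0):
    """Dirichlet prior smoothing of p(term | document)."""
    length = sum(doc_counts.values())
    return (doc_counts.get(term, 0.0) + mu * background.get(term, 0.0)) / (length + mu)


# Toy chain: doc 0 is linked to doc 1, and doc 1 to doc 2 (hypothetical data).
docs = [
    {"language": 2, "model": 1},
    {"language": 1, "smoothing": 2},
    {"retrieval": 3},
]
transition = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]

smoothed = propagate_counts(docs, transition, alpha=0.5, iterations=2)
background = collection_model(docs)
print(smoothed[0])
print(dirichlet_prob("retrieval", smoothed[0], background, mu=10.0))
```

In this toy chain, "retrieval" occurs only in document 2, which is not a direct neighbor of document 0. After one iteration document 0 has no count for it; after the second iteration it does, because the count has passed through document 1. This is the sense in which iterative propagation can draw on remotely related documents.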

Index Terms

Smoothing document language models with probabilistic term count propagation
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Index terms have been assigned to the content through auto-classification.

Information & Contributors

Information

Published In

Information Retrieval Volume 11, Issue 2

Apr 2008

96 pages

ISSN: 1386-4564

© Springer Science+Business Media, LLC 2007.

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 April 2008

Accepted: 11 December 2007

Received: 25 June 2007

Qualifiers

Research-article

Cited By

Zamani H, Croft W, Caverlee J, Hu X, Lalmas M, Wang W (2020). Learning a Joint Search and Recommendation Model from User-Item Interactions. Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 717-725. doi:10.1145/3336191.3371818. Online publication date: 20-Jan-2020.
https://dl.acm.org/doi/10.1145/3336191.3371818

Mulhem P, Chevallet J (2013). Reading contexts for structured documents retrieval. Proceedings of the 10th Conference on Open Research Areas in Information Retrieval, pp. 47-52. doi:10.5555/2491748.2491760. Online publication date: 15-May-2013.
https://dl.acm.org/doi/10.5555/2491748.2491760

Broder A, Gabrilovich E, Josifovski V, Mavromatis G, Metzler D, Wang J, Huang J, Koudas N, Jones G, Wu X, Collins-Thompson K, An A (2010). Exploiting site-level information to improve web search. Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp. 1393-1396. doi:10.1145/1871437.1871630. Online publication date: 26-Oct-2010.
https://dl.acm.org/doi/10.1145/1871437.1871630

Dai N, Davison B, Wang Y (2010). Mining neighbors' topicality to better control authority flow. Proceedings of the 32nd European Conference on Advances in Information Retrieval, pp. 653-657. doi:10.1007/978-3-642-12275-0_69. Online publication date: 28-Mar-2010.
https://dl.acm.org/doi/10.1007/978-3-642-12275-0_69

Zhai C (2008). Statistical Language Models for Information Retrieval: A Critical Review. Foundations and Trends in Information Retrieval, 2:3, pp. 137-213. doi:10.1561/1500000008. Online publication date: 1-Mar-2008.
https://dl.acm.org/doi/10.1561/1500000008

Mei Q, Zhang D, Zhai C, Chua T, Leong M, Myaeng S, Oard D, Sebastiani F (2008). A general optimization framework for smoothing language models on graph structures. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 611-618. doi:10.1145/1390334.1390438. Online publication date: 20-Jul-2008.
https://dl.acm.org/doi/10.1145/1390334.1390438
