research-article

Smoothing document language models with probabilistic term count propagation

Published: 01 April 2008

Abstract

Smoothing of document language models is critical in language modeling approaches to information retrieval. In this paper, we present a novel way of smoothing document language models based on propagating term counts probabilistically in a graph of documents. A key difference between our approach and previous approaches is that our smoothing algorithm can propagate counts iteratively and thus smooth a document with remotely related documents. Evaluation results on several TREC data sets show that the proposed method significantly outperforms the simple collection-based smoothing method. Compared with other smoothing methods that also exploit local corpus structures, our method is especially effective in improving precision among top-ranked documents by “filling in” missing query terms in relevant documents. This is attractive because most users of search engine applications pay attention only to the top-ranked documents.
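The abstract does not state the propagation formula, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a document graph whose edge weights are row-normalized cosine similarities, a hypothetical interpolation weight `alpha`, and an update rule that mixes each document's original counts with the counts of its neighbors. None of these choices is taken from the paper. The final step applies standard Dirichlet-prior smoothing against the collection, which corresponds to the collection-based baseline the abstract mentions.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def transition_matrix(counts):
    """Row-normalized similarities between documents (assumed graph weights)."""
    n = len(counts)
    matrix = []
    for i in range(n):
        sims = [cosine(counts[i], counts[j]) if i != j else 0.0 for j in range(n)]
        total = sum(sims)
        matrix.append([s / total if total else 0.0 for s in sims])
    return matrix


def propagate_counts(counts, alpha=0.5, iterations=10):
    """Assumed update: c_next(w, d) = (1 - alpha) * c_0(w, d)
    + alpha * sum_u P[d][u] * c_current(w, u)."""
    P = transition_matrix(counts)
    n, v = len(counts), len(counts[0])
    current = [[float(c) for c in row] for row in counts]
    for _ in range(iterations):
        current = [
            [
                (1 - alpha) * counts[d][w]
                + alpha * sum(P[d][u] * current[u][w] for u in range(n))
                for w in range(v)
            ]
            for d in range(n)
        ]
    return current


def dirichlet_lm(doc_counts, collection_counts, mu=10.0):
    """Dirichlet-prior smoothing: (c(w, d) + mu * p(w | C)) / (|d| + mu)."""
    collection_total = sum(collection_counts)
    doc_length = sum(doc_counts)
    return [
        (c + mu * cc / collection_total) / (doc_length + mu)
        for c, cc in zip(doc_counts, collection_counts)
    ]


# Toy vocabulary: ["smoothing", "language", "model", "retrieval"]
counts = [
    [2, 1, 0, 0],  # d0 shares a term only with d1
    [0, 1, 2, 0],  # d1 links d0 and d2
    [0, 0, 1, 3],  # d2 is the only document containing "retrieval"
]
collection = [sum(col) for col in zip(*counts)]

one_step = propagate_counts(counts, iterations=1)
two_steps = propagate_counts(counts, iterations=2)
model_d0 = dirichlet_lm(two_steps[0], collection)
```

In this toy graph, d0 and d2 share no terms, so a single propagation step leaves the "retrieval" count of d0 at zero. After a second step, d0 receives a non-zero count through d1, which illustrates how iterating can reach remotely related documents and fill in a missing query term.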




    Information

    Published In

    Information Retrieval, Volume 11, Issue 2
    Apr 2008
    96 pages

    Publisher

    Kluwer Academic Publishers

    United States

    Publication History

    Published: 01 April 2008
    Accepted: 11 December 2007
    Received: 25 June 2007

    Author Tags

    1. Language models
    2. Probabilistic propagation
    3. Smoothing
    4. Term count propagation

    Qualifiers

    • Research-article


    Cited By

    • (2020) Learning a Joint Search and Recommendation Model from User-Item Interactions. Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 717-725. doi:10.1145/3336191.3371818. Online publication date: 20-Jan-2020
    • (2013) Reading contexts for structured documents retrieval. Proceedings of the 10th Conference on Open Research Areas in Information Retrieval, pp. 47-52. doi:10.5555/2491748.2491760. Online publication date: 15-May-2013
    • (2010) Exploiting site-level information to improve web search. Proceedings of the 19th ACM international conference on Information and knowledge management, pp. 1393-1396. doi:10.1145/1871437.1871630. Online publication date: 26-Oct-2010
    • (2010) Mining neighbors' topicality to better control authority flow. Proceedings of the 32nd European conference on Advances in Information Retrieval, pp. 653-657. doi:10.1007/978-3-642-12275-0_69. Online publication date: 28-Mar-2010
    • (2008) Statistical Language Models for Information Retrieval: A Critical Review. Foundations and Trends in Information Retrieval 2:3, pp. 137-213. doi:10.1561/1500000008. Online publication date: 1-Mar-2008
    • (2008) A general optimization framework for smoothing language models on graph structures. Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 611-618. doi:10.1145/1390334.1390438. Online publication date: 20-Jul-2008
