Corpus structure, language models, and ad hoc information retrieval

Published: 25 July 2004

Abstract

Most previous work on the recently developed language-modeling approach to information retrieval focuses on document-specific characteristics, and therefore does not take into account the structure of the surrounding corpus. We propose a novel algorithmic framework in which information provided by document-based language models is enhanced by the incorporation of information drawn from clusters of similar documents. Using this framework, we develop a suite of new algorithms. Even the simplest typically outperforms the standard language-modeling approach in precision and recall, and our new interpolation algorithm posts statistically significant improvements for both metrics over all three corpora tested.
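The idea of enhancing a document language model with information from a cluster of similar documents can be illustrated with a small sketch. This is not the paper's exact formulation: it assumes unigram maximum-likelihood models, a simple linear mixture of document and cluster models, and corpus back-off smoothing. The function names, the toy documents, the hand-assigned clusters, and the weights `lam` and `mu` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def mle(tokens):
    """Maximum-likelihood unigram model: term -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def interpolated_log_likelihood(query, doc, cluster_docs, corpus_docs,
                                lam=0.6, mu=0.1):
    """Log query likelihood under a document/cluster mixture.

    p(t) = (1 - mu) * [lam * p_doc(t) + (1 - lam) * p_cluster(t)]
           + mu * p_corpus(t)
    lam and mu are illustrative values, not settings from the paper.
    """
    p_doc = mle(doc)
    p_cluster = mle([t for d in cluster_docs for t in d])
    p_corpus = mle([t for d in corpus_docs for t in d])
    score = 0.0
    for term in query:
        mixed = lam * p_doc.get(term, 0.0) + (1 - lam) * p_cluster.get(term, 0.0)
        p = (1 - mu) * mixed + mu * p_corpus.get(term, 0.0)
        if p == 0.0:  # term does not occur anywhere in the corpus
            return float("-inf")
        score += math.log(p)
    return score


# Toy corpus; cluster membership is assigned by hand (an assumption).
docs = {
    "d1": "cats purr and cats sleep".split(),
    "d2": "kittens purr and play".split(),
    "d3": "stocks fell as markets closed".split(),
}
corpus = list(docs.values())
clusters = {
    "d1": [docs["d1"], docs["d2"]],
    "d2": [docs["d1"], docs["d2"]],
    "d3": [docs["d3"]],
}
query = ["kittens", "sleep"]
scores = {
    name: interpolated_log_likelihood(query, doc, clusters[name], corpus)
    for name, doc in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy setup neither `d1` nor `d2` contains both query terms, but each borrows probability mass for the missing term from its shared cluster, so both rank above the unrelated `d3`.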

Information & Contributors

Information

Published In

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
July 2004
624 pages
ISBN: 1581138814
DOI: 10.1145/1008992

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. aspect models
  2. cluster-based language models
  3. clustering
  4. interpolation model
  5. language modeling
  6. smoothing

Qualifiers

  • Article

Conference

SIGIR04

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 9
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 14 Jan 2025

Cited By

  • (2023) Information Retrieval: Recent Advances and Beyond. IEEE Access 11, 76581-76604. DOI: 10.1109/ACCESS.2023.3295776. Online publication date: 2023.
  • (2023) Effect of distance measures on confidences of t-SNE embeddings and its implications on clustering for scRNA-seq data. Scientific Reports 13:1. DOI: 10.1038/s41598-023-32966-x. Online publication date: 21-Apr-2023.
  • (2023) Novel Framework for Improving the Correctness of Reference Answers to Enhance Results of ASAG Systems. SN Computer Science 4:4. DOI: 10.1007/s42979-023-01682-8. Online publication date: 24-May-2023.
  • (2023) Bengali document retrieval using a language modeling approach enhanced by improved cluster-based smoothing. Sādhanā 48:4. DOI: 10.1007/s12046-023-02258-1. Online publication date: 4-Oct-2023.
  • (2022) Adaptive Re-Ranking with a Corpus Graph. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 1491-1500. DOI: 10.1145/3511808.3557231. Online publication date: 17-Oct-2022.
  • (2022) Semantic Models for the First-Stage Retrieval: A Comprehensive Review. ACM Transactions on Information Systems 40:4, 1-42. DOI: 10.1145/3486250. Online publication date: 24-Mar-2022.
  • (2022) Competitive Search. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2838-2849. DOI: 10.1145/3477495.3532771. Online publication date: 6-Jul-2022.
  • (2022) From Cluster Ranking to Document Ranking. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2137-2141. DOI: 10.1145/3477495.3531819. Online publication date: 6-Jul-2022.
  • (2021) Deep Query Likelihood Model for Information Retrieval. Advances in Information Retrieval, 463-470. DOI: 10.1007/978-3-030-72240-1_49. Online publication date: 30-Mar-2021.
  • (2019) Cluster-Based Focused Retrieval. Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2305-2308. DOI: 10.1145/3357384.3358087. Online publication date: 3-Nov-2019.
