A general language model for information retrieval

Authors: Fei Song, W. Bruce Croft

CIKM '99: Proceedings of the eighth international conference on Information and knowledge management

Pages 316 - 321

https://doi.org/10.1145/319950.320022

Published: 01 November 1999

Abstract

Statistical language modeling has been successfully used for speech recognition, part-of-speech tagging, and syntactic parsing. Recently, it has also been applied to information retrieval. According to this new paradigm, each document is viewed as a language sample, and a query as a generation process. The retrieved documents are ranked based on the probabilities of producing a query from the corresponding language models of these documents. In this paper, we present a new language model for information retrieval, which is based on a range of data smoothing techniques, including the Good-Turing estimate, curve-fitting functions, and model combinations. Our model is conceptually simple and intuitive, and can be easily extended to incorporate probabilities of phrases such as word pairs and word triples. Experiments with the Wall Street Journal and TREC4 data sets showed that the performance of our model is comparable to that of INQUERY and better than that of another language model for information retrieval. In particular, word pairs are shown to be useful in improving retrieval performance.
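The ranking idea described in the abstract (score each document by the probability that its language model generates the query) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the model from the paper: it smooths the document model by simple linear interpolation with a collection model rather than with the Good-Turing estimate or curve-fitting functions, it uses single words only (no word pairs), and the function names, the weight `lam`, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def score(query, doc, collection_counts, collection_len, lam=0.5):
    """Log-probability of generating `query` from a smoothed model of `doc`.

    Assumption: the document model is smoothed by linear interpolation with
    the collection model; `lam` is an illustrative weight, not a tuned value.
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)
    log_p = 0.0
    for term in query:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_col = collection_counts[term] / collection_len
        p = lam * p_doc + (1 - lam) * p_col
        if p == 0.0:
            # Term unseen in the whole collection: the query cannot be generated.
            return float("-inf")
        log_p += math.log(p)
    return log_p


def rank(query, docs, lam=0.5):
    """Return document names sorted by descending query log-likelihood."""
    collection_counts = Counter(t for d in docs.values() for t in d)
    collection_len = sum(collection_counts.values())
    scored = {
        name: score(query, d, collection_counts, collection_len, lam)
        for name, d in docs.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Toy collection (hypothetical data, for illustration only).
docs = {
    "d1": "stock prices fell on wall street".split(),
    "d2": "the language model ranks documents".split(),
    "d3": "wall street journal reports stock news".split(),
}
ranking = rank("stock news".split(), docs)
print(ranking)
```

Smoothing matters here because a document that lacks one query term would otherwise receive probability zero. With interpolation, "d1" still gets a nonzero score for the query even though it never mentions "news".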

Index Terms

A general language model for information retrieval
1. Information systems
  1. Data management systems
    1. Database design and models
    2. Query languages
  2. Information retrieval
    1. Retrieval models and ranking
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query languages (principles)

Published In

CIKM '99: Proceedings of the eighth international conference on Information and knowledge management

November 1999

564 pages

ISBN:1581131461

DOI:10.1145/319950

Editor:
Susan Gauch

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

CIKM99: Conference on Information and Knowledge Management

November 2 - 6, 1999

Kansas City, Missouri, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics

Total Citations: 278
Total Downloads: 4,410
Downloads (Last 12 months): 731
Downloads (Last 6 weeks): 66

Reflects downloads up to 11 Dec 2024
