Article, DOI: 10.1145/383952.384005

Modeling score distributions for combining the outputs of search engines

Published: 01 September 2001

Abstract

In this paper the score distributions of a number of text search engines are modeled. It is shown empirically that, on a per-query basis, the score distributions may be fitted using an exponential distribution for the set of non-relevant documents and a normal distribution for the set of relevant documents. Experiments show that this model fits TREC-3 and TREC-4 data not only for probabilistic search engines such as INQUERY but also for vector space search engines such as SMART, for English. We have also used this model to fit the output of other search engines, such as LSI search engines and search engines indexing other languages such as Chinese.
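As an illustration of the per-query model described above, the following minimal sketch fits an exponential to non-relevant scores and a normal to relevant scores by maximum likelihood. It is not taken from the paper: the function names and the score lists are hypothetical, and maximum likelihood is assumed here as the fitting method.

```python
import math
from statistics import fmean, pstdev

def fit_exponential(scores):
    """Maximum-likelihood rate of an exponential: lambda = 1 / mean."""
    return 1.0 / fmean(scores)

def fit_normal(scores):
    """Maximum-likelihood mean and standard deviation of a normal."""
    return fmean(scores), pstdev(scores)

def exponential_pdf(s, lam):
    """Density of the exponential model for non-relevant scores."""
    return lam * math.exp(-lam * s) if s >= 0 else 0.0

def normal_pdf(s, mu, sigma):
    """Density of the normal model for relevant scores."""
    z = (s - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical scores for one query, split using known relevance judgments.
nonrelevant = [0.05, 0.10, 0.15, 0.20, 0.50]
relevant = [0.6, 0.7, 0.8, 0.9]

lam = fit_exponential(nonrelevant)
mu, sigma = fit_normal(relevant)
print(f"lambda={lam:.3f} mu={mu:.3f} sigma={sigma:.3f}")
```

With judged data of this kind the two fits are independent; the harder case, where no relevance information is available, is what the mixture model in the next paragraph addresses.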
It is then shown that, given a query for which relevance information is not available, a mixture model consisting of an exponential and a normal distribution can be fitted to the score distribution. These distributions can be used to map the scores of a search engine to probabilities. We also discuss how the shape of the score distributions arises given certain assumptions about word distributions in documents. We hypothesize that all 'good' text search engines operating on any language have similar characteristics.
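The sketch below shows one way such a mixture could be fitted without relevance labels and then used to map a score to a probability of relevance via Bayes' rule. The abstract does not name a fitting procedure; expectation-maximization (EM) is assumed here as a standard choice. All names, the initialization from the top 10% of scores, and the synthetic data are hypothetical.

```python
import math
import random

def exp_pdf(s, lam):
    return lam * math.exp(-lam * s)

def norm_pdf(s, mu, sigma):
    z = (s - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def prob_relevant(s, pi, lam, mu, sigma):
    """Posterior P(relevant | score) under the two-component mixture."""
    rel = pi * norm_pdf(s, mu, sigma)
    non = (1.0 - pi) * exp_pdf(s, lam)
    return rel / (rel + non)

def fit_mixture(scores, init_fraction=0.1, iters=100, min_sigma=1e-3):
    """EM for pi * Normal(mu, sigma) + (1 - pi) * Exponential(lam).

    Assumes non-negative scores. Initialization (an assumption of this
    sketch) treats the top `init_fraction` of scores as relevant.
    """
    n = len(scores)
    top = sorted(scores)[-max(2, int(init_fraction * n)):]
    pi = len(top) / n
    mu = sum(top) / len(top)
    sigma = max(math.sqrt(sum((s - mu) ** 2 for s in top) / len(top)), min_sigma)
    lam = n / sum(scores)
    for _ in range(iters):
        # E-step: responsibility of the normal (relevant) component.
        resp = [prob_relevant(s, pi, lam, mu, sigma) for s in scores]
        # M-step: weighted maximum-likelihood updates.
        weight_rel = sum(resp)
        pi = weight_rel / n
        mu = sum(r * s for r, s in zip(resp, scores)) / weight_rel
        var = sum(r * (s - mu) ** 2 for r, s in zip(resp, scores)) / weight_rel
        sigma = max(math.sqrt(var), min_sigma)
        lam = (n - weight_rel) / sum((1.0 - r) * s for r, s in zip(resp, scores))
    return pi, lam, mu, sigma

# Synthetic scores for one query (hypothetical): 900 non-relevant, 100 relevant.
random.seed(0)
scores = [random.expovariate(10.0) for _ in range(900)]
scores += [random.gauss(0.8, 0.1) for _ in range(100)]

pi, lam, mu, sigma = fit_mixture(scores)
print(f"pi={pi:.3f} lambda={lam:.2f} mu={mu:.3f} sigma={sigma:.3f}")
print("P(rel | 0.9) =", round(prob_relevant(0.9, pi, lam, mu, sigma), 3))
```

EM is sensitive to initialization, so on real score lists the starting values and the floor on sigma would likely need tuning.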
This model has many possible applications. For example, the outputs of different search engines can be combined by averaging the probabilities (optimal if the search engines are independent) or by using the probabilities to select the best engine for each query. Results show that the technique performs as well as the best current combination techniques.
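The averaging combination mentioned above can be sketched as follows, assuming each engine's scores have already been mapped to probabilities of relevance. The function name, the document identifiers, and the choice to treat a document missing from an engine's list as probability 0 are assumptions of this sketch, not details from the paper.

```python
def combine_by_average(engine_probs):
    """Rank documents by the mean of per-engine relevance probabilities.

    engine_probs: list of dicts mapping doc id -> P(relevant | score).
    A document absent from an engine's output is given probability 0.0
    for that engine (an assumption of this sketch).
    """
    docs = set().union(*engine_probs)
    n = len(engine_probs)
    combined = {d: sum(p.get(d, 0.0) for p in engine_probs) / n for d in docs}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical probabilities from two engines for the same query.
engine_a = {"d1": 0.9, "d2": 0.2, "d3": 0.4}
engine_b = {"d1": 0.5, "d2": 0.6, "d4": 0.7}

ranking = combine_by_average([engine_a, engine_b])
for doc, p in ranking:
    print(doc, round(p, 3))
```

Because the inputs are probabilities rather than raw scores, engines with different score ranges can be averaged directly without a separate normalization step.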

Published In

SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
September 2001
454 pages
ISBN:1581133316
DOI:10.1145/383952

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGIR01

Acceptance Rates

SIGIR '01 Paper Acceptance Rate 47 of 201 submissions, 23%;
Overall Acceptance Rate 792 of 3,983 submissions, 20%
