[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Efficient Search Engine Measurements

Published: 01 October 2011 Publication History

Abstract

We address the problem of externally measuring aggregate functions over documents indexed by search engines, like corpus size, index freshness, and density of duplicates in the corpus. State of the art estimators for such quantities [Bar-Yossef and Gurevich 2008b; Broder et al. 2006] are biased due to inaccurate approximation of the so called “document degrees”. In addition, the estimators in Bar-Yossef and Gurevich [2008b] are quite costly, due to their reliance on rejection sampling.
We present new estimators that are able to overcome the bias introduced by approximate degrees. Our estimators are based on a careful implementation of an approximate importance sampling procedure. Comprehensive theoretical and empirical analysis of the estimators demonstrates that they have essentially no bias even in situations where document degrees are poorly approximated.
By avoiding the costly rejection sampling approach, our new importance sampling estimators are significantly more efficient than the estimators proposed in Bar-Yossef and Gurevich [2008b]. Furthermore, building on an idea from Broder et al. [2006], we discuss Rao-Blackwellization as a generic method for reducing variance in search engine estimators. We show that Rao-Blackwellizing our estimators results in performance improvements, without compromising accuracy.

References

[1]
Anagnostopoulos, A., Broder, A. Z., and Carmel, D. 2006. Sampling search-engine results. World Wide Web 9, 4, 397--429.
[2]
Bar-Yossef, Z., Berg, A., Chien, S., Fakcharoenphol, J., and Weitz, D. 2000. Approximating aggregate queries about Web pages via random walks. In Proceedings of the 26th Annual Conference on Very Large Databases. 535--544.
[3]
Bar-Yossef, Z. and Gurevich, M. 2007. Efficient search engine measurements. In Proceedings of the 16th International World Wide Web Conference (WWW). ACM, New York, 401--410.
[4]
Bar-Yossef, Z. and Gurevich, M. 2008a. Mining search engine query logs via suggestion sampling. Proc. VLDB 1, 1, 54--65.
[5]
Bar-Yossef, Z. and Gurevich, M. 2008b. Random sampling from a search engine’s index. J. ACM 55, 5, 1--74.
[6]
Bar-Yossef, Z. and Gurevich, M. 2009. Estimating the impressionrank of web pages. In Proceedings of the 18th International Conference on World Wide Web (WWW’09). ACM, New York, 41--50.
[7]
Baykan, E., Henzinger, M. R., Keller, S. F., Castelberg, S. D., and Kinzler, M. 2009. A comparison of techniques for sampling web pages. In Proceedings of the Symposium on Theoretical Aspects of Computer Science (STACS). 13--30.
[8]
Bharat, K. and Broder, A. 1998. A technique for measuring the relative size and overlap of public Web search engines. In Proceedings of the 7th International Conference on World Wide Web. ACM, New York, 379--388.
[9]
Bradlow, E. T. and Schmittlein, D. C. 2000. The little engines that could: Modeling the performance of World Wide Web search engines. Market. Sci. 19, 43--62.
[10]
Broder, A., Fontoura, M., Josifovski, V., Kumar, R., Motwani, R., Nabar, S., Panigrahy, R., Tomkins, A., and Xu, Y. 2006. Estimating corpus size via queries. In Proceedings of the 15th International Conference on Information and Knowledge Management (CIKM). ACM, New York, 594--603.
[11]
Casella, G. and Robert, C. P. 1996. Rao-Blackwellisation of sampling schemes. Biometrika 83, 1, 81--94.
[12]
Cheney, M. and Perry, M. 2005. A comparison of the size of the Yahoo! and Google indices. http://www.mostpopularsites.net/MPS_News/Search_Engine_Facts/A_Comparison_of_the_Size_of_the_Yahoo!_and_Google_Indices.
[13]
Dasgupta, A., Das, G., and Mannila, H. 2007. A random walk approach to sampling hidden databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’07). ACM, New York, 629--640.
[14]
dmoz. The open directory project. http://dmoz.org.
[15]
Dobra, A. and Fienberg, S. E. 2004. How large is the worldwide web? In Web Dynamics, 23--44.
[16]
Goldreich, O. 1997. A sample of samplers - A computational perspective on sampling (survey). Electron. Colloq. Computat. Complex. (ECCC) 4, 20.
[17]
Gulli, A. and Signorini, A. 2005. The indexable Web is more than 11.5 billion pages. In Proceedings of the 14th International Conference on the World Wide Web (WWW). 902--903.
[18]
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 1, 97--109.
[19]
Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 1999. Measuring index quality using random walks on the Web. In Proceedings of the 8th International Conference on the World Wide Web (WWW). 213--225.
[20]
Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 2000. On near-uniform URL sampling. In Proceedings of the 9th International Conference on the World Wide Web (WWW). 295--308.
[21]
Hesterberg, T. C. 1988. Advances in importance sampling. Ph.D. dissertation, Stanford University.
[22]
Lawrence, S. and Giles, C. L. 1998. Searching the World Wide Web. Science 5360, 280, 98.
[23]
Lawrence, S. and Giles, C. L. 1999. Accessibility of information on the Web. Nature 400, 107--109.
[24]
Lee, S.-M. and Chao, A. 1994. Estimating population size via sample coverage for closed capture-recapture models. Biometrics 50, 1, 88--97.
[25]
Liu, J. S. 1996. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Stat. Comput. 6, 113--119.
[26]
Liu, J. S. 2001. Monte Carlo Strategies in Scientific Computing. Springer.
[27]
Marshall, A. W. 1956. The use of multi-stage sampling schemes in Monte Carlo computations. In Proceedings of the Symposium on Monte Carlo Methods. 123--140.
[28]
McCurley, K. 2007. Income inequality in the attention economy. http://mccurley.org/papers/effective.
[29]
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087--1091.
[30]
Rusmevichientong, P., Pennock, D., Lawrence, S., and Giles, C. L. 2001. Methods for sampling pages uniformly from the World Wide Web. In Proceedings of the AAAI Symposium on Using Uncertainty within Computation.
[31]
Siegmund, D. 1985. Sequential Analysis - Tests and Confidence Intervals. Springer-Verlag.
[32]
Stuart, A. and Ord, J. K. 1994. Kendall’s Advanced Theory of Statistics. Vol.1: Distribution Theory. Hodder Arnold, London.
[33]
von Neumann, J. 1963. Various techniques used in connection with random digits. In John von Neumann, Collected Works. Vol. V. Oxford.

Cited By

View all
  • (2020)Interactive Proofs for Social GraphsAdvances in Cryptology – CRYPTO 202010.1007/978-3-030-56877-1_20(574-601)Online publication date: 17-Aug-2020
  • (2017)Semantic Analysis Based Approach for Relevant Text Extraction Using OntologyInternational Journal of Information Retrieval Research10.4018/IJIRR.20171001027:4(19-36)Online publication date: 1-Oct-2017
  • (2017)Individual Query Cardinality Estimation using Multiple Query Combinations on a Search Engine's Corpus2017 International Conference on Computer and Applications (ICCA)10.1109/COMAPP.2017.8079753(312-316)Online publication date: Sep-2017
  • Show More Cited By

Index Terms

  1. Efficient Search Engine Measurements

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on the Web
    ACM Transactions on the Web  Volume 5, Issue 4
    October 2011
    154 pages
    ISSN:1559-1131
    EISSN:1559-114X
    DOI:10.1145/2019643
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 October 2011
    Accepted: 01 December 2010
    Revised: 01 July 2010
    Received: 01 August 2009
    Published in TWEB Volume 5, Issue 4

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Corpus size estimation
    2. evaluation
    3. sampling
    4. search engines

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)10
    • Downloads (Last 6 weeks)2
    Reflects downloads up to 09 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2020)Interactive Proofs for Social GraphsAdvances in Cryptology – CRYPTO 202010.1007/978-3-030-56877-1_20(574-601)Online publication date: 17-Aug-2020
    • (2017)Semantic Analysis Based Approach for Relevant Text Extraction Using OntologyInternational Journal of Information Retrieval Research10.4018/IJIRR.20171001027:4(19-36)Online publication date: 1-Oct-2017
    • (2017)Individual Query Cardinality Estimation using Multiple Query Combinations on a Search Engine's Corpus2017 International Conference on Computer and Applications (ICCA)10.1109/COMAPP.2017.8079753(312-316)Online publication date: Sep-2017
    • (2016)Estimating search engine index size variabilityScientometrics10.1007/s11192-016-1863-z107:2(839-856)Online publication date: 1-May-2016
    • (2015)Answering enumeration queries with the crowdCommunications of the ACM10.1145/284564459:1(118-127)Online publication date: 21-Dec-2015
    • (2015)Ranking Deep Web Text Collections for Scalable Information ExtractionProceedings of the 24th ACM International on Conference on Information and Knowledge Management10.1145/2806416.2806581(153-162)Online publication date: 17-Oct-2015
    • (2015)Estimating Clustering Coefficients and Size of Social Networks via Random WalkACM Transactions on the Web10.1145/27903049:4(1-20)Online publication date: 28-Sep-2015
    • (2015)Crowdsourcing Enumeration Queries: Estimators and InterfacesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2014.233985727:7(1796-1809)Online publication date: 1-Jul-2015
    • (2015)On random walk based graph sampling2015 IEEE 31st International Conference on Data Engineering10.1109/ICDE.2015.7113345(927-938)Online publication date: Apr-2015
    • (2014)On estimating the average degreeProceedings of the 23rd international conference on World wide web10.1145/2566486.2568019(795-806)Online publication date: 7-Apr-2014
    • Show More Cited By

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media