More Web Proxy on the site http://driver.im/

research-article

Efficient Search Engine Measurements

Authors:

Ziv Bar-Yossef,

Maxim GurevichAuthors Info & Claims

ACM Transactions on the Web (TWEB), Volume 5, Issue 4

Article No.: 18, Pages 1 - 48

https://doi.org/10.1145/2019643.2019645

Published: 01 October 2011 Publication History

Abstract

We address the problem of externally measuring aggregate functions over documents indexed by search engines, like corpus size, index freshness, and density of duplicates in the corpus. State of the art estimators for such quantities [Bar-Yossef and Gurevich 2008b; Broder et al. 2006] are biased due to inaccurate approximation of the so called “document degrees”. In addition, the estimators in Bar-Yossef and Gurevich [2008b] are quite costly, due to their reliance on rejection sampling.

We present new estimators that are able to overcome the bias introduced by approximate degrees. Our estimators are based on a careful implementation of an approximate importance sampling procedure. Comprehensive theoretical and empirical analysis of the estimators demonstrates that they have essentially no bias even in situations where document degrees are poorly approximated.

By avoiding the costly rejection sampling approach, our new importance sampling estimators are significantly more efficient than the estimators proposed in Bar-Yossef and Gurevich [2008b]. Furthermore, building on an idea from Broder et al. [2006], we discuss Rao-Blackwellization as a generic method for reducing variance in search engine estimators. We show that Rao-Blackwellizing our estimators results in performance improvements, without compromising accuracy.

References

[1]

Anagnostopoulos, A., Broder, A. Z., and Carmel, D. 2006. Sampling search-engine results. World Wide Web 9, 4, 397--429.

Digital Library

[2]

Bar-Yossef, Z., Berg, A., Chien, S., Fakcharoenphol, J., and Weitz, D. 2000. Approximating aggregate queries about Web pages via random walks. In Proceedings of the 26th Annual Conference on Very Large Databases. 535--544.

Digital Library

[3]

Bar-Yossef, Z. and Gurevich, M. 2007. Efficient search engine measurements. In Proceedings of the 16th International World Wide Web Conference (WWW). ACM, New York, 401--410.

Digital Library

[4]

Bar-Yossef, Z. and Gurevich, M. 2008a. Mining search engine query logs via suggestion sampling. Proc. VLDB 1, 1, 54--65.

Digital Library

[5]

Bar-Yossef, Z. and Gurevich, M. 2008b. Random sampling from a search engine’s index. J. ACM 55, 5, 1--74.

Digital Library

[6]

Bar-Yossef, Z. and Gurevich, M. 2009. Estimating the impressionrank of web pages. In Proceedings of the 18th International Conference on World Wide Web (WWW’09). ACM, New York, 41--50.

Digital Library

[7]

Baykan, E., Henzinger, M. R., Keller, S. F., Castelberg, S. D., and Kinzler, M. 2009. A comparison of techniques for sampling web pages. In Proceedings of the Symposium on Theoretical Aspects of Computer Science (STACS). 13--30.

[8]

Bharat, K. and Broder, A. 1998. A technique for measuring the relative size and overlap of public Web search engines. In Proceedings of the 7th International Conference on World Wide Web. ACM, New York, 379--388.

Digital Library

[9]

Bradlow, E. T. and Schmittlein, D. C. 2000. The little engines that could: Modeling the performance of World Wide Web search engines. Market. Sci. 19, 43--62.

Digital Library

[10]

Broder, A., Fontoura, M., Josifovski, V., Kumar, R., Motwani, R., Nabar, S., Panigrahy, R., Tomkins, A., and Xu, Y. 2006. Estimating corpus size via queries. In Proceedings of the 15th International Conference on Information and Knowledge Management (CIKM). ACM, New York, 594--603.

Digital Library

[11]

Casella, G. and Robert, C. P. 1996. Rao-Blackwellisation of sampling schemes. Biometrika 83, 1, 81--94.

[12]

Cheney, M. and Perry, M. 2005. A comparison of the size of the Yahoo! and Google indices. http://www.mostpopularsites.net/MPS_News/Search_Engine_Facts/A_Comparison_of_the_Size_of_the_Yahoo!_and_Google_Indices.

[13]

Dasgupta, A., Das, G., and Mannila, H. 2007. A random walk approach to sampling hidden databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’07). ACM, New York, 629--640.

Digital Library

[14]

dmoz. The open directory project. http://dmoz.org.

[15]

Dobra, A. and Fienberg, S. E. 2004. How large is the worldwide web? In Web Dynamics, 23--44.

[16]

Goldreich, O. 1997. A sample of samplers - A computational perspective on sampling (survey). Electron. Colloq. Computat. Complex. (ECCC) 4, 20.

[17]

Gulli, A. and Signorini, A. 2005. The indexable Web is more than 11.5 billion pages. In Proceedings of the 14th International Conference on the World Wide Web (WWW). 902--903.

Digital Library

[18]

Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 1, 97--109.

[19]

Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 1999. Measuring index quality using random walks on the Web. In Proceedings of the 8th International Conference on the World Wide Web (WWW). 213--225.

Digital Library

[20]

Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 2000. On near-uniform URL sampling. In Proceedings of the 9th International Conference on the World Wide Web (WWW). 295--308.

Digital Library

[21]

Hesterberg, T. C. 1988. Advances in importance sampling. Ph.D. dissertation, Stanford University.

[22]

Lawrence, S. and Giles, C. L. 1998. Searching the World Wide Web. Science 5360, 280, 98.

[23]

Lawrence, S. and Giles, C. L. 1999. Accessibility of information on the Web. Nature 400, 107--109.

[24]

Lee, S.-M. and Chao, A. 1994. Estimating population size via sample coverage for closed capture-recapture models. Biometrics 50, 1, 88--97.

[25]

Liu, J. S. 1996. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Stat. Comput. 6, 113--119.

[26]

Liu, J. S. 2001. Monte Carlo Strategies in Scientific Computing. Springer.

Digital Library

[27]

Marshall, A. W. 1956. The use of multi-stage sampling schemes in Monte Carlo computations. In Proceedings of the Symposium on Monte Carlo Methods. 123--140.

[28]

McCurley, K. 2007. Income inequality in the attention economy. http://mccurley.org/papers/effective.

[29]

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087--1091.

[30]

Rusmevichientong, P., Pennock, D., Lawrence, S., and Giles, C. L. 2001. Methods for sampling pages uniformly from the World Wide Web. In Proceedings of the AAAI Symposium on Using Uncertainty within Computation.

[31]

Siegmund, D. 1985. Sequential Analysis - Tests and Confidence Intervals. Springer-Verlag.

[32]

Stuart, A. and Ord, J. K. 1994. Kendall’s Advanced Theory of Statistics. Vol.1: Distribution Theory. Hodder Arnold, London.

[33]

von Neumann, J. 1963. Various techniques used in connection with random digits. In John von Neumann, Collected Works. Vol. V. Oxford.

Cited By

Katzir LShikhelman CYogev E(2020)Interactive Proofs for Social GraphsAdvances in Cryptology – CRYPTO 202010.1007/978-3-030-56877-1_20(574-601)Online publication date: 17-Aug-2020
https://dl.acm.org/doi/10.1007/978-3-030-56877-1_20
Chahal PSingh MKumar S(2017)Semantic Analysis Based Approach for Relevant Text Extraction Using OntologyInternational Journal of Information Retrieval Research10.4018/IJIRR.20171001027:4(19-36)Online publication date: 1-Oct-2017
https://dl.acm.org/doi/10.4018/IJIRR.2017100102
Islam FHassaine AJaoua ADas GZhang N(2017)Individual Query Cardinality Estimation using Multiple Query Combinations on a Search Engine's Corpus2017 International Conference on Computer and Applications (ICCA)10.1109/COMAPP.2017.8079753(312-316)Online publication date: Sep-2017
https://doi.org/10.1109/COMAPP.2017.8079753
Show More Cited By

Index Terms

Efficient Search Engine Measurements
1. Information systems
  1. Information retrieval

Recommendations

Efficient search engine measurements
WWW '07: Proceedings of the 16th international conference on World Wide Web

We address the problem of measuring global quality met-rics of search engines, like corpus size, index freshness, anddensity of duplicates in the corpus. The recently proposedestimators for such metrics [2, 6] suffer from significant biasand/or poor ...
Random sampling from a search engine's index

We revisit a problem introduced by Bharat and Broder almost a decade ago: How to sample random pages from the corpus of documents indexed by a search engine, using only the search engine's public interface? Such a primitive is particularly useful in ...
What users see - Structures in search engine results pages

This paper investigates the composition of search engine results pages. We define what elements the most popular web search engines use on their results pages (e.g., organic results, advertisements, shortcuts) and to which degree they are used for ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on the Web

ACM Transactions on the Web Volume 5, Issue 4

October 2011

154 pages

ISSN:1559-1131

EISSN:1559-114X

DOI:10.1145/2019643

Issue’s Table of Contents

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 October 2011

Accepted: 01 December 2010

Revised: 01 July 2010

Received: 01 August 2009

Published in TWEB Volume 5, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

15
Total Citations
View Citations
896
Total Downloads

Downloads (Last 12 months)10
Downloads (Last 6 weeks)2

Reflects downloads up to 09 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Katzir LShikhelman CYogev E(2020)Interactive Proofs for Social GraphsAdvances in Cryptology – CRYPTO 202010.1007/978-3-030-56877-1_20(574-601)Online publication date: 17-Aug-2020
https://dl.acm.org/doi/10.1007/978-3-030-56877-1_20
Chahal PSingh MKumar S(2017)Semantic Analysis Based Approach for Relevant Text Extraction Using OntologyInternational Journal of Information Retrieval Research10.4018/IJIRR.20171001027:4(19-36)Online publication date: 1-Oct-2017
https://dl.acm.org/doi/10.4018/IJIRR.2017100102
Islam FHassaine AJaoua ADas GZhang N(2017)Individual Query Cardinality Estimation using Multiple Query Combinations on a Search Engine's Corpus2017 International Conference on Computer and Applications (ICCA)10.1109/COMAPP.2017.8079753(312-316)Online publication date: Sep-2017
https://doi.org/10.1109/COMAPP.2017.8079753
Bosch ABogers TKunder M(2016)Estimating search engine index size variabilityScientometrics10.1007/s11192-016-1863-z107:2(839-856)Online publication date: 1-May-2016
https://dl.acm.org/doi/10.1007/s11192-016-1863-z
Trushkowsky BKraska TFranklin MSarkar P(2015)Answering enumeration queries with the crowdCommunications of the ACM10.1145/284564459:1(118-127)Online publication date: 21-Dec-2015
https://dl.acm.org/doi/10.1145/2845644
Barrio PGravano LDevelder CBailey JMoffat AAggarwal Cde Rijke MKumar RMurdock VSellis TYu J(2015)Ranking Deep Web Text Collections for Scalable Information ExtractionProceedings of the 24th ACM International on Conference on Information and Knowledge Management10.1145/2806416.2806581(153-162)Online publication date: 17-Oct-2015
https://dl.acm.org/doi/10.1145/2806416.2806581
Katzir LHardiman S(2015)Estimating Clustering Coefficients and Size of Social Networks via Random WalkACM Transactions on the Web10.1145/27903049:4(1-20)Online publication date: 28-Sep-2015
https://dl.acm.org/doi/10.1145/2790304
Trushkowsky BKraska TFranklin MSarkar PRamachandran V(2015)Crowdsourcing Enumeration Queries: Estimators and InterfacesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2014.233985727:7(1796-1809)Online publication date: 1-Jul-2015
https://doi.org/10.1109/TKDE.2014.2339857
Li RYu JQin LMao RJin T(2015)On random walk based graph sampling2015 IEEE 31st International Conference on Data Engineering10.1109/ICDE.2015.7113345(927-938)Online publication date: Apr-2015
https://doi.org/10.1109/ICDE.2015.7113345
Dasgupta AKumar RSarlos TChung CBroder AShim KSuel T(2014)On estimating the average degreeProceedings of the 23rd international conference on World wide web10.1145/2566486.2568019(795-806)Online publication date: 7-Apr-2014
https://dl.acm.org/doi/10.1145/2566486.2568019
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents