poster

Document assignment in multi-site search engines

Published: 09 February 2011

Abstract

Assigning documents accurately to sites is critical for the performance of multi-site Web search engines. In such settings, sites crawl only documents they index and forward queries to obtain best-matching documents from other sites. Inaccurate assignments may lead to inefficiencies when crawling Web pages or processing user queries. In this work, we propose a machine-learned document assignment strategy that uses the locality of document views in search results to decide upon assignments. We evaluate the performance of our strategy using various document features extracted from a large Web collection. Our experimental setup uses query logs from a number of search front-ends spread across different geographic locations and uses these logs to learn the document access patterns. We compare our technique against baselines such as region- and language-based document assignment and observe that our technique achieves substantial performance improvements with respect to recall. With our technique, we are able to obtain a small query forwarding rate (0.04) requiring roughly 45% less replication of documents compared to replicating all documents across all sites.
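To make the quantities named in the abstract concrete, the following is a minimal sketch and not the authors' machine-learned strategy. It substitutes a plain majority-view heuristic for the learned classifier: each document goes to the site whose search results showed it most often, and it is also replicated to any other site whose share of its views reaches a threshold. The log format, the function names, the threshold `min_share`, and the exact definitions of forwarding rate and replication saving used here are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def assign_documents(views, min_share=0.3):
    """Place documents on sites using view locality (hypothetical heuristic).

    views: iterable of (site, doc) pairs, one per document view in search results.
    Returns {doc: set of sites that hold a copy}.
    """
    per_doc = defaultdict(Counter)
    for site, doc in views:
        per_doc[doc][site] += 1

    placement = {}
    for doc, counts in per_doc.items():
        total = sum(counts.values())
        # Master site: most views; ties broken alphabetically for determinism.
        master = max(sorted(counts), key=lambda s: counts[s])
        replicas = {s for s, c in counts.items() if c / total >= min_share}
        placement[doc] = {master} | replicas
    return placement


def forwarding_rate(queries, placement):
    """Assumed definition: fraction of queries with at least one result
    document that is not held at the site where the query was issued."""
    forwarded = sum(
        1
        for site, docs in queries
        if any(site not in placement.get(d, ()) for d in docs)
    )
    return forwarded / len(queries)


def replication_saving(placement, num_sites):
    """Assumed definition: 1 - (copies stored) / (copies under full replication)."""
    copies = sum(len(sites) for sites in placement.values())
    return 1 - copies / (len(placement) * num_sites)


# Toy query log: (front-end site, documents shown in the results). Made-up data.
queries = [
    ("eu", ["d1", "d2"]),
    ("eu", ["d1"]),
    ("us", ["d3"]),
    ("us", ["d1", "d3"]),
    ("asia", ["d4"]),
]
views = [(site, doc) for site, docs in queries for doc in docs]

for share in (0.3, 0.5):
    placement = assign_documents(views, min_share=share)
    print(share, forwarding_rate(queries, placement),
          round(replication_saving(placement, num_sites=3), 3))
```

Raising `min_share` stores fewer copies but forwards more queries, which is the trade-off the abstract reports on. A realistic evaluation would learn placements from one portion of the logs and measure forwarding on a held-out portion; the toy example reuses the same log only for brevity.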

    Information & Contributors

    Information

    Published In

    WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
    February 2011
    870 pages
    ISBN: 9781450304931
    DOI: 10.1145/1935826

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. classification
    2. document replication
    3. multi-site web search engines

    Qualifiers

    • Poster

    Conference

    Acceptance Rates

    WSDM '11 Paper Acceptance Rate 83 of 372 submissions, 22%;
    Overall Acceptance Rate 498 of 2,863 submissions, 17%

    Citations

    Cited By

    • (2018) Measuring the Effectiveness of Selective Search Index Partitions without Supervision. Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval, pages 91-98. DOI: 10.1145/3234944.3234952. Online publication date: 10-Sep-2018
    • (2016) Efficient distributed selective search. Information Retrieval Journal, 20:3, pages 221-252. DOI: 10.1007/s10791-016-9290-6. Online publication date: 25-Nov-2016
    • (2015) Scalability Challenges in Web Search Engines. Synthesis Lectures on Information Concepts, Retrieval, and Services, 7:6, pages 1-138. DOI: 10.2200/S00662ED1V01Y201508ICR045. Online publication date: 29-Dec-2015
    • (2014) Improving the efficiency of multi-site web search engines. Proceedings of the 7th ACM international conference on Web search and data mining, pages 3-12. DOI: 10.1145/2556195.2556249. Online publication date: 24-Feb-2014
    • (2013) Document replication strategies for geographically distributed web search engines. Information Processing and Management: an International Journal, 49:1, pages 51-66. DOI: 10.1016/j.ipm.2012.01.002. Online publication date: 1-Jan-2013
    • (2012) Reactive index replication for distributed search engines. Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 831-840. DOI: 10.1145/2348283.2348394. Online publication date: 12-Aug-2012
    • (2011) Assigning documents to master sites in distributed search. Proceedings of the 20th ACM international conference on Information and knowledge management, pages 67-76. DOI: 10.1145/2063576.2063591. Online publication date: 24-Oct-2011
