article

Searching the Web

Authors:

Arvind Arasu,

Junghoo Cho,

Hector Garcia-Molina,

Andreas Paepcke,

Sriram RaghavanAuthors Info & Claims

ACM Transactions on Internet Technology (TOIT), Volume 1, Issue 1

Pages 2 - 43

https://doi.org/10.1145/383034.383035

Published: 01 August 2001 Publication History

Get Access

Abstract

We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts and the results of several performance analyses we conducted to compare different designs.

References

[1]

AHO, A., HOPCROFT, J., AND ULLMAN, J. 1983. Data Structures and Algorithms. Addison-Wesley, Reading, MA.]]

Digital Library

Google Scholar

[2]

ALBERT, R., BARABASI, A.-L., AND JEONG, H. 1999. Diameter of the World Wide Web. Nature 401, 6749 (Sept.).]]

Crossref

Google Scholar

[3]

AMENTO, B., TERVEEN, L., AND HILL, W. 2000. Does authority mean quality? Predicting expert quality ratings of web documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY.]]

Digital Library

Google Scholar

[4]

ANH,V.N.AND MOFFAT, A. 1998. Compressed inverted files with reduced decoding overheads. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98, Melbourne, Australia, Aug. 24-28), W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, Chairs. ACM Press, New York, NY, 290-297.]]

Digital Library

Google Scholar

[5]

BAR-YOSSEF, Z., BERG, A., CHIEN, S., AND WEITZ, J. F. D. 2000. Approximating aggregate queries about web pages via random walks. In Proceedings of the 26th International Conference on Very Large Data Bases.]]

Digital Library

Google Scholar

[6]

BARABASI, A.-L. AND ALBERT, R. 1999. Emergence of scaling in random networks. Science 286, 5439 (Oct.), 509-512.]]

Crossref

Google Scholar

[7]

BHARAT,K.AND BRODER, A. 1999. Mirror, mirror on the web: A study of host pairs with replicated content. In Proceedings of the Eighth International Conference on The World-Wide Web.]]

Digital Library

Google Scholar

[8]

BHARAT, K., BRODER, A., HENZINGER, M., KUMAR, P., AND VENKATASUBRAMANIAN, S. 1998. The connectivity server: fast access to linkage information on the Web. Comput. Netw. ISDN Syst. 30, 1-7, 469-477.]]

Digital Library

Google Scholar

[9]

BRIN, S. 1998. Extracting patterns and relations from the world wide web. In Proceedings of the Sixth International Conference on Extending Database Technology (Valencia, Spain, Mar.), H. -J. Schek, F. Saltor, I. Ramos, and G. Alonso, Eds.]]

Digital Library

Google Scholar

[10]

BRIN,S.AND PAGE, L. 1998. The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst. 30, 1-7, 107-117.]]

Digital Library

Google Scholar

[11]

BRODER, A., KUMAR, R., MAGHOUL, F., RAGHAVAN, P., RAJAGOPALAN, S., STATA, R., TOMKINS, A., AND WIENER, J. 2000. Graph structure in the web: experiments and models. In Proceedings of the Ninth International Conference on The World Wide Web.]]

Digital Library

Google Scholar

[12]

CHAKRABARTI, S., DOM, B., GIBSON, D., KUMAR,S.R.,RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. 1998a. Spectral filtering for resource discovery. In Proceedings of the ACM SIGIR Workshop on Hypertext Information Retrieval on the Web (Melbourne, Australia). ACM Press, New York, NY.]]

Google Scholar

[13]

CHAKRABARTI, S., DOM, B., AND INDYK, P. 1998b. Enhanced hypertext categorization using hyperlinks. SIGMOD Rec. 27, 2, 307-318.]]

Digital Library

Google Scholar

[14]

CHAKRABARTI, S., DOM, B., RAGHAVAN, P., RAJAGOPALAN, S., GIBSON, D., AND KLEINBERG,J. 1998c. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the Seventh International Conference on The World Wide Web (WWW7, Brisbane, Australia, Apr. 14-18), P. H. Enslow and A. Ellis, Eds. Elsevier Sci. Pub. B. V., Amsterdam, The Netherlands, 65-74.]]

Digital Library

Google Scholar

[15]

CHAKRABARTI,S.AND MUTHUKRISHNAN, S. 1996. Resource scheduling for parallel database and scientific applications. In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '96, Padua, Italy, June 24-26), G. E. Blelloch, Chair. ACM Press, New York, NY, 329-335.]]

Digital Library

Google Scholar

[16]

CHAKRABARTI, S., VAN DEN BERG, M., AND DOM, B. 1999. Focused crawling: A new approach to topic-specific web resource discovery. In Proceedings of the Eighth International Conference on The World-Wide Web.]]

Digital Library

Google Scholar

[17]

CHO,J.AND GARCIA-MOLINA, H. 2000a. Estimating frequency of change. Submitted for publication.]]

Google Scholar

[18]

CHO,J.AND GARCIA-MOLINA, H. 2000b. The evolution of the web and implications for an incremental crawler. In Proceedings of the 26th International Conference on Very Large Data Bases.]]

Digital Library

Google Scholar

[19]

CHO,J.AND GARCIA-MOLINA, H. 2000c. Synchronizing a database to improve freshness. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD '2000, Dallas, TX, May). ACM Press, New York, NY.]]

Digital Library

Google Scholar

[20]

CHO, J., GARCIA-MOLINA, H., AND PAGE, L. 1998. Efficient crawling through URL ordering. Comput. Netw. ISDN Syst. 30, 1-7, 161-172.]]

Digital Library

Google Scholar

[21]

COFFMAN,E.J.,LIU, Z., AND WEBER, R. R. 1997. Optimal robot scheduling for web search engines. Tech. Rep. INRIA, Rennes, France.]]

Google Scholar

[22]

DEAN,J.AND HENZINGER, M. R. 1999. Finding related pages in the world wide web. In Proceedings of the Eighth International Conference on The World-Wide Web.]]

Digital Library

Google Scholar

[23]

DILIGENTI, M., COETZEE,F.M.,LAWRENCE, S., GILES,C.L.,AND GORI, M. 2000. Focused crawling using context graphs. In Proceedings of the 26th International Conference on Very Large Data Bases.]]

Digital Library

Google Scholar

[24]

DOUGLIS, F., FELDMANN, A., AND KRISHNAMURTHY, B. 1999. Rate of change and other metrics: a live study of the world wide web. In Proceedings of the USENIX Symposium on Internetworking Technologies and Systems. USENIX Assoc., Berkeley, CA.]]

Digital Library

Google Scholar

[25]

DUMAIS,S.T.,FURNAS,G.W.,LANDAUER,T.K.,DEERWESTER, S., AND HARSHMAN, R. 1988. Using latent semantic analysis to improve access to textual information. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '88, Washington, DC, May 15-19), J. J. O'Hare, Ed. ACM Press, New York, NY, 281-285.]]

Digital Library

Google Scholar

[26]

EGGHE,L.AND ROUSSEAU, R. 1990. Introduction to Informetrics. Elsevier Science Inc., New York, NY.]]

Google Scholar

[27]

FALOUTSOS, C. 1985. Access methods for text. ACM Comput. Surv. 17, 1 (Mar.), 49-74.]]

Digital Library

Google Scholar

[28]

FALOUTSOS,C.AND CHRISTODOULAKIS, S. 1984. Signature files: An access method for documents and its analytical performance evaluation. ACM Trans. Inf. Syst. 2, 4 (Oct.), 267-288.]]

Digital Library

Google Scholar

[29]

GARFIELD, E. 1972. Citation analysis as a tool in journal evaluation. Science 178, 471-479.]]

Crossref

Google Scholar

[30]

GIBSON, D., KLEINBERG, J., AND RAGHAVAN, P. 1998. Inferring Web communities from link topology. In Proceedings of the 9th ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space-Structure in Hypermedia Systems (HYPERTEXT '98, Pittsburgh, PA, June 20-24), R. Akscyn, Chair. ACM Press, New York, NY, 225-234.]]

Digital Library

Google Scholar

[31]

GOLUB,G.AND VAN LOAN, C. F. 1989. Matrix Computations. 2nd ed. Johns Hopkins University Press, Baltimore, MD.]]

Digital Library

Google Scholar

[32]

HAVELIWALA, T. 1999. Efficient computation of pagerank. Tech. Rep. 1999-31. Computer Systems Laboratory, Stanford University, Stanford, CA. http://dbpubs.stanford.edu/ pub/1999-31.]]

Google Scholar

[33]

HAWKING, D., CRASWELL, N., AND THISTLEWAITE, P. 1998. Overview of TREC-7 very large collection track. In Proceedings of the 7th Conference on Text Retrieval (TREC-7).]]

Google Scholar

[34]

HIRAI, J., RAGHAVAN, S., GARCIA-MOLINA, H., AND PAEPCKE, A. 2000. Webbase: A repository of web pages. In Proceedings of the Ninth International Conference on The World Wide Web. 277-293.]]

Digital Library

Google Scholar

[35]

HUBERMAN,B.A.AND ADAMIC, L. A. 1999. Growth dynamics of the world wide web. Nature 401, 6749 (Sept.).]]

Crossref

Google Scholar

[36]

KLEINBERG, J. 1999. Authoritative sources in a hyperlinked environment. J. ACM 46,6 (Nov.).]]

Digital Library

Google Scholar

[37]

KOSTER, M. 1995. Robots in the web: trick or treat? ConneXions 9, 4 (Apr.).]]

Google Scholar

[38]

KUMAR, R., RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. 1999. Trawling the web for emerging cyber-communities. In Proceedings of the Eighth International Conference on The World-Wide Web.]]

Digital Library

Google Scholar

[39]

LAWRENCE,S.AND GILES, C. 1998. Searching the world wide web. Science 280, 98-100.]]

Crossref

Google Scholar

[40]

LAWRENCE,S.AND GILES, C. 1999. Accessibility of information on the web. Nature 400, 107-109.]]

Crossref

Google Scholar

[41]

MANBER,U.AND MYERS, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 5 (Oct.), 935-948.]]

Digital Library

Google Scholar

[42]

MACLEOD,I.A.,MARTIN, P., AND NORDIN, B. 1986. A design of a distributed full text retrieval system. In Proceedings of 1986 ACM Conference on Research and Development in Informa-tion Retrieval (SIGIR '86, Palazzo dei Congressi, Pisa, Italy, Sept. 8-10), F. Rabitti, Ed. ACM Press, New York, NY, 131-137.]]

Digital Library

Google Scholar

[43]

MELNIK, S., RAGHAVAN, S., YANG, B., AND GARCIA-MOLINA, H. 2000. Building a distributed full-text index for the web. Tech. Rep. SIDL-WP-2000-0140, Stanford Digital Library Project. Computer Systems Laboratory, Stanford University, Stanford, CA. http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-2000-0140.]]

Google Scholar

[44]

MELNIK, S., RAGHAVAN, S., YANG, B., AND GARCIA-MOLINA, H. 2001. Building a distributed full-text index for the web. In Proceedings of the Tenth International Conference on The World-Wide Web.]]

Digital Library

Google Scholar

[45]

MOFFAT,A.AND BELL, T. A. H. 1995. In situ generation of compressed inverted files. J. Am. Soc. Inf. Sci. 46, 7 (Aug.), 537-550.]]

Digital Library

Google Scholar

[46]

MOTWANI,R.AND RAGHAVAN, P. 1995. Randomized Algorithms. Cambridge University Press, New York, NY.]]

Digital Library

Google Scholar

[47]

PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. 1998. The pagerank citation ranking: Bringing order to the web. Tech. Rep. Computer Systems Laboratory, Stanford University, Stanford, CA.]]

Google Scholar

[48]

PINSKI,G.AND NARIN, F. 1976. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Inf. Process. Manage. 12.]]

Google Scholar

[49]

PITKOW,J.AND PIROLLI, P. 1997. Life, death, and lawfulness on the electronic frontier. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '97, Atlanta, GA, Mar. 22-27), S. Pemberton, Ed. ACM Press, New York, NY, 383-390.]]

Digital Library

Google Scholar

[50]

RIBEIRO-NETO,B.A.AND BARBOSA, R. A. 1998. Query performance for tightly coupled distributed digital libraries. In Proceedings of the Third ACM Conference on Digital Libraries (DL '98, Pittsburgh, PA, June 23-26), I. Witten, R. Akscyn, and F. M. Shipman, Eds. ACM Press, New York, NY, 182-190.]]

Digital Library

Google Scholar

[51]

ROBOTS EXCLUSION PROTOCOL. 2000. Robots Exclusion Protocol. http://info.webcrawler.com/ mak/projects/robots/exclusion.html.]]

Google Scholar

[52]

SALTON, G., ED. 1988. Automatic Text Processing. Addison-Wesley Series in Computer Science. Addison-Wesley Longman Publ. Co., Inc., Reading, MA.]]

Digital Library

Google Scholar

[53]

TOMASIC,A.AND GARCIA-MOLINA, H. 1993. Performance of inverted indices in distributed text document retrieval systems. In Proceedings of the 2nd International Conference on Parallel and Distributed Systems (Dec.). 8-17.]]

Digital Library

Google Scholar

[54]

VILES,C.L.AND FRENCH, J. C. 1995. Dissemination of collection wide information in a distributed information retrieval system. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95, Seattle, WA, July 9-13), E. A. Fox, P. Ingwersen, and R. Fidel, Eds. ACM Press, New York, NY, 12-20.]]

Digital Library

Google Scholar

[55]

WILLS,C.E.AND MIKHAILOV, M. 1999. Towards a better understanding of web resources and server responses for improved caching. In Proceedings of the Eighth International Conference on The World-Wide Web.]]

Digital Library

Google Scholar

[56]

WITTEN, I., MOFFAT, A., AND BELL, T. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. 2nd ed. Morgan Kaufmann Publishers Inc., San Francisco, CA.]]

Digital Library

Google Scholar

Cited By

View all

Bragilovski MStern RSturm A(2024)How do I find reusable models?Software and Systems Modeling (SoSyM)10.1007/s10270-023-01103-723:1(85-102)Online publication date: 1-Feb-2024
https://dl.acm.org/doi/10.1007/s10270-023-01103-7
Gaur ASingh VKumar SKaur M(2024)Web Miner: Automated Web Crawling and Database System with Puppeteer and Node.jsSmart Systems: Innovations in Computing10.1007/978-981-97-3690-4_12(149-159)Online publication date: 30-Sep-2024
https://doi.org/10.1007/978-981-97-3690-4_12
ALANOĞLU ZAKÇAYOL M(2023)Web Tarayıcıları için Etkili Tohum URL Seçimi ve Kapsam Genişletme AlgoritmasıEffective Seed URL Selection and Scope Extension Algorithm for Web CrawlerInternational Journal of Advances in Engineering and Pure Sciences10.7240/jeps.117419335:1(27-38)Online publication date: 30-Mar-2023
https://doi.org/10.7240/jeps.1174193
Show More Cited By

Index Terms

Searching the Web
1. Information systems

Recommendations

Content and link-structure perspective of ranking webpages: A review
Abstract
The delivery of ranked relevant results is probably the most important factor in making a web search engine acceptable to its users. This inspiration has led the search engine engineers and researchers to conceive ranking algorithms ...
A Googol of Information about Google

Timothy P. Chartier reviews Google's PageRank and Beyond: The Science of Search Engine Rankings by Amy Langville and Carl Meyer.
Querying the web graph
SPIRE'10: Proceedings of the 17th international conference on String processing and information retrieval

This paper focuses on using hyperlinks in the ranking of web search results. We give a brief overview of the vast body of work in the area; we provide a quantitative comparison of the different features; we sketch how link-based ranking features can be ...

Reviews

Reviewer: Jun Lin

Are you curious about the way Web search engines provide users with a list of URLs after just a few keywords are entered__?__ This article gives an overview on the core engine that makes this possible. The authors start by discussing the challenges of the Internet and the architecture of Web search engines. The major challenges—the size, rapid change, lack of coherence, and interlinked nature of the Internet—introduce the rest of the article. Chapter Two discusses the discovery of information from the Internet, with a detailed explanation of the challenges of crawler models in respect to page selection and page refresh. Chapter Three introduces the storage and distributed Web repository. In chapters Four and Five, the authors present the most popular indexing architectures used in Web search services and ranking and link analysis. The article covers in detail the fundamental components of Web search engines and the most common design and implementation. For each component discussed, a conclusion section is provided to summarize the concepts and give further alert on challenges to be addressed in the future. Theoretical analysis and arguments are supported by the authors’ own experiments in which data and statistics are provided. The article is most suitable for readers who are interested in the design of Web search engines. In addition, readers who want to submit their Web pages to search engines can also be benefit from reading this article to increase the matching-rate and ranking of their Web pages among the search engines. For other readers who search the Internet to find certain information, this article is also a good source to understand the technology involved, though it does not contain direct guidelines towards the most popular Web search engines on the market. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Internet Technology

ACM Transactions on Internet Technology Volume 1, Issue 1

Aug. 2001

140 pages

ISSN:1533-5399

EISSN:1557-6051

DOI:10.1145/383034

Editor:
Won Kim

Issue’s Table of Contents

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 August 2001

Published in TOIT Volume 1, Issue 1

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

395
Total Citations
View Citations
9,606
Total Downloads

Downloads (Last 12 months)36
Downloads (Last 6 weeks)4

Reflects downloads up to 12 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

View all

Bragilovski MStern RSturm A(2024)How do I find reusable models?Software and Systems Modeling (SoSyM)10.1007/s10270-023-01103-723:1(85-102)Online publication date: 1-Feb-2024
https://dl.acm.org/doi/10.1007/s10270-023-01103-7
Gaur ASingh VKumar SKaur M(2024)Web Miner: Automated Web Crawling and Database System with Puppeteer and Node.jsSmart Systems: Innovations in Computing10.1007/978-981-97-3690-4_12(149-159)Online publication date: 30-Sep-2024
https://doi.org/10.1007/978-981-97-3690-4_12
ALANOĞLU ZAKÇAYOL M(2023)Web Tarayıcıları için Etkili Tohum URL Seçimi ve Kapsam Genişletme AlgoritmasıEffective Seed URL Selection and Scope Extension Algorithm for Web CrawlerInternational Journal of Advances in Engineering and Pure Sciences10.7240/jeps.117419335:1(27-38)Online publication date: 30-Mar-2023
https://doi.org/10.7240/jeps.1174193
Gao YFeng ZWang XSong MWang XWang XChen C(2023)Reinforcement learning based web crawler detection for diversity and dynamicsNeurocomputing10.1016/j.neucom.2022.11.059520(115-128)Online publication date: Feb-2023
https://doi.org/10.1016/j.neucom.2022.11.059
Barnova KMikolasova MKahankova RJaros RKawala-Sterniuk ASnasel VMirjalili SPelc MMartinek R(2023)Implementation of artificial intelligence and machine learning-based methods in brain–computer interactionComputers in Biology and Medicine10.1016/j.compbiomed.2023.107135163:COnline publication date: 1-Sep-2023
https://dl.acm.org/doi/10.1016/j.compbiomed.2023.107135
B. BRani KNeog A(2022)Finding Multidimensional Constraint Reachable Paths for Attributed GraphsICST Transactions on Scalable Information Systems10.4108/eetsis.v9i4.2581(e2)Online publication date: 22-Aug-2022
https://doi.org/10.4108/eetsis.v9i4.2581
Qiang RYang J(2022)Influential Spreader Identification in Complex Networks Based on Network Connectivity and EfficiencyWireless Communications & Mobile Computing10.1155/2022/78963802022Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.1155/2022/7896380
Hörberg TLarsson MOlofsson J(2022)The Semantic Organization of the English Odor VocabularyCognitive Science10.1111/cogs.1320546:11Online publication date: 5-Nov-2022
https://doi.org/10.1111/cogs.13205
Assefi MBahrami MArora STaha TArabnia HRasheed KChen W(2022)An Intelligent Data-Centric Web Crawler Service for API Corpus Construction at Scale2022 IEEE International Conference on Web Services (ICWS)10.1109/ICWS55610.2022.00064(385-390)Online publication date: Jul-2022
https://doi.org/10.1109/ICWS55610.2022.00064
Makkar A(2022)SecureEngine: Spammer classification in cyber defence for leveraging green computing in Sustainable citySustainable Cities and Society10.1016/j.scs.2021.10365879(103658)Online publication date: Apr-2022
https://doi.org/10.1016/j.scs.2021.103658
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Cited By

Index Terms

Recommendations

Content and link-structure perspective of ranking webpages: A review

A Googol of Information about Google

Querying the web graph

Reviews

Access critical reviews of Computing literature here