[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
article

Searching the Web

Published: 01 August 2001 Publication History

Abstract

We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts and the results of several performance analyses we conducted to compare different designs.

References

[1]
AHO, A., HOPCROFT, J., AND ULLMAN, J. 1983. Data Structures and Algorithms. Addison-Wesley, Reading, MA.]]
[2]
ALBERT, R., BARABASI, A.-L., AND JEONG, H. 1999. Diameter of the World Wide Web. Nature 401, 6749 (Sept.).]]
[3]
AMENTO, B., TERVEEN, L., AND HILL, W. 2000. Does authority mean quality? Predicting expert quality ratings of web documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY.]]
[4]
ANH,V.N.AND MOFFAT, A. 1998. Compressed inverted files with reduced decoding overheads. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98, Melbourne, Australia, Aug. 24-28), W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, Chairs. ACM Press, New York, NY, 290-297.]]
[5]
BAR-YOSSEF, Z., BERG, A., CHIEN, S., AND WEITZ, J. F. D. 2000. Approximating aggregate queries about web pages via random walks. In Proceedings of the 26th International Conference on Very Large Data Bases.]]
[6]
BARABASI, A.-L. AND ALBERT, R. 1999. Emergence of scaling in random networks. Science 286, 5439 (Oct.), 509-512.]]
[7]
BHARAT,K.AND BRODER, A. 1999. Mirror, mirror on the web: A study of host pairs with replicated content. In Proceedings of the Eighth International Conference on The World-Wide Web.]]
[8]
BHARAT, K., BRODER, A., HENZINGER, M., KUMAR, P., AND VENKATASUBRAMANIAN, S. 1998. The connectivity server: fast access to linkage information on the Web. Comput. Netw. ISDN Syst. 30, 1-7, 469-477.]]
[9]
BRIN, S. 1998. Extracting patterns and relations from the world wide web. In Proceedings of the Sixth International Conference on Extending Database Technology (Valencia, Spain, Mar.), H. -J. Schek, F. Saltor, I. Ramos, and G. Alonso, Eds.]]
[10]
BRIN,S.AND PAGE, L. 1998. The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst. 30, 1-7, 107-117.]]
[11]
BRODER, A., KUMAR, R., MAGHOUL, F., RAGHAVAN, P., RAJAGOPALAN, S., STATA, R., TOMKINS, A., AND WIENER, J. 2000. Graph structure in the web: experiments and models. In Proceedings of the Ninth International Conference on The World Wide Web.]]
[12]
CHAKRABARTI, S., DOM, B., GIBSON, D., KUMAR,S.R.,RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. 1998a. Spectral filtering for resource discovery. In Proceedings of the ACM SIGIR Workshop on Hypertext Information Retrieval on the Web (Melbourne, Australia). ACM Press, New York, NY.]]
[13]
CHAKRABARTI, S., DOM, B., AND INDYK, P. 1998b. Enhanced hypertext categorization using hyperlinks. SIGMOD Rec. 27, 2, 307-318.]]
[14]
CHAKRABARTI, S., DOM, B., RAGHAVAN, P., RAJAGOPALAN, S., GIBSON, D., AND KLEINBERG,J. 1998c. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the Seventh International Conference on The World Wide Web (WWW7, Brisbane, Australia, Apr. 14-18), P. H. Enslow and A. Ellis, Eds. Elsevier Sci. Pub. B. V., Amsterdam, The Netherlands, 65-74.]]
[15]
CHAKRABARTI,S.AND MUTHUKRISHNAN, S. 1996. Resource scheduling for parallel database and scientific applications. In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '96, Padua, Italy, June 24-26), G. E. Blelloch, Chair. ACM Press, New York, NY, 329-335.]]
[16]
CHAKRABARTI, S., VAN DEN BERG, M., AND DOM, B. 1999. Focused crawling: A new approach to topic-specific web resource discovery. In Proceedings of the Eighth International Conference on The World-Wide Web.]]
[17]
CHO,J.AND GARCIA-MOLINA, H. 2000a. Estimating frequency of change. Submitted for publication.]]
[18]
CHO,J.AND GARCIA-MOLINA, H. 2000b. The evolution of the web and implications for an incremental crawler. In Proceedings of the 26th International Conference on Very Large Data Bases.]]
[19]
CHO,J.AND GARCIA-MOLINA, H. 2000c. Synchronizing a database to improve freshness. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD '2000, Dallas, TX, May). ACM Press, New York, NY.]]
[20]
CHO, J., GARCIA-MOLINA, H., AND PAGE, L. 1998. Efficient crawling through URL ordering. Comput. Netw. ISDN Syst. 30, 1-7, 161-172.]]
[21]
COFFMAN,E.J.,LIU, Z., AND WEBER, R. R. 1997. Optimal robot scheduling for web search engines. Tech. Rep. INRIA, Rennes, France.]]
[22]
DEAN,J.AND HENZINGER, M. R. 1999. Finding related pages in the world wide web. In Proceedings of the Eighth International Conference on The World-Wide Web.]]
[23]
DILIGENTI, M., COETZEE,F.M.,LAWRENCE, S., GILES,C.L.,AND GORI, M. 2000. Focused crawling using context graphs. In Proceedings of the 26th International Conference on Very Large Data Bases.]]
[24]
DOUGLIS, F., FELDMANN, A., AND KRISHNAMURTHY, B. 1999. Rate of change and other metrics: a live study of the world wide web. In Proceedings of the USENIX Symposium on Internetworking Technologies and Systems. USENIX Assoc., Berkeley, CA.]]
[25]
DUMAIS,S.T.,FURNAS,G.W.,LANDAUER,T.K.,DEERWESTER, S., AND HARSHMAN, R. 1988. Using latent semantic analysis to improve access to textual information. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '88, Washington, DC, May 15-19), J. J. O'Hare, Ed. ACM Press, New York, NY, 281-285.]]
[26]
EGGHE,L.AND ROUSSEAU, R. 1990. Introduction to Informetrics. Elsevier Science Inc., New York, NY.]]
[27]
FALOUTSOS, C. 1985. Access methods for text. ACM Comput. Surv. 17, 1 (Mar.), 49-74.]]
[28]
FALOUTSOS,C.AND CHRISTODOULAKIS, S. 1984. Signature files: An access method for documents and its analytical performance evaluation. ACM Trans. Inf. Syst. 2, 4 (Oct.), 267-288.]]
[29]
GARFIELD, E. 1972. Citation analysis as a tool in journal evaluation. Science 178, 471-479.]]
[30]
GIBSON, D., KLEINBERG, J., AND RAGHAVAN, P. 1998. Inferring Web communities from link topology. In Proceedings of the 9th ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space-Structure in Hypermedia Systems (HYPERTEXT '98, Pittsburgh, PA, June 20-24), R. Akscyn, Chair. ACM Press, New York, NY, 225-234.]]
[31]
GOLUB,G.AND VAN LOAN, C. F. 1989. Matrix Computations. 2nd ed. Johns Hopkins University Press, Baltimore, MD.]]
[32]
HAVELIWALA, T. 1999. Efficient computation of pagerank. Tech. Rep. 1999-31. Computer Systems Laboratory, Stanford University, Stanford, CA. http://dbpubs.stanford.edu/ pub/1999-31.]]
[33]
HAWKING, D., CRASWELL, N., AND THISTLEWAITE, P. 1998. Overview of TREC-7 very large collection track. In Proceedings of the 7th Conference on Text Retrieval (TREC-7).]]
[34]
HIRAI, J., RAGHAVAN, S., GARCIA-MOLINA, H., AND PAEPCKE, A. 2000. Webbase: A repository of web pages. In Proceedings of the Ninth International Conference on The World Wide Web. 277-293.]]
[35]
HUBERMAN,B.A.AND ADAMIC, L. A. 1999. Growth dynamics of the world wide web. Nature 401, 6749 (Sept.).]]
[36]
KLEINBERG, J. 1999. Authoritative sources in a hyperlinked environment. J. ACM 46,6 (Nov.).]]
[37]
KOSTER, M. 1995. Robots in the web: trick or treat? ConneXions 9, 4 (Apr.).]]
[38]
KUMAR, R., RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. 1999. Trawling the web for emerging cyber-communities. In Proceedings of the Eighth International Conference on The World-Wide Web.]]
[39]
LAWRENCE,S.AND GILES, C. 1998. Searching the world wide web. Science 280, 98-100.]]
[40]
LAWRENCE,S.AND GILES, C. 1999. Accessibility of information on the web. Nature 400, 107-109.]]
[41]
MANBER,U.AND MYERS, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 5 (Oct.), 935-948.]]
[42]
MACLEOD,I.A.,MARTIN, P., AND NORDIN, B. 1986. A design of a distributed full text retrieval system. In Proceedings of 1986 ACM Conference on Research and Development in Informa-tion Retrieval (SIGIR '86, Palazzo dei Congressi, Pisa, Italy, Sept. 8-10), F. Rabitti, Ed. ACM Press, New York, NY, 131-137.]]
[43]
MELNIK, S., RAGHAVAN, S., YANG, B., AND GARCIA-MOLINA, H. 2000. Building a distributed full-text index for the web. Tech. Rep. SIDL-WP-2000-0140, Stanford Digital Library Project. Computer Systems Laboratory, Stanford University, Stanford, CA. http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-2000-0140.]]
[44]
MELNIK, S., RAGHAVAN, S., YANG, B., AND GARCIA-MOLINA, H. 2001. Building a distributed full-text index for the web. In Proceedings of the Tenth International Conference on The World-Wide Web.]]
[45]
MOFFAT,A.AND BELL, T. A. H. 1995. In situ generation of compressed inverted files. J. Am. Soc. Inf. Sci. 46, 7 (Aug.), 537-550.]]
[46]
MOTWANI,R.AND RAGHAVAN, P. 1995. Randomized Algorithms. Cambridge University Press, New York, NY.]]
[47]
PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. 1998. The pagerank citation ranking: Bringing order to the web. Tech. Rep. Computer Systems Laboratory, Stanford University, Stanford, CA.]]
[48]
PINSKI,G.AND NARIN, F. 1976. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Inf. Process. Manage. 12.]]
[49]
PITKOW,J.AND PIROLLI, P. 1997. Life, death, and lawfulness on the electronic frontier. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '97, Atlanta, GA, Mar. 22-27), S. Pemberton, Ed. ACM Press, New York, NY, 383-390.]]
[50]
RIBEIRO-NETO,B.A.AND BARBOSA, R. A. 1998. Query performance for tightly coupled distributed digital libraries. In Proceedings of the Third ACM Conference on Digital Libraries (DL '98, Pittsburgh, PA, June 23-26), I. Witten, R. Akscyn, and F. M. Shipman, Eds. ACM Press, New York, NY, 182-190.]]
[51]
ROBOTS EXCLUSION PROTOCOL. 2000. Robots Exclusion Protocol. http://info.webcrawler.com/ mak/projects/robots/exclusion.html.]]
[52]
SALTON, G., ED. 1988. Automatic Text Processing. Addison-Wesley Series in Computer Science. Addison-Wesley Longman Publ. Co., Inc., Reading, MA.]]
[53]
TOMASIC,A.AND GARCIA-MOLINA, H. 1993. Performance of inverted indices in distributed text document retrieval systems. In Proceedings of the 2nd International Conference on Parallel and Distributed Systems (Dec.). 8-17.]]
[54]
VILES,C.L.AND FRENCH, J. C. 1995. Dissemination of collection wide information in a distributed information retrieval system. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95, Seattle, WA, July 9-13), E. A. Fox, P. Ingwersen, and R. Fidel, Eds. ACM Press, New York, NY, 12-20.]]
[55]
WILLS,C.E.AND MIKHAILOV, M. 1999. Towards a better understanding of web resources and server responses for improved caching. In Proceedings of the Eighth International Conference on The World-Wide Web.]]
[56]
WITTEN, I., MOFFAT, A., AND BELL, T. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. 2nd ed. Morgan Kaufmann Publishers Inc., San Francisco, CA.]]

Cited By

View all
  • (2024)How do I find reusable models?Software and Systems Modeling (SoSyM)10.1007/s10270-023-01103-723:1(85-102)Online publication date: 1-Feb-2024
  • (2024)Web Miner: Automated Web Crawling and Database System with Puppeteer and Node.jsSmart Systems: Innovations in Computing10.1007/978-981-97-3690-4_12(149-159)Online publication date: 30-Sep-2024
  • (2023)Web Tarayıcıları için Etkili Tohum URL Seçimi ve Kapsam Genişletme AlgoritmasıEffective Seed URL Selection and Scope Extension Algorithm for Web CrawlerInternational Journal of Advances in Engineering and Pure Sciences10.7240/jeps.117419335:1(27-38)Online publication date: 30-Mar-2023
  • Show More Cited By

Recommendations

Reviews

Jun Lin

Are you curious about the way Web search engines provide users with a list of URLs after just a few keywords are entered__?__ This article gives an overview on the core engine that makes this possible. The authors start by discussing the challenges of the Internet and the architecture of Web search engines. The major challenges—the size, rapid change, lack of coherence, and interlinked nature of the Internet—introduce the rest of the article. Chapter Two discusses the discovery of information from the Internet, with a detailed explanation of the challenges of crawler models in respect to page selection and page refresh. Chapter Three introduces the storage and distributed Web repository. In chapters Four and Five, the authors present the most popular indexing architectures used in Web search services and ranking and link analysis. The article covers in detail the fundamental components of Web search engines and the most common design and implementation. For each component discussed, a conclusion section is provided to summarize the concepts and give further alert on challenges to be addressed in the future. Theoretical analysis and arguments are supported by the authors’ own experiments in which data and statistics are provided. The article is most suitable for readers who are interested in the design of Web search engines. In addition, readers who want to submit their Web pages to search engines can also be benefit from reading this article to increase the matching-rate and ranking of their Web pages among the search engines. For other readers who search the Internet to find certain information, this article is also a good source to understand the technology involved, though it does not contain direct guidelines towards the most popular Web search engines on the market. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Internet Technology
ACM Transactions on Internet Technology  Volume 1, Issue 1
Aug. 2001
140 pages
ISSN:1533-5399
EISSN:1557-6051
DOI:10.1145/383034
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 August 2001
Published in TOIT Volume 1, Issue 1

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. HITS
  2. PageRank
  3. authorities
  4. crawling
  5. indexing
  6. information retrieval
  7. link analysis
  8. search engine

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)36
  • Downloads (Last 6 weeks)4
Reflects downloads up to 12 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)How do I find reusable models?Software and Systems Modeling (SoSyM)10.1007/s10270-023-01103-723:1(85-102)Online publication date: 1-Feb-2024
  • (2024)Web Miner: Automated Web Crawling and Database System with Puppeteer and Node.jsSmart Systems: Innovations in Computing10.1007/978-981-97-3690-4_12(149-159)Online publication date: 30-Sep-2024
  • (2023)Web Tarayıcıları için Etkili Tohum URL Seçimi ve Kapsam Genişletme AlgoritmasıEffective Seed URL Selection and Scope Extension Algorithm for Web CrawlerInternational Journal of Advances in Engineering and Pure Sciences10.7240/jeps.117419335:1(27-38)Online publication date: 30-Mar-2023
  • (2023)Reinforcement learning based web crawler detection for diversity and dynamicsNeurocomputing10.1016/j.neucom.2022.11.059520(115-128)Online publication date: Feb-2023
  • (2023)Implementation of artificial intelligence and machine learning-based methods in brain–computer interactionComputers in Biology and Medicine10.1016/j.compbiomed.2023.107135163:COnline publication date: 1-Sep-2023
  • (2022)Finding Multidimensional Constraint Reachable Paths for Attributed GraphsICST Transactions on Scalable Information Systems10.4108/eetsis.v9i4.2581(e2)Online publication date: 22-Aug-2022
  • (2022)Influential Spreader Identification in Complex Networks Based on Network Connectivity and EfficiencyWireless Communications & Mobile Computing10.1155/2022/78963802022Online publication date: 1-Jan-2022
  • (2022)The Semantic Organization of the English Odor VocabularyCognitive Science10.1111/cogs.1320546:11Online publication date: 5-Nov-2022
  • (2022)An Intelligent Data-Centric Web Crawler Service for API Corpus Construction at Scale2022 IEEE International Conference on Web Services (ICWS)10.1109/ICWS55610.2022.00064(385-390)Online publication date: Jul-2022
  • (2022)SecureEngine: Spammer classification in cyber defence for leveraging green computing in Sustainable citySustainable Cities and Society10.1016/j.scs.2021.10365879(103658)Online publication date: Apr-2022
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media