article

Building a distributed full-text index for the web

Authors:

Sergey Melink,

Sriram Raghavan,

Beverly Yang,

Hector Garcia-MolinaAuthors Info & Claims

ACM Transactions on Information Systems (TOIS), Volume 19, Issue 3

Pages 217 - 241

https://doi.org/10.1145/502115.502116

Published: 01 July 2001 Publication History

Get Access

Abstract

We identify crucial design issues in building a distributed inverted index for a large collection of Web pages. We introduce a novel pipelining technique for structuring the core index-building system that substantially reduces the index construction time. We also propose a storage scheme for creating and managing inverted files using an embedded database system. We suggest and compare different strategies for collecting global statistics from distributed inverted indexes. Finally, we present performance results from experiments on a testbed distributed Web indexing system that we have implemented.

References

[1]

ANH,V.N.AND MOFFAT, A. 1998. Compressed inverted files with reduced decoding overheads. In Proceedings of the 21st International Conference on Research and Development in Information Retrieval (August), 290-297.

Crossref

Google Scholar

[2]

BLAIR, D. C. 1988. An extended relational document retrieval model. Information Processing and Management 24, 3, 349-371.

Crossref

Google Scholar

[3]

BROWN, E. W. 1995. Fast evaluation of structured queries for information retrieval. In Proceedings of ACM Conference on Research and Development in Information Retrieval (SIGIR), ACM Press, New York, NY, 30-38.

Crossref

Google Scholar

[4]

BROWN,E.W.,CALLAN,J.P.,AND CROFT, W. B. 1994. Fast incremental indexing for full-text information retrieval. In Proceedings of the 20st International Conference on Very Large Databases (September), 192-202.

Crossref

Google Scholar

[5]

BROWN,E.W.,CALLAN,J.P.,CROFT,W.B.,AND MOSS, J. E. B. 1994. Supporting full-text information retrieval with a persistent object store. In 4th International Conference on Extending Database Technology (March), 365-378.

Crossref

Google Scholar

[6]

CCITT. 1988. Recommendation X.209, Specification of Basic Encoding Rules for Abstract Syntax Notation one (ASN. 1).

Google Scholar

[7]

CHAKRABARTI,S.AND MUTHUKRISHNAN, S. 1996. Resource scheduling for parallel database and scientific applications. In 8th ACM Symposium on Parallel Algorithms and Architectures (June), ACM Press, New York, NY, 329-335.

Crossref

Google Scholar

[8]

CHO,J.AND GARCIA-MOLINA, H. 2000. The evolution of the web and implications for an incremental crawler. To appear in the 26th International Conference on Very Large Databases.

Crossref

Google Scholar

[9]

CRASWELL, N., HAWKING,D.,AND THISTLEWALTE, P. 1999. Merging results from isolated search engines. In Proceedings of the 10th Australasian Database Conference (January).

Google Scholar

[10]

DE KRETSER, O., MOFFAT, A., SHIMMMIN,T.,AND ZOBEL, J. 1998. Methodologies for distributed information retrieval. In Proceedings of the 18th International Conference on Distributed Computing Systems.

Crossref

Google Scholar

[11]

FALOUTSOS,C.AND CHRISTODOULAKIS, S. 1984. Signature files: An access method for documents and its analytical performance evaluation. ACMTransactions on Office Information Systems 2,4 (October), ACM Press, New York, NY, 267-288.

Crossref

Google Scholar

[12]

GARCIA-MOLINA, H., ULLMAN,J.,AND WIDOM, J. 2000. Database System Implementation. Prentice-Hall, Eaglewood Cliffs, NJ.

Crossref

Google Scholar

[13]

GORSSMAN,D.A.AND DRISCOLL, J. R. 1992. Structuring text within a relation system. In Proceedigns of the 3rd International Conference on Database and Expert System Applications (September), 72-77.

Google Scholar

[14]

GRAVANO, L., CHANG, K., GARCIA-MOLINA, H., LAGOZE,C.,AND PAEPCKE, A. 1997. STARTS-stanford protocol for internet retrieval and search. http://www-db.stanford.edu/ gravano/starts.html.

Google Scholar

[15]

HAWKING,D.AND CRASWELL, N. 1998. Overview of TREC-7 very large collection track. In Proceedings of the Seventh Text Retrieval Conference (November), 91-104.

Google Scholar

[16]

HIRAI, J., GARCIA-MOLINA, H., AND PAEPCKE, A., RAGHAVAN, S. 2000. WebBase: A repository of web pages. In Proceedings of the 9th International World Wide Web Conference (May), 277-293.

Crossref

Google Scholar

[17]

INKTOMI. 2000. Inktomi WebMap. http://www.inktomi.com/webmap/.

Google Scholar

[18]

JEONG, B.-S. AND OMIECINSKI, E. 1995. Inverted file partitioning schemes in multiple disk systems. IEEE Transactions on Parallel and Distributed Systems 6, 2 (February), IEE Computer Society Press, Los Alamitos, CA, 142-153.

Crossref

Google Scholar

[19]

LAWRENCE,S.AND GILES, C. L. 1998. Inquirus, the NECI meta search engine. In Proceedings of the 7th International World Wide Web Conference.

Crossref

Google Scholar

[20]

LAWRENCE,S.AND GILES, C. L. 1999. Accessibility of information on the web. Nature 400, 107-109.

Google Scholar

[21]

MANBER,U.AND MYERS, G. 1990. Suffix arrays: A new method for on-line string searches. In Proceedings of the 1st ACM-SIAM Symposium on Discrete Algorithms, ACM Press, New York, NY, 319-327.

Crossref

Google Scholar

[22]

MARTIN, P., MACLEOD, I. A., AND NORDIN, B. 1986. Adesign of a distributed full text retrieval system. In Proceedings of the ACM Conference on Research and Development in Information Retrieval (September), ACM Press, New York, NY, 131-137.

Crossref

Google Scholar

[23]

MELNIK, S., GARCIA-MOLINA, H., YANG,B.,AND RAGHAVAN, S. 2000. Building a distributed full-text index for the web. Technical Report SIDL-WP-2000-0140 (July), Stanford Digital Library Project, Computer Science Dept., Stanford University. Available at www-diglib.stanford.edu/cgibin/get/SIDL-WP-2000-0140.

Google Scholar

[24]

MOFFAT,A.AND BELL, T. 1995. In situ generation of compressed inverted files. Journal of the American Society for Information Science 46, 7, 537-550.

Crossref

Google Scholar

[25]

MOFFAT,A.AND ZOBEL, J. 1996. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems 14, 4 (October), ACM Press, New York, NY, 349-379.

Crossref

Google Scholar

[26]

OLSON, M., BOSTIC, K., AND SELTZER, M. 1999. Berkeley DB. In Proceedings of the 1999 Summer Usenix Technical Conference (June).

Crossref

Google Scholar

[27]

RIBEIRO-NETO,B.AND BARBOSA, R. 1998. Query performance for tightly coupled distributed digital libraries. In Proceedings of the 3rd ACM Conference on Digital Libraries (June), ACM Press, New York, NY, 182-190.

Crossref

Google Scholar

[28]

RIBEIRO-NETO, B., MOURA,E.S.,NEUBERT,M.S.,AND ZIVIANI, N. 1999. Efficient distributed algorithms to build inverted files. In Proceedings of the 22nd ACM Conference on Research and Development in Information Retrieval (August), ACM Press, New York, NY, 105-112.

Crossref

Google Scholar

[29]

SALTON, G. 1989. Information Retrieval: Data Structures and Algorithms. Addison-Wesley, Reading, Massachussetts.

Google Scholar

[30]

TOMASIC,A.AND GARCIA-MOLINA, H. 1993a. Performance of inverted indices in shared-nothing distributed text document information retrieval systems. In Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems (January), 8-17.

Crossref

Google Scholar

[31]

TOMASIC,A.AND GARCIA-MOLINA, H. 1993b. Query processing and inverted indices in shared-nothing document information retrieval systems. VLDB Journal 2, 3, 243-275.

Crossref

Google Scholar

[32]

TOMASIC, A., GARCIA-MOLINA, H., AND SHOENS, K. 1994. Incremental update of inverted list for text document retrieval. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data (May), ACM Press, New York, NY, 289-300.

Crossref

Google Scholar

[33]

VILES, C. L. 1994. Maintaining state in a distributed information retrieval system. In 32nd Southeast Conference of the ACM, ACM Press, New York, NY, 157-161.

Google Scholar

[34]

VILES,C.L.AND FRENCH, J. C. 1995. Dissemination of collection wide information in a distributed information retrieval system. In Proceedigns of the 18th International ACM Conference on Research and Development in Information Retrieval (July), ACM Press, New York, NY, 12-20.

Crossref

Google Scholar

[35]

WITTEN, I. H., MOFFAT, A., AND BELL, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images (2nd ed.). Morgan Kauffman Publishing, San Francisco.

Crossref

Google Scholar

[36]

ZOBEL, J., MOFFAT, A., AND SACKS-DAVIS, R. 1992. An efficient indexing technique for full-text database systems. In 18th International Conference on Very Large Databases (August 1992), pp. 352-362.

Crossref

Google Scholar

Cited By

View all

Öztürk EMesut A(2024)CComp: Sıkıştırılmış kelime arama için paralel bir sıkıştırma algoritmasıGazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi10.17341/gazimmfd.119981139:3(1933-1944)Online publication date: 20-May-2024
https://doi.org/10.17341/gazimmfd.1199811
Aggarwal CAggarwal C(2022)Information Retrieval and Search EnginesMachine Learning for Text10.1007/978-3-030-96623-2_9(257-302)Online publication date: 10-Feb-2022
https://doi.org/10.1007/978-3-030-96623-2_9
Akram SWang ZWrigstad T(2021)Exploiting Intel optane persistent memory for full text searchProceedings of the 2021 ACM SIGPLAN International Symposium on Memory Management10.1145/3459898.3463906(80-93)Online publication date: 22-Jun-2021
https://dl.acm.org/doi/10.1145/3459898.3463906
Show More Cited By

Index Terms

Building a distributed full-text index for the web
1. Information systems
  1. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
  2. Information storage systems
    1. Storage architectures
      1. Distributed storage

Recommendations

Building a distributed full-text index for the Web
WWW '01: Proceedings of the 10th international conference on World Wide Web
Dynamic maintenance of web indexes using landmarks
WWW '03: Proceedings of the 12th international conference on World Wide Web

Recent work on incremental crawling has enabled the indexed document collection of a search engine to be more synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index ...
Inverted files versus signature files for text indexing

Two well-known indexing methods are inverted files and signature files. We have undertaken a detailed comparison of these two approaches in the context of text indexing, paying particular attention to query evaluation speed and space requirements. We ...

Reviews

Reviewer: Dimitrios Katsaros

The building of a distributed, full-text (inverted) index for very large collections of documents, such as those encountered in search engines for the Web, can create architectural challenges. This paper explains how a three-tier architecture can be employed in this scenario, and also presents some issues related to the use of an embedded database system as the repository for the inverted index, and the procedure of gathering statistics for the indexed terms. The three-tier architecture separates the tasks involved in the construction of a distributed inverted index. The first tier of this architecture stores the collection of documents to be indexed, the second performs the indexing, and the third accommodates the inverted index. This layered approach allows for pipelining of the different tasks, and thus results in improved performance and scalability. The rest of the issues discussed in this paper have been studied extensively in the literature; no significant additional contributions are made. In general, the paper is well written, comprehensive, and of appropriate length, with references to related work. It is intended for those with a strong background in information retrieval systems. Overall, the work introduces some important architectural design alternatives, which should be considered by anyone attempting to build a high-performance distributed full-text index for the Web. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

ACM Transactions on Information Systems Volume 19, Issue 3

July 2001

119 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/502115

Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 July 2001

Published in TOIS Volume 19, Issue 3

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

56
Total Citations
View Citations
2,148
Total Downloads

Downloads (Last 12 months)8
Downloads (Last 6 weeks)3

Reflects downloads up to 12 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

View all

Öztürk EMesut A(2024)CComp: Sıkıştırılmış kelime arama için paralel bir sıkıştırma algoritmasıGazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi10.17341/gazimmfd.119981139:3(1933-1944)Online publication date: 20-May-2024
https://doi.org/10.17341/gazimmfd.1199811
Aggarwal CAggarwal C(2022)Information Retrieval and Search EnginesMachine Learning for Text10.1007/978-3-030-96623-2_9(257-302)Online publication date: 10-Feb-2022
https://doi.org/10.1007/978-3-030-96623-2_9
Akram SWang ZWrigstad T(2021)Exploiting Intel optane persistent memory for full text searchProceedings of the 2021 ACM SIGPLAN International Symposium on Memory Management10.1145/3459898.3463906(80-93)Online publication date: 22-Jun-2021
https://dl.acm.org/doi/10.1145/3459898.3463906
Sharapova E(2020)Increase the Speed of Search Index in the Duplicate Text Detection Systems2020 Systems of Signal Synchronization, Generating and Processing in Telecommunications (SYNCHROINFO)10.1109/SYNCHROINFO49631.2020.9166107(1-5)Online publication date: Jul-2020
https://doi.org/10.1109/SYNCHROINFO49631.2020.9166107
Raj P. M. KMohan ASrinivasa KRaj P.M. KMohan ASrinivasa K(2018)Link AnalysisPractical Social Network Analysis with Python10.1007/978-3-319-96746-2_13(245-278)Online publication date: 26-Aug-2018
https://doi.org/10.1007/978-3-319-96746-2_13
Aggarwal CAggarwal C(2018)Information Retrieval and Search EnginesMachine Learning for Text10.1007/978-3-319-73531-3_9(259-304)Online publication date: 20-Mar-2018
https://doi.org/10.1007/978-3-319-73531-3_9
Cieslewski TScaramuzza D(2017)Efficient Decentralized Visual Place Recognition Using a Distributed Inverted IndexIEEE Robotics and Automation Letters10.1109/LRA.2017.26501532:2(640-647)Online publication date: Apr-2017
https://doi.org/10.1109/LRA.2017.2650153
Hassen FAmel G(2017)An efficient synchronous indexing technique for full-text retrieval in distributed databasesProcedia Computer Science10.1016/j.procs.2017.08.071112:C(811-821)Online publication date: 1-Sep-2017
https://dl.acm.org/doi/10.1016/j.procs.2017.08.071
Tolosa GFeuerstein EBecchetti LMarchetti-Spaccamela A(2017)Performance improvements for search systems using an integrated cache of lists + intersectionsInformation Retrieval10.1007/s10791-017-9299-520:3(172-198)Online publication date: 1-Jun-2017
https://dl.acm.org/doi/10.1007/s10791-017-9299-5
Nagoudi EKhorsi ACherroun H(2016)Efficient inverted index with n-gram sampling for string matching in Arabic documents2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA)10.1109/AICCSA.2016.7945743(1-7)Online publication date: Nov-2016
https://doi.org/10.1109/AICCSA.2016.7945743
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Cited By

Index Terms

Recommendations

Building a distributed full-text index for the Web

Dynamic maintenance of web indexes using landmarks

Inverted files versus signature files for text indexing

Reviews

Access critical reviews of Computing literature here