[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
article

Building a distributed full-text index for the web

Published: 01 July 2001 Publication History

Abstract

We identify crucial design issues in building a distributed inverted index for a large collection of Web pages. We introduce a novel pipelining technique for structuring the core index-building system that substantially reduces the index construction time. We also propose a storage scheme for creating and managing inverted files using an embedded database system. We suggest and compare different strategies for collecting global statistics from distributed inverted indexes. Finally, we present performance results from experiments on a testbed distributed Web indexing system that we have implemented.

References

[1]
ANH,V.N.AND MOFFAT, A. 1998. Compressed inverted files with reduced decoding overheads. In Proceedings of the 21st International Conference on Research and Development in Information Retrieval (August), 290-297.
[2]
BLAIR, D. C. 1988. An extended relational document retrieval model. Information Processing and Management 24, 3, 349-371.
[3]
BROWN, E. W. 1995. Fast evaluation of structured queries for information retrieval. In Proceedings of ACM Conference on Research and Development in Information Retrieval (SIGIR), ACM Press, New York, NY, 30-38.
[4]
BROWN,E.W.,CALLAN,J.P.,AND CROFT, W. B. 1994. Fast incremental indexing for full-text information retrieval. In Proceedings of the 20st International Conference on Very Large Databases (September), 192-202.
[5]
BROWN,E.W.,CALLAN,J.P.,CROFT,W.B.,AND MOSS, J. E. B. 1994. Supporting full-text information retrieval with a persistent object store. In 4th International Conference on Extending Database Technology (March), 365-378.
[6]
CCITT. 1988. Recommendation X.209, Specification of Basic Encoding Rules for Abstract Syntax Notation one (ASN. 1).
[7]
CHAKRABARTI,S.AND MUTHUKRISHNAN, S. 1996. Resource scheduling for parallel database and scientific applications. In 8th ACM Symposium on Parallel Algorithms and Architectures (June), ACM Press, New York, NY, 329-335.
[8]
CHO,J.AND GARCIA-MOLINA, H. 2000. The evolution of the web and implications for an incremental crawler. To appear in the 26th International Conference on Very Large Databases.
[9]
CRASWELL, N., HAWKING,D.,AND THISTLEWALTE, P. 1999. Merging results from isolated search engines. In Proceedings of the 10th Australasian Database Conference (January).
[10]
DE KRETSER, O., MOFFAT, A., SHIMMMIN,T.,AND ZOBEL, J. 1998. Methodologies for distributed information retrieval. In Proceedings of the 18th International Conference on Distributed Computing Systems.
[11]
FALOUTSOS,C.AND CHRISTODOULAKIS, S. 1984. Signature files: An access method for documents and its analytical performance evaluation. ACMTransactions on Office Information Systems 2,4 (October), ACM Press, New York, NY, 267-288.
[12]
GARCIA-MOLINA, H., ULLMAN,J.,AND WIDOM, J. 2000. Database System Implementation. Prentice-Hall, Eaglewood Cliffs, NJ.
[13]
GORSSMAN,D.A.AND DRISCOLL, J. R. 1992. Structuring text within a relation system. In Proceedigns of the 3rd International Conference on Database and Expert System Applications (September), 72-77.
[14]
GRAVANO, L., CHANG, K., GARCIA-MOLINA, H., LAGOZE,C.,AND PAEPCKE, A. 1997. STARTS-stanford protocol for internet retrieval and search. http://www-db.stanford.edu/ gravano/starts.html.
[15]
HAWKING,D.AND CRASWELL, N. 1998. Overview of TREC-7 very large collection track. In Proceedings of the Seventh Text Retrieval Conference (November), 91-104.
[16]
HIRAI, J., GARCIA-MOLINA, H., AND PAEPCKE, A., RAGHAVAN, S. 2000. WebBase: A repository of web pages. In Proceedings of the 9th International World Wide Web Conference (May), 277-293.
[17]
INKTOMI. 2000. Inktomi WebMap. http://www.inktomi.com/webmap/.
[18]
JEONG, B.-S. AND OMIECINSKI, E. 1995. Inverted file partitioning schemes in multiple disk systems. IEEE Transactions on Parallel and Distributed Systems 6, 2 (February), IEE Computer Society Press, Los Alamitos, CA, 142-153.
[19]
LAWRENCE,S.AND GILES, C. L. 1998. Inquirus, the NECI meta search engine. In Proceedings of the 7th International World Wide Web Conference.
[20]
LAWRENCE,S.AND GILES, C. L. 1999. Accessibility of information on the web. Nature 400, 107-109.
[21]
MANBER,U.AND MYERS, G. 1990. Suffix arrays: A new method for on-line string searches. In Proceedings of the 1st ACM-SIAM Symposium on Discrete Algorithms, ACM Press, New York, NY, 319-327.
[22]
MARTIN, P., MACLEOD, I. A., AND NORDIN, B. 1986. Adesign of a distributed full text retrieval system. In Proceedings of the ACM Conference on Research and Development in Information Retrieval (September), ACM Press, New York, NY, 131-137.
[23]
MELNIK, S., GARCIA-MOLINA, H., YANG,B.,AND RAGHAVAN, S. 2000. Building a distributed full-text index for the web. Technical Report SIDL-WP-2000-0140 (July), Stanford Digital Library Project, Computer Science Dept., Stanford University. Available at www-diglib.stanford.edu/cgibin/get/SIDL-WP-2000-0140.
[24]
MOFFAT,A.AND BELL, T. 1995. In situ generation of compressed inverted files. Journal of the American Society for Information Science 46, 7, 537-550.
[25]
MOFFAT,A.AND ZOBEL, J. 1996. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems 14, 4 (October), ACM Press, New York, NY, 349-379.
[26]
OLSON, M., BOSTIC, K., AND SELTZER, M. 1999. Berkeley DB. In Proceedings of the 1999 Summer Usenix Technical Conference (June).
[27]
RIBEIRO-NETO,B.AND BARBOSA, R. 1998. Query performance for tightly coupled distributed digital libraries. In Proceedings of the 3rd ACM Conference on Digital Libraries (June), ACM Press, New York, NY, 182-190.
[28]
RIBEIRO-NETO, B., MOURA,E.S.,NEUBERT,M.S.,AND ZIVIANI, N. 1999. Efficient distributed algorithms to build inverted files. In Proceedings of the 22nd ACM Conference on Research and Development in Information Retrieval (August), ACM Press, New York, NY, 105-112.
[29]
SALTON, G. 1989. Information Retrieval: Data Structures and Algorithms. Addison-Wesley, Reading, Massachussetts.
[30]
TOMASIC,A.AND GARCIA-MOLINA, H. 1993a. Performance of inverted indices in shared-nothing distributed text document information retrieval systems. In Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems (January), 8-17.
[31]
TOMASIC,A.AND GARCIA-MOLINA, H. 1993b. Query processing and inverted indices in shared-nothing document information retrieval systems. VLDB Journal 2, 3, 243-275.
[32]
TOMASIC, A., GARCIA-MOLINA, H., AND SHOENS, K. 1994. Incremental update of inverted list for text document retrieval. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data (May), ACM Press, New York, NY, 289-300.
[33]
VILES, C. L. 1994. Maintaining state in a distributed information retrieval system. In 32nd Southeast Conference of the ACM, ACM Press, New York, NY, 157-161.
[34]
VILES,C.L.AND FRENCH, J. C. 1995. Dissemination of collection wide information in a distributed information retrieval system. In Proceedigns of the 18th International ACM Conference on Research and Development in Information Retrieval (July), ACM Press, New York, NY, 12-20.
[35]
WITTEN, I. H., MOFFAT, A., AND BELL, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images (2nd ed.). Morgan Kauffman Publishing, San Francisco.
[36]
ZOBEL, J., MOFFAT, A., AND SACKS-DAVIS, R. 1992. An efficient indexing technique for full-text database systems. In 18th International Conference on Very Large Databases (August 1992), pp. 352-362.

Cited By

View all
  • (2024)CComp: Sıkıştırılmış kelime arama için paralel bir sıkıştırma algoritmasıGazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi10.17341/gazimmfd.119981139:3(1933-1944)Online publication date: 20-May-2024
  • (2022)Information Retrieval and Search EnginesMachine Learning for Text10.1007/978-3-030-96623-2_9(257-302)Online publication date: 10-Feb-2022
  • (2021)Exploiting Intel optane persistent memory for full text searchProceedings of the 2021 ACM SIGPLAN International Symposium on Memory Management10.1145/3459898.3463906(80-93)Online publication date: 22-Jun-2021
  • Show More Cited By

Recommendations

Reviews

Dimitrios Katsaros

The building of a distributed, full-text (inverted) index for very large collections of documents, such as those encountered in search engines for the Web, can create architectural challenges. This paper explains how a three-tier architecture can be employed in this scenario, and also presents some issues related to the use of an embedded database system as the repository for the inverted index, and the procedure of gathering statistics for the indexed terms. The three-tier architecture separates the tasks involved in the construction of a distributed inverted index. The first tier of this architecture stores the collection of documents to be indexed, the second performs the indexing, and the third accommodates the inverted index. This layered approach allows for pipelining of the different tasks, and thus results in improved performance and scalability. The rest of the issues discussed in this paper have been studied extensively in the literature; no significant additional contributions are made. In general, the paper is well written, comprehensive, and of appropriate length, with references to related work. It is intended for those with a strong background in information retrieval systems. Overall, the work introduces some important architectural design alternatives, which should be considered by anyone attempting to build a high-performance distributed full-text index for the Web. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Information Systems
ACM Transactions on Information Systems  Volume 19, Issue 3
July 2001
119 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/502115
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 July 2001
Published in TOIS Volume 19, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Distributed indexing
  2. Embedded databases
  3. Inverted files
  4. Pipelining
  5. Text retrieval

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)8
  • Downloads (Last 6 weeks)3
Reflects downloads up to 12 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)CComp: Sıkıştırılmış kelime arama için paralel bir sıkıştırma algoritmasıGazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi10.17341/gazimmfd.119981139:3(1933-1944)Online publication date: 20-May-2024
  • (2022)Information Retrieval and Search EnginesMachine Learning for Text10.1007/978-3-030-96623-2_9(257-302)Online publication date: 10-Feb-2022
  • (2021)Exploiting Intel optane persistent memory for full text searchProceedings of the 2021 ACM SIGPLAN International Symposium on Memory Management10.1145/3459898.3463906(80-93)Online publication date: 22-Jun-2021
  • (2020)Increase the Speed of Search Index in the Duplicate Text Detection Systems2020 Systems of Signal Synchronization, Generating and Processing in Telecommunications (SYNCHROINFO)10.1109/SYNCHROINFO49631.2020.9166107(1-5)Online publication date: Jul-2020
  • (2018)Link AnalysisPractical Social Network Analysis with Python10.1007/978-3-319-96746-2_13(245-278)Online publication date: 26-Aug-2018
  • (2018)Information Retrieval and Search EnginesMachine Learning for Text10.1007/978-3-319-73531-3_9(259-304)Online publication date: 20-Mar-2018
  • (2017)Efficient Decentralized Visual Place Recognition Using a Distributed Inverted IndexIEEE Robotics and Automation Letters10.1109/LRA.2017.26501532:2(640-647)Online publication date: Apr-2017
  • (2017)An efficient synchronous indexing technique for full-text retrieval in distributed databasesProcedia Computer Science10.1016/j.procs.2017.08.071112:C(811-821)Online publication date: 1-Sep-2017
  • (2017)Performance improvements for search systems using an integrated cache of lists + intersectionsInformation Retrieval10.1007/s10791-017-9299-520:3(172-198)Online publication date: 1-Jun-2017
  • (2016)Efficient inverted index with n-gram sampling for string matching in Arabic documents2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA)10.1109/AICCSA.2016.7945743(1-7)Online publication date: Nov-2016
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media