More Web Proxy on the site http://driver.im/

article

Free access

B-tries for disk-based string management

Authors:

Nikolas Askitis,

Justin ZobelAuthors Info & Claims

The VLDB Journal — The International Journal on Very Large Data Bases, Volume 18, Issue 1

Pages 157 - 179

https://doi.org/10.1007/s00778-008-0094-1

Published: 01 January 2009 Publication History

Abstract

A wide range of applications require that large quantities of data be maintained in sort order on disk. The B-tree, and its variants, are an efficient general-purpose disk-based data structure that is almost universally used for this task. The B-trie has the potential to be a competitive alternative for the storage of data where strings are used as keys, but has not previously been thoroughly described or tested. We propose new algorithms for the insertion, deletion, and equality search of variable-length strings in a disk-resident B-trie, as well as novel splitting strategies which are a critical element of a practical implementation. We experimentally compare the B-trie against variants of B-tree on several large sets of strings with a range of characteristics. Our results demonstrate that, although the B-trie uses more memory, it is faster, more scalable, and requires less disk space.

References

[1]

Aoe, J., Morimoto, K., Sato, T.: An efficient implementation of trie structures. Softw Practice Exp 22(9), 695-721 (1992).

Digital Library

[2]

Arge, L.: The buffer tree: a new technique for optimal I/O-algorithms. In: Proc. Int. Workshop on Algorithms and Data Structures, pp. 334-345. Kingston (1995).

Digital Library

[3]

Arge, L.: External memory data structures. In: Handbook of Massive Data Sets, pp. 313-357. Kluwer, Norwell (2002).

Digital Library

[4]

Arnow, D.M., Tenenbaum, A.M.: An empirical comparison of B-trees, compact B-trees and multiway trees. In: Proc. ACM SIGMOD Int. Conf. on the Management of Data, pp. 33-46. Boston (1984).

Digital Library

[5]

Arnow, D.M., Tenenbaum, A.M., Wu, C.: P-trees: Storage efficient multiway trees. In: Proc. ACM SIGIR Int. Conf. on Research and Development in Information Retrieval, pp. 111-121. Montreal (1985).

Digital Library

[6]

Askitis, N., Zobel, J.: Cache-conscious collision resolution in string hash tables. In: Proc. SPIRE String Processing and Information Retrieval Symp., pp. 91-102. Buenos Aires (2005).

Digital Library

[7]

Baeza-Yates, R.A.: An adaptive overflow technique for B-trees. In: Proc. Int. Conf. on Extending Database Technology, pp. 16- 28, Venice (1990).

Digital Library

[8]

Baeza-Yates, R.A., Larson, P.A.: Performance of B⁺-trees with partial expansions. IEEE Trans Knowl Data Eng 1(2), 248- 257 (1989).

Digital Library

[9]

Bayer, R., McCreight, E.M.: Organization and maintenance of large ordered indices. Acta Inf 1(3), 173-189 (1972).

Digital Library

[10]

Bayer, R., Unterauer, K.: Prefix B-trees. ACM Trans Database Systems 2(1), 11-26 (1977).

Digital Library

[11]

Bell, T.C., Cleary, J.G., Witten, I.H.: Text Compression, 1st edn. Prentice-Hall, New Jersey (1990).

Digital Library

[12]

Bell, T.C., Moffat, A., Witten, I.H., Zobel, J.: The MG retrieval system: compressing for space and speed. Commun ACM 38(4), 41-42 (1995).

Digital Library

[13]

Ben-Asher, Y., Farchi, E., Newman, I.: Optimal search in trees. SIAM J. Comput. 28(6), 2090-2102 (1999).

Digital Library

[14]

Bender, M.A., Demaine, E.D., Farach-Colton, M.: Cache-oblivious B-trees. In: Proc. IEEE Foundations of Computer Science, pp. 399-409, Redondo Beach (2000).

Digital Library

[15]

Bender, M.A., Demaine, E.D., Farach-Colton, M.: Efficient tree layout in amultilevelmemory hierarchy. In: Proc. European Symp. on Algorithms, pp. 165-173, Rome (2002).

Digital Library

[16]

Bender, M.A., Duan, Z., Iacono, J., Wu, J.: A locality-preserving cache-oblivious dynamic dictionary. J. Algorithms 53(2), 115- 136 (2004).

Digital Library

[17]

Bender, M.A., Farach-Colton, M., Kuszmaul, B.C.: Cache-oblivious string B-trees. In: Proc. of ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, pp. 233-242. Chicago (2006).

Digital Library

[18]

Bentley, J.L., Sedgewick, R.: Fast algorithms for sorting and searching strings. In: Proc. ACM SIAM Symp. on Discrete Algorithms, pp. 360-369. New Orleans (1997).

Digital Library

[19]

de la Briandais, R.: File searching using variable length keys. In: Proc. Western Joint Computer Conference, pp. 295-298, New York (1959).

Digital Library

[20]

Brodal, G., Fagerberg, R.: Cache-oblivious string dictionaries. In: Proc. ACM SIAM Symp. on Discrete Algorithms, pp. 581-590, Miami (2006).

Digital Library

[21]

Chang, Y., Lee, C., ChangLiaw, W.: Linear spiral hashing for expansible files. IEEE Trans. Knowl. Data Eng. 11(6), 969- 984 (1999).

Digital Library

[22]

Cheung, C., Yu, J.X., Lu, H.: Constructing suffix tree for gigabyte sequences with megabyte memory. IEEE Trans. Knowl. Data Eng. 17, 90-105 (2005).

Digital Library

[23]

Chong, E.I., Srinivasan, J., Das, S., Freiwald, C., Yalamanchi, A., Jagannath, M., Tran, A., Krishnan, R., Jiang, R.: A mapping mechanism to support bitmap index and other auxiliary structures on tables stored as primary B⁺trees. ACM SIGMOD Record 32(2), 78-88 (2003).

Digital Library

[24]

Chowdhury, N.M.M.K., Akbar, M.M., Kaykobad, M.: DiskTrie: An efficient data structure using flash memory for mobile devices. In: Workshop on Algorithms and Computation, pp. 76-87. Bangladesh Computer Council Bhaban, Agargaon (2007).

[25]

Ciriani, V., Ferragina, P., Luccio, F., Muthukrishnan, S.: Static optimality theorem for external memory string access. In: IEEE Symp. on the Foundations of Computer Science, pp. 219-227, Vancouver (2002).

Digital Library

[26]

Ciriani, V., Ferragina, P., Luccio, F., Muthukrishnan, S.: A data structure for a sequence of string accesses in external memory. ACM Trans. Algorithms 3(1), 6 (2007).

Digital Library

[27]

Clark, D.R., Munro, J.I.: Efficient suffix trees on secondary storage. In: Proc. ACM SIAM Symp. on Discrete Algorithms, pp. 383-391, Atlanta (1996).

Digital Library

[28]

Comer, D.: Heuristics for trie index minimization. ACM Trans. Database Systems 4(3), 383-395 (1979).

Digital Library

[29]

Comer, D.: Ubiquitous B-tree. ACM Comput. Surv. 11(2), 121- 137 (1979).

Digital Library

[30]

Crauser, A., Ferragina, P.: On constructing suffix arrays in external memory. In: Proc. of European Symp. on Algorithms, pp. 224-235, Prague (1999).

Digital Library

[31]

Culik, K., Ottmann, T., Wood, D.: Dense multiway trees. ACM Trans. Database Systems 6(3), 486-512 (1981).

Digital Library

[32]

Deschler, K.W., Rundensteiner, E.A.: B+Retake: Sustaining high volume inserts into large data pages. In: Proc. Int. Workshop on Data Warehousing and OLAP, pp. 56-63, Atlanta (2001).

Digital Library

[33]

Fan, X., Yang, Y., Zhang, L.: Implementation and evaluation of String B-tree. Tech. rep., University of Florida (2001).

[34]

Farach, M., Ferragina, P., Muthukrishnan, S.: Overcoming the memory bottleneck in suffix tree construction. In: IEEE Symp. on the Foundations of Computer Science, p. 174, Palo Alto (1998).

Digital Library

[35]

Ferragina, P., Grossi, R.: Fast string searching in secondary storage: theoretical developments and experimental results. In: Proc. ACM SIAM Symp. on Discrete Algorithms, pp. 373-382, Atlanta (1996).

Digital Library

[36]

Ferragina, P., Grossi, R.: The string B-tree: a new data structure for string search in external memory and its applications. J. ACM 46(2), 236-280 (1999).

Digital Library

[37]

Ferragina, P., Luccio, F.: Dynamic dictionary matching in external memory. Inf. Comput. 146(2), 85-99 (1998).

Digital Library

[38]

Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552-581 (2005).

Digital Library

[39]

Flajolet, P., Puech, C.: Partial match retrieval of multimedia data. J. ACM 33(2), 371-407 (1986).

Digital Library

[40]

Foster, C.C.: Information retrieval: information storage and retrieval using AVL trees. In: Proc. National Conf., pp. 192-205, Cleveland (1965).

Digital Library

[41]

Fredkin, E.: Trie memory. Commun. ACM 3(9), 490-499 (1960).

Digital Library

[42]

Frigo, M., Leiserson, C., Prokop, H., Ramachandran, S.: Cacheoblivious algorithms. In: IEEE Symp. on the Foundations of Computer Science, p. 285, New York City (1999).

Digital Library

[43]

Garcia-Molina, H., Ullman, J.D., Widom, J.: Database Systems: the Complete Book, 1st edn. Prentice-Hall, New Jersey (2001).

Digital Library

[44]

Gonnet, G.H., Larson, P.: External hashing with limited internal storage. J. ACM 35(1), 161-184 (1988).

Digital Library

[45]

Gray, J., Graefe, G.: The five-minute rule ten years later, and other computer storage rules of thumb. SIGMOD Record 26(4), 63- 68 (1997).

Digital Library

[46]

Gray, J., Reuter, A.: Transaction Processing: Concepts and Techniques, 1st edn. Morgan Kaufmann, San Francisco (1992).

Digital Library

[47]

Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract). In: Proc. ACM Symp. on Theory of Computing, pp. 397- 406, Portland (2000).

Digital Library

[48]

Guibas, L.J., Sedgewick, R.: A dichromatic framework for balanced trees. In: IEEE Symp. on the Foundations of Computer Science, pp. 8-21, Ann Arbor (1978).

Digital Library

[49]

Hansen, W.J.: A cost model for the internal organization of B⁺- tree nodes. ACM Trans. Program. Languages Systems 3(4), 508- 532 (1981).

Digital Library

[50]

Harman, D.: Overview of the second text retrieval conf. (TREC- 2). Inf. Process. Manage. 31(3), 271-289 (1995).

Digital Library

[51]

Heinz, S., Zobel, J., Williams, H.E.: Burst tries: A fast, efficient data structure for string keys. ACM Trans. Inf. Systems 20(2), 192- 223 (2002).

Digital Library

[52]

Hui, L.C.K., Martel, C.: On efficient unsuccessful search. In: Proc. ACM SIAM Symp. on Discrete Algorithms, pp. 217-227, Orlando (1992).

Digital Library

[53]

Jannink, J.: Implementing deletion in B⁺-trees. Proc. ACM SIGMOD Int. Conf. Manag. Data 24(1), 33-38 (1995).

Digital Library

[54]

Johnson, T., Shasha, D.: Utilization of B-trees with inserts, deletes and modifies. In: Proc. of ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, pp. 235-246, Philadelphia (1989).

Digital Library

[55]

Johnson, T., Shasha, D.: B-trees with inserts and deletes: why free-at-empty is better than merge-at-half. J. Comput. System Sci. 47(1), 45-76 (1993).

Digital Library

[56]

Kärkkäinen, J., Rao, S.S.: Full-text indexes in external memory. In: Algorithms for Memory Hierarchies, pp. 149-170. Dagstuhl Research Seminar, Schloss Dagstuhl (2002).

[57]

Kato, K.: Persistently cached B-trees. IEEE Trans. Knowl. Data Eng. 15(3), 706-720 (2003).

Digital Library

[58]

Kelley, K.L., Rusinkiewicz, M.: Multikey extensible hashing for relational databases. IEEE Softw. 05(4), 77-85 (1988).

Digital Library

[59]

Knessl, C., Szpankowski, W.: A note on the asymptotic behavior of the height in B-tries for B large. Electron. J. Combinat. 7(R39) (2000).

[60]

Knessl, C., Szpankowski, W.: Limit laws for the height in Patricia tries. J. Algorithms 44(1), 63-97 (2002).

Digital Library

[61]

Knuth, D.E.: The Art of Computer Programming: Sorting and Searching, vol. 3, 2nd edn. Addison-Wesley Longman, Redwood City (1998).

Digital Library

[62]

Ko, P., Aluru, S.: Obtaining provably good performance from suffix trees in secondary storage. In: Proc. Symp. on Combinatorial Pattern Matching, pp. 72-83, Barcelona (2006).

Digital Library

[63]

Ko, P., Aluru, S.: Optimal self-adjusting trees for dynamic string data in secondary storage. In: Proc. SPIRE String Processing and Information Retrieval Symp., pp. 184-194, Santiago (2007).

Digital Library

[64]

Kumar, P.: Cache oblivious algorithms. In: Algorithms for Memory Hierarchies, pp. 193-212. Dagstuhl Research Seminar, Schloss Dagstuhl (2003).

[65]

Kurtz, S.: Reducing the space requirement of suffix trees. Softw. Practice Exp. 29(13), 1149-1171 (1999).

Digital Library

[66]

Ladner, R.E., Fortna, R., Nguyen, B.: A comparison of cache aware and cache oblivious static search trees using program instrumentation. In: Experimental Algorithmics: from Algorithm Design to Robust and Efficient Software, pp. 78-92, New York City (2002).

Digital Library

[67]

Larson, P.: Linear hashing with separators--a dynamic hashing scheme achieving one-access. ACM Trans. Database Systems 13(3), 366-388 (1988).

Digital Library

[68]

Lomet, D.B.: Partial expansions for file organizations with an index. ACM Trans. Database Systems 12(1), 65-84 (1987).

Digital Library

[69]

Mahmoud, H.M.: Evolution of Random Search Trees, 1st edn. J Wiley, New York (1992).

[70]

Makawita, D., Tan, K., Liu, H.: Sampling from databases using B⁺-trees. In: Proc. CIKM Int. Conf. on Information and Knowledge Management, pp. 158-164, McLean (2000).

Digital Library

[71]

Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. In: Proc. ACM SIAM Symp. on Discrete Algorithms, pp. 319-327, San Francisco (1990).

Digital Library

[72]

Martel, C.: Self-adjusting multi-way search trees. Inf. Process. Lett. 38(3), 135-141 (1991).

Digital Library

[73]

McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262-271 (1976).

Digital Library

[74]

Na, J.C., Park, K.: Simple implementation of String B-trees. In: Proc. SPIRE String Processing and Information Retrieval Symp., pp. 214-215, Padova (2004).

[75]

Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1), 1-61 (2007).

Digital Library

[76]

Ooi, B.C., Tan, K.: B-trees: Bearing fruits of all kinds. In: Proc. Australasian Database Conf., pp. 13-20, Melbourne (2002).

Digital Library

[77]

Oracle: Berkeley DB, Oracle Embedded Database (2007). http:// www.oracle.com/technology/software/products/berkeley-db/ index.html. Version 4.5.20.

[78]

Pagh, R.: Basic external memory data structures. In: Algorithms for Memory Hierarchies, pp. 14-35. Dagstuhl Research Seminar, Schloss Dagstuhl (2002).

[79]

Pugh, W.: Skip lists: a probabilistic alternative to balanced trees. Commun. ACM 33(6), 668-676 (1990).

Digital Library

[80]

Rao, J., Ross, K.A.: Making B⁺-trees cache conscious in main memory. In: Proc. ACM SIGMOD Int. Conf. on the Management of Data, pp. 475-486, Dallas (2000).

Digital Library

[81]

Rose, K.R.: Asynchronous generic key/value database. Master's thesis, Massachusetts Institute of Technology (2000).

[82]

Rosenberg, A.L., Snyder, L.: Time and space optimality in B-trees. ACM Trans. Database Systems 6(1), 174-193 (1981).

Digital Library

[83]

Sedgewick, R.: Algorithms in C, Parts 1-4: Fundamentals, Data structures, Sorting, and Searching, 3rd edn. Addison-Wesley, Boston (1998).

Digital Library

[84]

Severance, D.G.: Identifier search mechanisms: a survey and generalized model. ACM Comput. Surv. 6(3), 175-194 (1974).

Digital Library

[85]

Sherk, M.: Self-adjusting k-ary search trees. In: Proc. of Workshop on Algorithms and Data Structures, pp. 381-392, Ottawa (1989).

Digital Library

[86]

Silberschatz, A., Galvin, P.B., Gagne, G.: Operating System Concepts, 7th edn. Wiley, Boston (2004).

Digital Library

[87]

Sleator, D.D., Tarjan, R.E.: Self-adjusting binary search trees. J. ACM 32(3), 652-686 (1985).

Digital Library

[88]

Software, T.M.: C++ string B-tree library (2007). http:// wikipedia-clustering.speedblue.org/strBTree.php

[89]

Szpankowski, W.: Average Case Analysis of Algorithms on Sequences, 1st edn. Wiley, New York City (2001).

Digital Library

[90]

Tian, Y., Tata, S., Hankins, R.A., Patel, J.M.: Practical methods for constructing suffix trees. Int. J. Very Large Databases 14(3), 281- 299 (2005).

Digital Library

[91]

Vitter, J.S.: External memory algorithms and data structures: dealing with massive data. ACM Comput. Surv. 33(2), 209-271 (2001).

Digital Library

[92]

Williams, H.E., Zobel, J., Heinz, S.: Self-adjusting trees in practice for large text collections. Softw. Practice Exp. 31(10), 925- 939 (2001).

Digital Library

[93]

Witten, I.H., Bell, T.C., Moffat, A.: Managing Gigabytes: Compressing and Indexing Documents and Images, 1st edn. Morgan Kaufmann, San Francisco (1999).

Digital Library

[94]

Yao, A.C.: On random 2-3 trees. Acta Inf. 9, 159-170 (1978).

Digital Library

[95]

Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Comput. Surv. 38, 1-56 (2006).

Digital Library

[96]

Zobel, J., Moffat, A., Ramamohanarao, K.: Inverted files versus signature files for text indexing. ACM Trans. Database Systems 23(4), 453-490 (1998).

Digital Library

Cited By

Li PLu HZhu RDing BYang LPan G(2023)DILI: A Distribution-Driven Learned IndexProceedings of the VLDB Endowment10.14778/3598581.359859316:9(2212-2224)Online publication date: 1-May-2023
https://dl.acm.org/doi/10.14778/3598581.3598593
Wellenzohn KBöhlen MHelmer S(2021)Dynamic interleaving of content and structure for robust indexing of semi-structured hierarchical dataProceedings of the VLDB Endowment10.14778/3401960.340196313:10(1641-1653)Online publication date: 10-Mar-2021
https://dl.acm.org/doi/10.14778/3401960.3401963
Wellenzohn KPopovic LBöhlen MHelmer S(2021)Inserting Keys into the Robust Content-and-Structure (RCAS) IndexAdvances in Databases and Information Systems10.1007/978-3-030-82472-3_10(121-135)Online publication date: 24-Aug-2021
https://dl.acm.org/doi/10.1007/978-3-030-82472-3_10
Show More Cited By

Index Terms

B-tries for disk-based string management

Recommendations

The string B-tree: a new data structure for string search in external memory and its applications

We introduce a new text-indexing data structure, the String B-Tree, that can be seen as a link between some traditional external-memory and string-matching data structures. In a short phrase, it is a combination of B-trees and Patricia tries for ...
LSM-Trees and B-Trees: The Best of Both Worlds
SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data

LSM-Trees and B-Trees are the two primary data structures used as storage engines in modern key-value (KV) stores. These two structures are optimal for different workloads; LSM-Trees perform better on update queries, whereas B-Trees are preferable for ...
Redundant disk arrays: reliable, parallel secondary storage

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image The VLDB Journal — The International Journal on Very Large Data Bases

The VLDB Journal — The International Journal on Very Large Data Bases Volume 18, Issue 1

January 2009

373 pages

ISSN:1066-8888

Issue’s Table of Contents

Copyright © Copyright © 2009 Springer-Verlag.

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 01 January 2009

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

8
Total Citations
View Citations
610
Total Downloads

Downloads (Last 12 months)61
Downloads (Last 6 weeks)7

Reflects downloads up to 20 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Li PLu HZhu RDing BYang LPan G(2023)DILI: A Distribution-Driven Learned IndexProceedings of the VLDB Endowment10.14778/3598581.359859316:9(2212-2224)Online publication date: 1-May-2023
https://dl.acm.org/doi/10.14778/3598581.3598593
Wellenzohn KBöhlen MHelmer S(2021)Dynamic interleaving of content and structure for robust indexing of semi-structured hierarchical dataProceedings of the VLDB Endowment10.14778/3401960.340196313:10(1641-1653)Online publication date: 10-Mar-2021
https://dl.acm.org/doi/10.14778/3401960.3401963
Wellenzohn KPopovic LBöhlen MHelmer S(2021)Inserting Keys into the Robust Content-and-Structure (RCAS) IndexAdvances in Databases and Information Systems10.1007/978-3-030-82472-3_10(121-135)Online publication date: 24-Aug-2021
https://dl.acm.org/doi/10.1007/978-3-030-82472-3_10
Erkan ABarr JClear TIzu Cdel Alamo CMohammed HNadimpalli MSheard JKorhonen A(2018)Developing a Holistic Understanding of Systems and Algorithms through Research PapersProceedings of the 2017 ITiCSE Conference on Working Group Reports10.1145/3174781.3174786(86-104)Online publication date: 30-Jan-2018
https://dl.acm.org/doi/10.1145/3174781.3174786
(2017)An analysis on the performance of hash table-based dictionary implementations with different data usage modelsInternational Journal of High Performance Computing and Networking10.5555/3070823.307083110:1-2(78-90)Online publication date: 1-Jan-2017
https://dl.acm.org/doi/10.5555/3070823.3070831
Tran QMorfonios KPolyzotis NSellis TDavidson SIves Z(2015)Oracle Workload IntelligenceProceedings of the 2015 ACM SIGMOD International Conference on Management of Data10.1145/2723372.2742791(1669-1681)Online publication date: 27-May-2015
https://dl.acm.org/doi/10.1145/2723372.2742791
Askitis NZobel J(2011)Redesigning the string hash table, burst trie, and BST to exploit cacheACM Journal of Experimental Algorithmics10.1145/1671970.192170415(1.1-1.61)Online publication date: 7-Feb-2011
https://dl.acm.org/doi/10.1145/1671970.1921704
Askitis N(2009)Fast and compact hash tables for integer keysProceedings of the Thirty-Second Australasian Conference on Computer Science - Volume 9110.5555/1862659.1862675(113-122)Online publication date: 1-Jan-2009
https://dl.acm.org/doi/10.5555/1862659.1862675

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents