[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
article
Free access

B-tries for disk-based string management

Published: 01 January 2009 Publication History

Abstract

A wide range of applications require that large quantities of data be maintained in sort order on disk. The B-tree, and its variants, are an efficient general-purpose disk-based data structure that is almost universally used for this task. The B-trie has the potential to be a competitive alternative for the storage of data where strings are used as keys, but has not previously been thoroughly described or tested. We propose new algorithms for the insertion, deletion, and equality search of variable-length strings in a disk-resident B-trie, as well as novel splitting strategies which are a critical element of a practical implementation. We experimentally compare the B-trie against variants of B-tree on several large sets of strings with a range of characteristics. Our results demonstrate that, although the B-trie uses more memory, it is faster, more scalable, and requires less disk space.

References

[1]
Aoe, J., Morimoto, K., Sato, T.: An efficient implementation of trie structures. Softw Practice Exp 22(9), 695-721 (1992).
[2]
Arge, L.: The buffer tree: a new technique for optimal I/O-algorithms. In: Proc. Int. Workshop on Algorithms and Data Structures, pp. 334-345. Kingston (1995).
[3]
Arge, L.: External memory data structures. In: Handbook of Massive Data Sets, pp. 313-357. Kluwer, Norwell (2002).
[4]
Arnow, D.M., Tenenbaum, A.M.: An empirical comparison of B-trees, compact B-trees and multiway trees. In: Proc. ACM SIGMOD Int. Conf. on the Management of Data, pp. 33-46. Boston (1984).
[5]
Arnow, D.M., Tenenbaum, A.M., Wu, C.: P-trees: Storage efficient multiway trees. In: Proc. ACM SIGIR Int. Conf. on Research and Development in Information Retrieval, pp. 111-121. Montreal (1985).
[6]
Askitis, N., Zobel, J.: Cache-conscious collision resolution in string hash tables. In: Proc. SPIRE String Processing and Information Retrieval Symp., pp. 91-102. Buenos Aires (2005).
[7]
Baeza-Yates, R.A.: An adaptive overflow technique for B-trees. In: Proc. Int. Conf. on Extending Database Technology, pp. 16- 28, Venice (1990).
[8]
Baeza-Yates, R.A., Larson, P.A.: Performance of B+-trees with partial expansions. IEEE Trans Knowl Data Eng 1(2), 248- 257 (1989).
[9]
Bayer, R., McCreight, E.M.: Organization and maintenance of large ordered indices. Acta Inf 1(3), 173-189 (1972).
[10]
Bayer, R., Unterauer, K.: Prefix B-trees. ACM Trans Database Systems 2(1), 11-26 (1977).
[11]
Bell, T.C., Cleary, J.G., Witten, I.H.: Text Compression, 1st edn. Prentice-Hall, New Jersey (1990).
[12]
Bell, T.C., Moffat, A., Witten, I.H., Zobel, J.: The MG retrieval system: compressing for space and speed. Commun ACM 38(4), 41-42 (1995).
[13]
Ben-Asher, Y., Farchi, E., Newman, I.: Optimal search in trees. SIAM J. Comput. 28(6), 2090-2102 (1999).
[14]
Bender, M.A., Demaine, E.D., Farach-Colton, M.: Cache-oblivious B-trees. In: Proc. IEEE Foundations of Computer Science, pp. 399-409, Redondo Beach (2000).
[15]
Bender, M.A., Demaine, E.D., Farach-Colton, M.: Efficient tree layout in amultilevelmemory hierarchy. In: Proc. European Symp. on Algorithms, pp. 165-173, Rome (2002).
[16]
Bender, M.A., Duan, Z., Iacono, J., Wu, J.: A locality-preserving cache-oblivious dynamic dictionary. J. Algorithms 53(2), 115- 136 (2004).
[17]
Bender, M.A., Farach-Colton, M., Kuszmaul, B.C.: Cache-oblivious string B-trees. In: Proc. of ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, pp. 233-242. Chicago (2006).
[18]
Bentley, J.L., Sedgewick, R.: Fast algorithms for sorting and searching strings. In: Proc. ACM SIAM Symp. on Discrete Algorithms, pp. 360-369. New Orleans (1997).
[19]
de la Briandais, R.: File searching using variable length keys. In: Proc. Western Joint Computer Conference, pp. 295-298, New York (1959).
[20]
Brodal, G., Fagerberg, R.: Cache-oblivious string dictionaries. In: Proc. ACM SIAM Symp. on Discrete Algorithms, pp. 581-590, Miami (2006).
[21]
Chang, Y., Lee, C., ChangLiaw, W.: Linear spiral hashing for expansible files. IEEE Trans. Knowl. Data Eng. 11(6), 969- 984 (1999).
[22]
Cheung, C., Yu, J.X., Lu, H.: Constructing suffix tree for gigabyte sequences with megabyte memory. IEEE Trans. Knowl. Data Eng. 17, 90-105 (2005).
[23]
Chong, E.I., Srinivasan, J., Das, S., Freiwald, C., Yalamanchi, A., Jagannath, M., Tran, A., Krishnan, R., Jiang, R.: A mapping mechanism to support bitmap index and other auxiliary structures on tables stored as primary B+trees. ACM SIGMOD Record 32(2), 78-88 (2003).
[24]
Chowdhury, N.M.M.K., Akbar, M.M., Kaykobad, M.: DiskTrie: An efficient data structure using flash memory for mobile devices. In: Workshop on Algorithms and Computation, pp. 76-87. Bangladesh Computer Council Bhaban, Agargaon (2007).
[25]
Ciriani, V., Ferragina, P., Luccio, F., Muthukrishnan, S.: Static optimality theorem for external memory string access. In: IEEE Symp. on the Foundations of Computer Science, pp. 219-227, Vancouver (2002).
[26]
Ciriani, V., Ferragina, P., Luccio, F., Muthukrishnan, S.: A data structure for a sequence of string accesses in external memory. ACM Trans. Algorithms 3(1), 6 (2007).
[27]
Clark, D.R., Munro, J.I.: Efficient suffix trees on secondary storage. In: Proc. ACM SIAM Symp. on Discrete Algorithms, pp. 383-391, Atlanta (1996).
[28]
Comer, D.: Heuristics for trie index minimization. ACM Trans. Database Systems 4(3), 383-395 (1979).
[29]
Comer, D.: Ubiquitous B-tree. ACM Comput. Surv. 11(2), 121- 137 (1979).
[30]
Crauser, A., Ferragina, P.: On constructing suffix arrays in external memory. In: Proc. of European Symp. on Algorithms, pp. 224-235, Prague (1999).
[31]
Culik, K., Ottmann, T., Wood, D.: Dense multiway trees. ACM Trans. Database Systems 6(3), 486-512 (1981).
[32]
Deschler, K.W., Rundensteiner, E.A.: B+Retake: Sustaining high volume inserts into large data pages. In: Proc. Int. Workshop on Data Warehousing and OLAP, pp. 56-63, Atlanta (2001).
[33]
Fan, X., Yang, Y., Zhang, L.: Implementation and evaluation of String B-tree. Tech. rep., University of Florida (2001).
[34]
Farach, M., Ferragina, P., Muthukrishnan, S.: Overcoming the memory bottleneck in suffix tree construction. In: IEEE Symp. on the Foundations of Computer Science, p. 174, Palo Alto (1998).
[35]
Ferragina, P., Grossi, R.: Fast string searching in secondary storage: theoretical developments and experimental results. In: Proc. ACM SIAM Symp. on Discrete Algorithms, pp. 373-382, Atlanta (1996).
[36]
Ferragina, P., Grossi, R.: The string B-tree: a new data structure for string search in external memory and its applications. J. ACM 46(2), 236-280 (1999).
[37]
Ferragina, P., Luccio, F.: Dynamic dictionary matching in external memory. Inf. Comput. 146(2), 85-99 (1998).
[38]
Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552-581 (2005).
[39]
Flajolet, P., Puech, C.: Partial match retrieval of multimedia data. J. ACM 33(2), 371-407 (1986).
[40]
Foster, C.C.: Information retrieval: information storage and retrieval using AVL trees. In: Proc. National Conf., pp. 192-205, Cleveland (1965).
[41]
Fredkin, E.: Trie memory. Commun. ACM 3(9), 490-499 (1960).
[42]
Frigo, M., Leiserson, C., Prokop, H., Ramachandran, S.: Cacheoblivious algorithms. In: IEEE Symp. on the Foundations of Computer Science, p. 285, New York City (1999).
[43]
Garcia-Molina, H., Ullman, J.D., Widom, J.: Database Systems: the Complete Book, 1st edn. Prentice-Hall, New Jersey (2001).
[44]
Gonnet, G.H., Larson, P.: External hashing with limited internal storage. J. ACM 35(1), 161-184 (1988).
[45]
Gray, J., Graefe, G.: The five-minute rule ten years later, and other computer storage rules of thumb. SIGMOD Record 26(4), 63- 68 (1997).
[46]
Gray, J., Reuter, A.: Transaction Processing: Concepts and Techniques, 1st edn. Morgan Kaufmann, San Francisco (1992).
[47]
Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract). In: Proc. ACM Symp. on Theory of Computing, pp. 397- 406, Portland (2000).
[48]
Guibas, L.J., Sedgewick, R.: A dichromatic framework for balanced trees. In: IEEE Symp. on the Foundations of Computer Science, pp. 8-21, Ann Arbor (1978).
[49]
Hansen, W.J.: A cost model for the internal organization of B+- tree nodes. ACM Trans. Program. Languages Systems 3(4), 508- 532 (1981).
[50]
Harman, D.: Overview of the second text retrieval conf. (TREC- 2). Inf. Process. Manage. 31(3), 271-289 (1995).
[51]
Heinz, S., Zobel, J., Williams, H.E.: Burst tries: A fast, efficient data structure for string keys. ACM Trans. Inf. Systems 20(2), 192- 223 (2002).
[52]
Hui, L.C.K., Martel, C.: On efficient unsuccessful search. In: Proc. ACM SIAM Symp. on Discrete Algorithms, pp. 217-227, Orlando (1992).
[53]
Jannink, J.: Implementing deletion in B+-trees. Proc. ACM SIGMOD Int. Conf. Manag. Data 24(1), 33-38 (1995).
[54]
Johnson, T., Shasha, D.: Utilization of B-trees with inserts, deletes and modifies. In: Proc. of ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, pp. 235-246, Philadelphia (1989).
[55]
Johnson, T., Shasha, D.: B-trees with inserts and deletes: why free-at-empty is better than merge-at-half. J. Comput. System Sci. 47(1), 45-76 (1993).
[56]
Kärkkäinen, J., Rao, S.S.: Full-text indexes in external memory. In: Algorithms for Memory Hierarchies, pp. 149-170. Dagstuhl Research Seminar, Schloss Dagstuhl (2002).
[57]
Kato, K.: Persistently cached B-trees. IEEE Trans. Knowl. Data Eng. 15(3), 706-720 (2003).
[58]
Kelley, K.L., Rusinkiewicz, M.: Multikey extensible hashing for relational databases. IEEE Softw. 05(4), 77-85 (1988).
[59]
Knessl, C., Szpankowski, W.: A note on the asymptotic behavior of the height in B-tries for B large. Electron. J. Combinat. 7(R39) (2000).
[60]
Knessl, C., Szpankowski, W.: Limit laws for the height in Patricia tries. J. Algorithms 44(1), 63-97 (2002).
[61]
Knuth, D.E.: The Art of Computer Programming: Sorting and Searching, vol. 3, 2nd edn. Addison-Wesley Longman, Redwood City (1998).
[62]
Ko, P., Aluru, S.: Obtaining provably good performance from suffix trees in secondary storage. In: Proc. Symp. on Combinatorial Pattern Matching, pp. 72-83, Barcelona (2006).
[63]
Ko, P., Aluru, S.: Optimal self-adjusting trees for dynamic string data in secondary storage. In: Proc. SPIRE String Processing and Information Retrieval Symp., pp. 184-194, Santiago (2007).
[64]
Kumar, P.: Cache oblivious algorithms. In: Algorithms for Memory Hierarchies, pp. 193-212. Dagstuhl Research Seminar, Schloss Dagstuhl (2003).
[65]
Kurtz, S.: Reducing the space requirement of suffix trees. Softw. Practice Exp. 29(13), 1149-1171 (1999).
[66]
Ladner, R.E., Fortna, R., Nguyen, B.: A comparison of cache aware and cache oblivious static search trees using program instrumentation. In: Experimental Algorithmics: from Algorithm Design to Robust and Efficient Software, pp. 78-92, New York City (2002).
[67]
Larson, P.: Linear hashing with separators--a dynamic hashing scheme achieving one-access. ACM Trans. Database Systems 13(3), 366-388 (1988).
[68]
Lomet, D.B.: Partial expansions for file organizations with an index. ACM Trans. Database Systems 12(1), 65-84 (1987).
[69]
Mahmoud, H.M.: Evolution of Random Search Trees, 1st edn. J Wiley, New York (1992).
[70]
Makawita, D., Tan, K., Liu, H.: Sampling from databases using B+-trees. In: Proc. CIKM Int. Conf. on Information and Knowledge Management, pp. 158-164, McLean (2000).
[71]
Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. In: Proc. ACM SIAM Symp. on Discrete Algorithms, pp. 319-327, San Francisco (1990).
[72]
Martel, C.: Self-adjusting multi-way search trees. Inf. Process. Lett. 38(3), 135-141 (1991).
[73]
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262-271 (1976).
[74]
Na, J.C., Park, K.: Simple implementation of String B-trees. In: Proc. SPIRE String Processing and Information Retrieval Symp., pp. 214-215, Padova (2004).
[75]
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1), 1-61 (2007).
[76]
Ooi, B.C., Tan, K.: B-trees: Bearing fruits of all kinds. In: Proc. Australasian Database Conf., pp. 13-20, Melbourne (2002).
[77]
Oracle: Berkeley DB, Oracle Embedded Database (2007). http:// www.oracle.com/technology/software/products/berkeley-db/ index.html. Version 4.5.20.
[78]
Pagh, R.: Basic external memory data structures. In: Algorithms for Memory Hierarchies, pp. 14-35. Dagstuhl Research Seminar, Schloss Dagstuhl (2002).
[79]
Pugh, W.: Skip lists: a probabilistic alternative to balanced trees. Commun. ACM 33(6), 668-676 (1990).
[80]
Rao, J., Ross, K.A.: Making B+-trees cache conscious in main memory. In: Proc. ACM SIGMOD Int. Conf. on the Management of Data, pp. 475-486, Dallas (2000).
[81]
Rose, K.R.: Asynchronous generic key/value database. Master's thesis, Massachusetts Institute of Technology (2000).
[82]
Rosenberg, A.L., Snyder, L.: Time and space optimality in B-trees. ACM Trans. Database Systems 6(1), 174-193 (1981).
[83]
Sedgewick, R.: Algorithms in C, Parts 1-4: Fundamentals, Data structures, Sorting, and Searching, 3rd edn. Addison-Wesley, Boston (1998).
[84]
Severance, D.G.: Identifier search mechanisms: a survey and generalized model. ACM Comput. Surv. 6(3), 175-194 (1974).
[85]
Sherk, M.: Self-adjusting k-ary search trees. In: Proc. of Workshop on Algorithms and Data Structures, pp. 381-392, Ottawa (1989).
[86]
Silberschatz, A., Galvin, P.B., Gagne, G.: Operating System Concepts, 7th edn. Wiley, Boston (2004).
[87]
Sleator, D.D., Tarjan, R.E.: Self-adjusting binary search trees. J. ACM 32(3), 652-686 (1985).
[88]
Software, T.M.: C++ string B-tree library (2007). http:// wikipedia-clustering.speedblue.org/strBTree.php
[89]
Szpankowski, W.: Average Case Analysis of Algorithms on Sequences, 1st edn. Wiley, New York City (2001).
[90]
Tian, Y., Tata, S., Hankins, R.A., Patel, J.M.: Practical methods for constructing suffix trees. Int. J. Very Large Databases 14(3), 281- 299 (2005).
[91]
Vitter, J.S.: External memory algorithms and data structures: dealing with massive data. ACM Comput. Surv. 33(2), 209-271 (2001).
[92]
Williams, H.E., Zobel, J., Heinz, S.: Self-adjusting trees in practice for large text collections. Softw. Practice Exp. 31(10), 925- 939 (2001).
[93]
Witten, I.H., Bell, T.C., Moffat, A.: Managing Gigabytes: Compressing and Indexing Documents and Images, 1st edn. Morgan Kaufmann, San Francisco (1999).
[94]
Yao, A.C.: On random 2-3 trees. Acta Inf. 9, 159-170 (1978).
[95]
Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Comput. Surv. 38, 1-56 (2006).
[96]
Zobel, J., Moffat, A., Ramamohanarao, K.: Inverted files versus signature files for text indexing. ACM Trans. Database Systems 23(4), 453-490 (1998).

Cited By

View all
  • (2023)DILI: A Distribution-Driven Learned IndexProceedings of the VLDB Endowment10.14778/3598581.359859316:9(2212-2224)Online publication date: 1-May-2023
  • (2021)Dynamic interleaving of content and structure for robust indexing of semi-structured hierarchical dataProceedings of the VLDB Endowment10.14778/3401960.340196313:10(1641-1653)Online publication date: 10-Mar-2021
  • (2021)Inserting Keys into the Robust Content-and-Structure (RCAS) IndexAdvances in Databases and Information Systems10.1007/978-3-030-82472-3_10(121-135)Online publication date: 24-Aug-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image The VLDB Journal — The International Journal on Very Large Data Bases
The VLDB Journal — The International Journal on Very Large Data Bases  Volume 18, Issue 1
January 2009
373 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 01 January 2009

Author Tags

  1. B-tree
  2. Burst trie
  3. Data structures
  4. Secondary storage
  5. Vocabulary accumulation
  6. Word-level indexing

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)61
  • Downloads (Last 6 weeks)7
Reflects downloads up to 20 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2023)DILI: A Distribution-Driven Learned IndexProceedings of the VLDB Endowment10.14778/3598581.359859316:9(2212-2224)Online publication date: 1-May-2023
  • (2021)Dynamic interleaving of content and structure for robust indexing of semi-structured hierarchical dataProceedings of the VLDB Endowment10.14778/3401960.340196313:10(1641-1653)Online publication date: 10-Mar-2021
  • (2021)Inserting Keys into the Robust Content-and-Structure (RCAS) IndexAdvances in Databases and Information Systems10.1007/978-3-030-82472-3_10(121-135)Online publication date: 24-Aug-2021
  • (2018)Developing a Holistic Understanding of Systems and Algorithms through Research PapersProceedings of the 2017 ITiCSE Conference on Working Group Reports10.1145/3174781.3174786(86-104)Online publication date: 30-Jan-2018
  • (2017)An analysis on the performance of hash table-based dictionary implementations with different data usage modelsInternational Journal of High Performance Computing and Networking10.5555/3070823.307083110:1-2(78-90)Online publication date: 1-Jan-2017
  • (2015)Oracle Workload IntelligenceProceedings of the 2015 ACM SIGMOD International Conference on Management of Data10.1145/2723372.2742791(1669-1681)Online publication date: 27-May-2015
  • (2011)Redesigning the string hash table, burst trie, and BST to exploit cacheACM Journal of Experimental Algorithmics10.1145/1671970.192170415(1.1-1.61)Online publication date: 7-Feb-2011
  • (2009)Fast and compact hash tables for integer keysProceedings of the Thirty-Second Australasian Conference on Computer Science - Volume 9110.5555/1862659.1862675(113-122)Online publication date: 1-Jan-2009

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media