[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/1514894.1514907acmotherconferencesArticle/Chapter ViewAbstractPublication PagesedbtConference Proceedingsconference-collections
research-article
Free access

Optimal splitters for database partitioning with size bounds

Published: 23 March 2009 Publication History

Abstract

Partitioning is an important step in several database algorithms, including sorting, aggregation, and joins. Partitioning is also fundamental for dividing work into equal-sized (or balanced) parallel subtasks. In this paper, we aim to find, materialize and maintain a set of partitioning elements (splitters) for a data set. Unlike traditional partitioning elements, our splitters define both inequality and equality partitions, which allows us to bound the size of the inequality partitions. We provide an algorithm for determining an optimal set of splitters from a sorted data set and show that it has time complexity O(k lg2 N), where k is the number of splitters requested and N is the size of the data set. We show how the algorithm can be extended to pairs of tables, so that joins can be partitioned into work units that have balanced cost. We demonstrate experimentally (a) that finding the optimal set of splitters can be done efficiently, and (b) that using the precomputed splitters can improve the time to sort a data set by up to 76%, with particular benefits in the presence of a few heavy hitters.

References

[1]
R. Agrawal and A. Swami. A one-pass space-efficient algorithm for finding quantiles. In COMAD, 1995.
[2]
S. Chaudhuri. An overview of query optimization in relational systems. In PODS, pages 34--43, 1998.
[3]
D. J. DeWitt, J. F. Naughton, and D. A. Schneider. Parallel sorting on a shared-nothing architecture using probabilistic splitting. In Conf. on Parallel and Dist. Info. Sys., pages 280--291, 1991.
[4]
D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical skew handling in parallel joins. In VLDB, pages 27--40, 1992.
[5]
J. Doweck. Inside intel core microarchitecture. In Hot Chips, August 2006.
[6]
V. Gaede and O. Günther. Multidimensional access methods. ACM Comput. Surv., 30(2):170--231, 1998.
[7]
S. Ghandeharizadeh and D. J. DeWitt. Hybrid-range partitioning strategy: A new declustering strategy for multiprocessor database machines. In VLDB, pages 481--492, 1990.
[8]
J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly generating billion-record synthetic databases. In SIGMOD, pages 243--252, 1994.
[9]
A. P. Gurajada and J. Srivastava. Equidepth partitioning of a data set based on finding its medians. In Applied Computing, pages 92--101, 1991.
[10]
P. J. Haas and A. N. Swami. Sampling-based selectivity estimation for joins using augmented frequent value statistics. In ICDE, pages 522--531, 1995.
[11]
K. A. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In VLDB, pages 525--535, 1991.
[12]
J. Huang and Y. Chow. Parallel sorting and data partitioning by sampling. In Proc. of Symposium on Parallel Algorithms and Architectures, pages 627--631, 1983.
[13]
Y. E. Ioannidis. The history of histograms (abridged). In VLDB, pages 19--30, 2003.
[14]
Y. E. Ioannidis and V. Poosala. Balancing histogram optimality and practicality for query result size estimation. In SIGMOD, pages 233--244, 1995.
[15]
Y. E. Ioannidis and V. Poosala. Histogram-based approximation of set-valued query-answers. In VLDB, pages 174--185, 1999.
[16]
B. R. Iyer et al. Percentile finding algorithm for multiple sorted runs. In VLDB, pages 135--144, 1989.
[17]
H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. In VLDB, pages 275--286, 1998.
[18]
S. Khanna, S. Muthukrishnan, and S. Skiena. Efficient array partitioning. In ICALP, pages 616--626, 1997.
[19]
M. Kitsuregawa and Y. Ogawa. Bucket spreading parallel hash: A new, robust, parallel hash join method for data skew in the super database computer (sdc). In VLDB, pages 210--221, 1990.
[20]
G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In SIGMOD, pages 426--435, 1998.
[21]
C. Martínez and S. Roura. Optimal sampling strategies in quicksort and quickselect. SIAM J. Comput., 31(3):683--705, 2001.
[22]
D. R. Musser. Introspective sorting and selection algorithms. Softw., Pract. Exper., 27(8):983--993, 1997.
[23]
S. Muthukrishnan, V. Poosala, and T. Suel. On rectangular partitionings in two dimensions: Algorithms, complexity, and applications. In ICDT, pages 236--256, 1999.
[24]
V. Poosala and Y. E. Ioannidis. Estimation of query-result distribution and its application in parallel-join load balancing. In VLDB, pages 448--459, 1996.
[25]
V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In SIGMOD, pages 294--305, 1996.
[26]
P. Sanders and S. Winkel. Super scalar sample sort. In European Symposium on Algorithms, pages 784--796, 2004.
[27]
M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O'Neil, P. E. O'Neil, A. Rasin, N. Tran, and S. B. Zdonik. C-store: A column-oriented DBMS. In VLDB, pages 553--564, 2005.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences
ICDT '09: Proceedings of the 12th International Conference on Database Theory
March 2009
334 pages
ISBN:9781605584232
DOI:10.1145/1514894
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 March 2009

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Research-article

Funding Sources

Conference

EDBT/ICDT '09
EDBT/ICDT '09: EDBT/ICDT '09 joint conference
March 23 - 25, 2009
St. Petersburg, Russia

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)50
  • Downloads (Last 6 weeks)11
Reflects downloads up to 12 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2015)Balanced splitting on weighted intervalsOperations Research Letters10.1016/j.orl.2015.05.00543:4(396-400)Online publication date: 1-Jul-2015
  • (2014)Multicore Processors and Database SystemsUbiquity10.1145/26184032014:August(1-7)Online publication date: 1-Aug-2014
  • (2014)A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sortProceedings of the 2014 ACM SIGMOD International Conference on Management of Data10.1145/2588555.2610522(755-766)Online publication date: 18-Jun-2014
  • (2014)A Comparative Study of Implementation Techniques for Query Processing in Multicore SystemsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2012.24326:1(3-15)Online publication date: 1-Jan-2014
  • (2014)Hardware Partitioning for Big Data AnalyticsIEEE Micro10.1109/MM.2014.1134:3(109-119)Online publication date: May-2014
  • (2013)Navigating big data with high-throughput, energy-efficient data partitioningACM SIGARCH Computer Architecture News10.1145/2508148.248594441:3(249-260)Online publication date: 23-Jun-2013
  • (2013)Navigating big data with high-throughput, energy-efficient data partitioningProceedings of the 40th Annual International Symposium on Computer Architecture10.1145/2485922.2485944(249-260)Online publication date: 23-Jun-2013
  • (2013)Optimal splitters for temporal and multi-version databasesProceedings of the 2013 ACM SIGMOD International Conference on Management of Data10.1145/2463676.2465310(109-120)Online publication date: 22-Jun-2013
  • (2012)Massively parallel sort-merge joins in main memory multi-core database systemsProceedings of the VLDB Endowment10.14778/2336664.23366785:10(1064-1075)Online publication date: 1-Jun-2012
  • (2011)Optimizing Multiway Joins in a Map-Reduce EnvironmentIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2011.4723:9(1282-1298)Online publication date: Sep-2011
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media