[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

A Parallel Sort Merge Join Algorithm for Managing Data Skew

Published: 01 January 1993 Publication History

Abstract

A parallel sort-merge-join algorithm which uses a divide-and-conquer approach to address the data skew problem is proposed. The proposed algorithm adds an extra, low-cost scheduling phase to the usual sort, transfer, and join phases. During the schedulingphase, a parallelizable optimization algorithm, using the output of the sort phase,attempts to balance the load across the multiple processors in the subsequent joinphase. The algorithm naturally identifies the largest skew elements, and assigns each ofthem to an optimal number of processors. Assuming a Zipf-like distribution of data skew,the algorithm is demonstrated to achieve very good load balancing for the join phase, andis shown to be very robust relative, among other things, to the degree of data skew andthe total number of processors.

References

[1]
{1} A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms. Reading, MA: Addison-Wesley, 1974.]]
[2]
{2} S. G. Akl and N. Santoro, "Optimal parallel merging and sorting without memory conflicts," IEEE Trans. Comput., vol. C-36, no. 11, pp. 1367-1369, 1987.]]
[3]
{3} L. Bic and R. L. Hartman, "Hither hundreds of processors in a database machine," in Proc. 1985 Int. Workshop Database Machines, Springer-Verlag, 1985.]]
[4]
{4} M. Blasgen and K. Eswaran, "Storage and access in relational databases," IBM Syst. J., vol. 4, p. 363, 1977.]]
[5]
{5} M. Blum, R. W. Floyd, V. R. Pratt, R. L. Rivest, and R. E. Tarjan, "Time bounds for selection," J. Comput. Syst. Sci., vol. 7, pp. 448-461, 1972.]]
[6]
{6} H. Boral, W. Alexander, L. Clay, G. Copeland, S. Danforth, M. Franklin, B. Hart, M. Smith, and P. Valduriez, "Prototyping Bubba, A highly parallel database system," IEEE Trans. Knowledge Data Eng., vol. 2, no. 1, pp. 4-24, Mar. 1990.]]
[7]
{7} S. Christodoulakis "Estimating record selectivities," Inform. Syst., vol. 8, no. 2, pp. 105-115, 1983.]]
[8]
{8} E. Coffman and P. J. Denning, Operating Systems Theory. Englewood Cliffs, NJ: Prentice-Hall, 1973.]]
[9]
{9} E. Coffman, M. Garey, and D. S. Johnson, "An application of bin packing to multiprocessor scheduling," SIAM J. Comput., vol. 7, pp. 1-17, 1978.]]
[10]
{10} D. W. Cornell, D. M. Dias, and P. S. Yu, "On multisystem coupling through function request shipping," IEEE Trans. Software Eng., vol. SE-12, no. 10, pp. 1006-1017, 1986.]]
[11]
{11} S. A. Demurjian, D. K. Hsiao, D. S. Kerr, J. Menon, P. R. Strawser, R. C. Tekampe, J. Trimble, and R. J. Watson, "Performance evaluation of a database system in multiple backend configurations," in Proc. 1985 Int. Workshop Database Machines, Springer-Verlag, 1985.]]
[12]
{12} D. J. Dewitt and R. H. Gerber "Multiprocessor hash-based join algorithms," in Proc. 11th Int. Conf. Very Large Databases, 1985.]]
[13]
{13} D. J. Dewitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H.-I. Hsiao, and R. Rasmussen, "The GAMMA database machine project," IEEE Trans. Knowledge Data Eng., vol. 2, no. 1, pp. 44-62, Mar. 1990.]]
[14]
{14} D. J. DeWitt, M. Smith, and H. Boral, "A single-user performance evaluation of the Teradata database machine," MCC Tech. Rep. DB- 081-87, 1987.]]
[15]
{15} G. Frederickson and D. B. Johnson, "The complexity of selection and ranking in X+Y and matrices with sorted columns," J. Comput. Syst. Sci., vol. 24, pp. 197-209, 1982.]]
[16]
{16} G. Frederickson and D. B. Johnson, "Generalized selection and ranking: Sorted matrices," SIAM J. Comput., vol. 13, pp. 14-30, 1984.]]
[17]
{17} Z. Galil and N. Megiddo "A fast selection algorithm and the problem of optimum distribution of effort," J. ACM, vol. 26, pp. 58-64, 1979.]]
[18]
{18} R. Graham "Bounds on multiprocessing timing anomalies," SIAM J. Appl. Mathemat., vol. 17, no. 2, pp. 416-429, 1969.]]
[19]
{19} D. K. Hsiao, Advanced Database Machine Architecture. Englewood Cliffs, NJ: Prentice-Hall, 1983.]]
[20]
{20} T. Ibaraki and N. Katoh, Resource Allocation Problems. Cambridge, MA: M.I.T. Press, 1988.]]
[21]
{21} B. R. Iyer and D. M. Dias, "System issues in parallel sorting for database systems," in Proc. 6th Int. Conf. Data Eng., 1988.]]
[22]
{22} B. R. Iyer, G. R. Ricard, and P. J. Varman, "Percentile finding algorithm for multiple sorted runs," in Proc. 15th Int. Conf. Very Large Databases, 1989.]]
[23]
{23} W. Kim, "A new way to compute the product and join of relations," in Proc. ACM SIGMOD Conf., Santa Monica, CA, 1980.]]
[24]
{24} M. Kitsuregawa, H. Tanaka, and T. Motooka, "Application of hash to data base machine and its architecture," New Generation Comput., vol. 1, no. 1, 1983.]]
[25]
{25} D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching. Reading, MA: Addison-Wesley, 1973.]]
[26]
{26} S. Lakshmi and P. S. Yu, "Effectiveness of parallel processing in database systems," Comput. Syst. Sci. Eng., vol. 5, no. 2, pp. 73-81, 1990.]]
[27]
{27} S. Lakshmi and P. S. Yu, "Effectiveness of parallel joins," IEEE Trans. Knowledge Data Eng., vol. 2, no. 4, pp. 410-424, 1990.]]
[28]
{28} C. A. Lynch, "Selectivity estimation and query optimization in large databases with highly skewed distributions of column values," in Proc. 14th Int. Conf. Very Large Databases, 1988.]]
[29]
{29} A. Y. Montgomery, D. J. D'Souza, and S. B. Lee, "The cost of relational algebraic operations on skewed data: Estimates and experiments," in Inform. Processing 83, IFIP, 1983.]]
[30]
{30} P. M. Neches and J. E. Shemer, "The genesis of a database computer," IEEE Comput. Mag., vol. 17, no. 11, pp. 42-56, 1984.]]
[31]
{31} E. Ozkarahan, Database Machines and Database Management. Englewood Cliffs, NJ: Prentice-Hall, 1986.]]
[32]
{32} G. Z. Qadah, "The equi-join operation on a multiprocessor database machine: Algorithms and the evaluation of their performance," in Proc. 1985 Int. Workshop Database Machines, Springer-Verlag, 1985, pp. 35-67.]]
[33]
{33} S. Salza, M. Terranova, and P. Velardi, "Performance modeling of the DBMAC architecture," in Proc. 1983 Int. Workshop Database Machines, Springer-Verlag, 1983, pp. 74-90.]]
[34]
{34} D. A. Schneider and D. J. DeWitt, "A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment," in Proc. ACM SIGMOD Conf., Portland, OR, 1989, pp. 110-121.]]
[35]
{35} M. Stonebraker, "The case for shared nothing," IEEE Database Eng., vol. 9, no. 1, 1986.]]
[36]
{36} A. N. Tantawi, D. Towsley, and J. Wolf, "Optimal allocation of multiple class resources in computer systems," in Proc. ACM SIGMETRICS Conf., Santa Fe, NM, 1988, pp: 253-260.]]
[37]
{37} P. Valduriez and G. Gardarin, "Join and semi-ioin algorithms for a multiprocessor database machine," ACM Trans. Database Syst., vol. 9, no. 1, pp. 133-161, 1984.]]
[38]
{38} J. L. Wolf, D. M. Dias, and P. S. Yu, "An effective algorithm for parallelizing hash joins in the presence of data skew," in Proc. 7th Int. Data Eng. Conf., Kobe, Japan, pp. 200-209.]]
[39]
{39} J. L. Wolf, B. Iyer, K. Pattipati, and J. Turek, "Optimal buffer partitioning for the nested block join algorithm," in Proc. 7th Int. Data Eng. Conf., Kobe, Japan, pp. 510-519.]]
[40]
{40} J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek, "Comparative performance of parallel join algorithms," in Proc. 1st Int. Conf. Parallel Distributed Syst., Miami, FL, 1991, pp. 78-88.]]
[41]
{41} G. K. Zipf, Human Behavior and the Principle of Least Effort. Reading, MA: Addison-Wesley, 1949.]]

Cited By

View all
  • (2022)Software-defined address mapping: a case on 3D memoryProceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3503222.3507774(70-83)Online publication date: 28-Feb-2022
  • (2015)Skew-Aware Join Optimization for Array DatabasesProceedings of the 2015 ACM SIGMOD International Conference on Management of Data10.1145/2723372.2723709(123-135)Online publication date: 27-May-2015
  • (2015)Proceedings of the 2015 ACM SIGMOD International Conference on Management of DataundefinedOnline publication date: 27-May-2015
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image IEEE Transactions on Parallel and Distributed Systems
IEEE Transactions on Parallel and Distributed Systems  Volume 4, Issue 1
January 1993
118 pages

Publisher

IEEE Press

Publication History

Published: 01 January 1993

Author Tags

  1. Index Termsdata skew management
  2. Zipf-like distribution
  3. distributed databases
  4. divide-and-conquer
  5. join phases
  6. load balancing
  7. merging
  8. multiple processors
  9. parallel algorithms
  10. parallel sort merge join algorithm
  11. parallelizable optimization algorithm
  12. relational algebra
  13. relational databases
  14. scheduling phase
  15. sort phase
  16. sorting
  17. transfer phase

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 19 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2022)Software-defined address mapping: a case on 3D memoryProceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3503222.3507774(70-83)Online publication date: 28-Feb-2022
  • (2015)Skew-Aware Join Optimization for Array DatabasesProceedings of the 2015 ACM SIGMOD International Conference on Management of Data10.1145/2723372.2723709(123-135)Online publication date: 27-May-2015
  • (2015)Proceedings of the 2015 ACM SIGMOD International Conference on Management of DataundefinedOnline publication date: 27-May-2015
  • (2009)Efficient outer join data skew handling in parallel DBMSProceedings of the VLDB Endowment10.14778/1687553.16875652:2(1390-1396)Online publication date: 1-Aug-2009
  • (2009)Sort vs. Hash revisitedProceedings of the VLDB Endowment10.14778/1687553.16875642:2(1378-1389)Online publication date: 1-Aug-2009
  • (2008)Handling data skew in parallel joins in shared-nothing systemsProceedings of the 2008 ACM SIGMOD international conference on Management of data10.1145/1376616.1376720(1043-1052)Online publication date: 9-Jun-2008
  • (2002)Parallel database sortingInformation Sciences: an International Journal10.1016/S0020-0255(02)00196-2146:1-4(171-219)Online publication date: 1-Oct-2002
  • (2001)Distributed Query Processing in the InternetProceedings of the The 21st International Conference on Distributed Computing Systems10.5555/876878.879319Online publication date: 16-Apr-2001
  • (2001)The Maximum Factor Queue Length Batching Scheme for Video-on-Demand SystemsIEEE Transactions on Computers10.1109/12.90898750:2(97-110)Online publication date: 1-Feb-2001
  • (1997)Disk load balancing for video-on-demand systemsMultimedia Systems10.1007/s0053000500675:6(358-370)Online publication date: 1-Dec-1997
  • Show More Cited By

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media