[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Sort vs. Hash Revisited

Published: 01 December 1994 Publication History

Abstract

Efficient algorithms for processing large volumes of data are very important both for relational and new object-oriented database systems. Many query-processing operations can be implemented using sort- or hash-based algorithms, e.g. intersections, joins, and duplicate elimination. In the early relational database systems, only sort-based algorithms were employed. In the last decade, hash-based algorithms have gained acceptance and popularity, and are often considered generally superior to sort-based algorithms such as merge-join. In this article, we compare the concepts behind sort- and hash-based query-processing algorithms and conclude that (1) many dualities exist between the two types of algorithms, (2) their costs differ mostly by percentages rather than by factors, (3) several special cases exist that favor one or the other choice, and (4) there is a strong reason why both hash- and sort-based algorithms should be available in a query-processing system. Our conclusions are supported by experiments performed using the Volcano query execution engine.

References

[1]
{1} A. Aggarval and J. S. Vitter, "The input/output complexity of sorting and related problems," Commun. ACM vol. 31, p. 1116, Oct. 1988.
[2]
{2} M. M. Astrahan, M. W. Blasgen, D. D. Chamberlin, K. P. Eswaran, J. N. Gray, P. P. Griffiths, W. F. King, R. A. Lorie, P. R. McJones, J. W. Mehl, G. R. Putzolu, I. L. Traiger, B. W. Wade, and V. Watson, "System R: A relational approach to database management," ACM Trans. Database Syst. vol. 1, no. 2, p. 97, June 1976 (reprinted in M. Stonebraker, Readings in Database Systems. San Mateo, CA: Morgan Kaufmann, 1988.
[3]
{3} M. Beck, D. Bitton, and W. K. Wilkinson, "Sorting large files on a backend multiprocessor," IEEE Trans. Comput., vol. 37, p. 769, 1988.
[4]
{4} D. Bitton and D. J. Dewitt, "Duplicate record elimination in large data files," ACM Trans. Database Syst., vol. 8, no. 2, p. 255----, June 1983.
[5]
{5} D. Bitton Friedland, "Design, analysis, and implementation of parallel external sorting algorithms," Comput. Sci. Tech. Rep. 464, University of Wisconsin--Madison, Jan. 1982.
[6]
{6} M. Blasgen and K. Eswaran, "On the evaluation of queries in a relational database system," IBM Res. Rep. RJ-1745, San Jose, CA, USA, Apr. 8, 1976.
[7]
{7} M. Blasgen and K. Eswaran, "Storage and access in relational databases," IBM Syst. J., vol. 16, no. 4, 1977.
[8]
{8} K. Bratbergsengen, "Hashing methods and relational algebra operations," Proc. Int. Conf. Very Large Data Bases, 1984, p. 323.
[9]
{9} H. T. Chou, D. J. DeWitt, R. H. Katz, and A. C. Klug, "Design and implementation of the Wisconsin storage system," Software: Practice and Experience, vol. 15, no. 10, p. 943, Oct. 1985.
[10]
{10} D. J. DeWitt, R. Katz, F. Olken, L. Shapiro, M. Stonebraker, and D. Wood, "Implementation techniques for main memory database systems," Proc. ACM SIGMOD Conf., 1984, p. 1.
[11]
{11} D. J. DeWitt and R. H. Gerber, "Multiprocessor hash-based join algorithms," Proc. Int. Conf. Very Large Data Bases, 1985, p. 151.
[12]
{12} D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna, "GAMMA: A high performance dataflow data-base machine," Proc. Int. Conf. Very Large Data Bases, 1986, p. 228 (reprinted in M. Stonebraker, Readings in Database Systems. San Mateo, CA: Morgan Kaufmann, 1988).
[13]
{13} D. J. Dewitt, S. Ghandeharizadeh, and D. Schneider "A performance analysis of the GAMMA database machine," Proc. ACM SIGMOD Conf., 1988, p. 350.
[14]
{14} D. J. Dewitt, "The Wisconsin benchmark: Past, present, and future," in J. Gray, Ed., Database and Transactions Processing Systems Performance Handbook. San Mateo, CA: Morgan Kaufmann, 1991.
[15]
{15} D. J. Dewitt, J. Naughton, and D. Schneider, "Parallel sorting on a shared-nothing architecture using probabilistic splitting," Proc. Int. Conf. Parallel Distrib. Inform. Syst., Miami Beach, FL, USA, Dec. 1991.
[16]
{16} R. Epstein, "Techniques for processing of aggregates in relational database systems," UCB/Electron. Res. Lab. Memo. M79/8, Univ. of California, Feb. 1979.
[17]
{17} R. Fagin, J. Nievergelt, N. Pippenger, and H. R. Strong, "Extendible hashing: A fast access method for dynamic files," ACM Trans. Database Syst., Vol. 4, no. 3, p, 315-----, Sept. 1979.
[18]
{18} S. Fushimi, M. Kitsuregawa, and H. Tanaka, "An overview of the system software of a parallel relational database machine GRACE," Proc. Int. Conf. Very Large Data Bases, Kyoto, Japan, Aug. 1986.
[19]
{19} G. Graefe, "Encapsulation of parallelism in the Volcano query processing system," Proc. ACM SIGMOD Conf., 1990, p. 102-----.
[20]
{20} G. Graefe and D. L. Davison, "Encapsulation of parallelism architecture-independence in extensible database query processing," IEEE TRans. Software Eng., vol. 19, no. 8, pp. 747, Aug. 1993.
[21]
{21} G. Graefe, "Parallel external sorting in Volcano," Tech. Rep. 459, Univ. of Colorado, Boulder, USA, Dept. Comput. Sci., 1991.
[22]
{22} G. Graefe, "Volcano: An extensible and parallel dataflow query processing system," IEEE Trans. Knowledge. Data Eng., vol. 6, no. 1, pp. 120-135, Feb. 1994.
[23]
{23} G. Graefe and S. S. Thakkar, "Tuning a parallel database algorithm on a shared-memory multiprocessor," Software--Practice and Experience, vol. 22, no. 7, p.495, July 1992.
[24]
{24} L. M. Haas, W. F. Cody, J. C. Freytag, G. Lapis, B. G. Lindsay, G. M. Lohman, K. Ono, and H. Pirahesh, "An extensible processor for an extended relational query language," Comput. Sci. Res. Rep., San Jose, CA, USA, Apr. 1988.
[25]
{25} B. R. Iyer and D. M. Dias, "System issues in parallel sorting for database systems," Proc. IEEE Conf. Data Eng. 1990, p. 246.
[26]
{26} T. Keller and G. Graefe, "The one-to-one match operator of the Volcano query processing system," Oregon Graduate Center, Comput. Sci. Tech. Rep., Beaverton, OR, USA, June 1989.
[27]
{27} A. Kemper and G. Moerkotte, "Access support in object bases," Proc. ACM SIGMOD Conf., 1990, p. 364.
[28]
{28} M. Kitsuregawa, H. Tanaka, and T. Motooka, "Application of hash to data base machine and its architecture," New Generation Computing, vol. 1, 1983.
[29]
{29} A. M. Kitsuregawa, M. Nakayama, and M. Takagi, "The effect of bucket size tuning in the dynamic hybrid GRACE hash join method," Proc. Int. Conf. Very Large Data Bases, 1989, p. 257.
[30]
{30} A. Klug, "Access paths in the 'ABE' statistical query facility," Proc. ACM SIGMOD Conf., 1982, p. 161.
[31]
{31} D. Knuth, The Art of Computer Programming: Sorting and Searching, vol. III. Reading, MA: Addison-Wesley, 1973
[32]
{32} R. P. Kooi, "The optimization of queries in relational databases," Ph.D. dissertation, Case Western Reserve Univ., OH, USA, Sept. 1980.
[33]
{33} R. A. Lorie and H. C. Young, "A low communication sort algorithm for a parallel database machine," Proc. Int. Conf. Very Large Data Bases, 1989, p. 125.
[34]
{34} J. Menon, "A study of sort algorithms for multiprocessor database machines," Proc. Int. Conf. Very Large Data Bases, 1986, p. 197.
[35]
{35} M. Nakayama, M. Kitsuregawa, and M. Takagi, "Hash-partitioned join method using dynamic destaging strategy," Proc. Int. Conf. Very Large Data Bases, 1988, p. 468.
[36]
{36} J. Ousterhout, "Why aren't operating systems getting faster as fast as hardware?" WRL Tech. Rep. TN-11, Palo Alto, CA, USA, Oct. 1989.
[37]
{37} J. E. Richardson and M. J. Carey, "Programming constructs for database system implementation in EXODUS," Proc. ACM SIGMOD Conf., 1987, p. 208.
[38]
{38} B. Salzberg, File Structures: An Analytic Approach. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[39]
{39} B. Salzberg, "Merging sorted runs using large main memory," Acta Informatica , vol. 27, p. 195, 1990.
[40]
{40} B. Salzberg, A. Tsukerman, J. Gray, M. Stewart, S. Uren, and B. Vaughan "FastSort: A distributed single-input single-output external sort," Proc. ACM SIGMOD Conf., 1990, p. 94.
[41]
{41} D. Schneider and D. Dewitt, "A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment," Proc. ACM SIGMOD Conf., 1989, p. 110.
[42]
{42} P. G. Selinger, M. M. Astrahan, D.D. Chamberlin, R. A. Lorie, and T. G. Price, "Access path selection in a relational database management system," Proc. ACM SIGMOD Conf., 1979, p. 23 (reprinted in M. Stonebraker, Readings in Database Systems. San Mateo, CA: Morgan-Kaufman, 1988).
[43]
{43} L. D. Shapiro, "Join processing in database systems with large main memories," ACM Trans. Database Syst., vol. 11, no. 3, p. 239, Sept. 1986.
[44]
{44} M. Stonebraker, E. Wong, P. Kreps, and G. D. Held, "The Design and Implementation of INGRES," ACM Trans. Database Syst., vol. 1, no. 3, p. 189, Sept. 1976 (reprinted in M. Stonebraker, Readings in Database Systems. San Mateo, CA: Morgan-Kaufman, 1988).
[45]
{45} Teradata Corp., DBC/1012, Data Base Computer, Concepts, and Facilities , Los Angeles, CA, USA, 1983.
[46]
{46} S. Todd, "PRTV: An efficient implementation for large relational data bases," Proc. Int. Conf. Very Large Data Bases, 1975, p. 554.
[47]
{47} H. Zeller and J. Gray, "an adaptive hash join algorithm for multiuser environments," Proc. Int. Conf. Very Large Data Bases, Brisbane, Australia, 1990.
[48]
{48} G. K. Zipf, Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Reading, MA: Addison-Wesley, 1949.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image IEEE Transactions on Knowledge and Data Engineering
IEEE Transactions on Knowledge and Data Engineering  Volume 6, Issue 6
December 1994
158 pages

Publisher

IEEE Educational Activities Department

United States

Publication History

Published: 01 December 1994

Author Tags

  1. Volcano query execution engine
  2. costs
  3. database theory
  4. dualities
  5. duplicate elimination
  6. file organisation
  7. hash-based algorithms
  8. intersections
  9. joins
  10. merge-join algorithm
  11. object-oriented database systems
  12. object-oriented databases
  13. performance
  14. query processing
  15. query-processing operations
  16. relational database systems
  17. relational databases
  18. sort-based algorithms
  19. sorting
  20. value matching

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 11 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2020)ColumnBurstProceedings of the 11th ACM SIGOPS Asia-Pacific Workshop on Systems10.1145/3409963.3410494(9-16)Online publication date: 24-Aug-2020
  • (2017)Advanced Block Nested Loop Join for Extending SSD LifetimeIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2017.265180329:4(743-756)Online publication date: 1-Apr-2017
  • (2016)ParTimeProceedings of the 2016 International Conference on Management of Data10.1145/2882903.2903732(999-1010)Online publication date: 26-Jun-2016
  • (2013)Multi-core, main-memory joinsProceedings of the VLDB Endowment10.14778/2732219.27322277:1(85-96)Online publication date: 1-Sep-2013
  • (2012)hStorage-DBProceedings of the VLDB Endowment10.14778/2336664.23366795:10(1076-1087)Online publication date: 1-Jun-2012
  • (2012)Advanced partitioning techniques for massively distributed computationProceedings of the 2012 ACM SIGMOD International Conference on Management of Data10.1145/2213836.2213839(13-24)Online publication date: 20-May-2012
  • (2012)New algorithms for join and grouping operationsComputer Science - Research and Development10.1007/s00450-011-0186-927:1(3-27)Online publication date: 1-Feb-2012
  • (2011)Designing fast architecture-sensitive tree search on modern multicore/many-core processorsACM Transactions on Database Systems10.1145/2043652.204365536:4(1-34)Online publication date: 19-Dec-2011
  • (2009)Sort vs. Hash revisitedProceedings of the VLDB Endowment10.14778/1687553.16875642:2(1378-1389)Online publication date: 1-Aug-2009
  • (2008)A case for flash memory ssd in enterprise database applicationsProceedings of the 2008 ACM SIGMOD international conference on Management of data10.1145/1376616.1376723(1075-1086)Online publication date: 9-Jun-2008
  • Show More Cited By

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media