Abstract
The sort operation is a core part of many critical applications (e.g., database management systems). Despite the large efforts to parallelize it, the fact that it suffers from high data-dependencies vastly limits its performance. Multithreaded architectures are emerging as the most demanding technology in leading-edge processors. These architectures include simultaneous multithreading, chip multiprocessors, and machines combining different multithreading technologies. In this paper, we analyze the memory behavior and improve the performance of the most recent parallel radix and quick integer sort algorithms on modern multithreaded architectures. We achieve speedups up to 4.69× for radix sort and up to 4.17× for quicksort on a machine with 4 multithreaded processors compared to single threaded versions, respectively. We find that since radix sort is CPU-intensive, it exhibits better results on chip multiprocessors where multiple CPUs are available. While quicksort is accomplishing speedups on all types of multithreading processers due to its ability to overlap memory miss latencies with other useful processing.
Similar content being viewed by others
References
Brodal G, Fagerberg R, Moruz G (2008) On the adaptiveness of quicksort. J Exp Algorithm (JEA) 12:Article 3.2
Chen J, Juang P, Ko K, Contreras G, Penry D, Rangan R, Stoler A, Peh L, Martonosi M (2005) Hardware-modulated parallelism in chip multiprocessors. ACM SIGARCH Comput Archit News Arch 33(4):54–63
Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to algorithms, 2nd edn. MIT Press and McGraw-Hill, Cambridge, New York. ISBN 0-262-03293-7. Section 8.4: Bucket Sort, pp 174–177
DeWitt D, Naughton J, Schneider D (1992) Parallel Sorting on Shared Nothing Architectures Using Probabilistic Splitting. In: Proceedings of the 1st Intel conference on parallel and distributed info systems, 1992, pp 280–291
Graefe G (2006) Implementing sorting in database systems. ACM Comput Surv (CSUR) 38(3)
Hammond L, Nayfeh B, Olukotun K (1997) A single-chip multiprocessor. IEEE Comput 30(9):79–85
Intel® (2009) VTune Performance Analyzer for Linux. URL: http://www.intel.com/software/products/vtune/
Intel C++ Compiler for Linux (2009) URL: http://www.intel.com/cd/software/products/asmo-na/eng/compilers/277618.htm
Intel® Core 2 Duo (2009) URL: http://www.intel.com/products/processor/core2duo/index.htm
Intel Hyper-Threading Technology (2002) URL: http://www.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf
Jiménez-González D, Navarro JJ, Larriba-Pey J (1999) Communication and cache conscious radix sort. In: Proceedings of the international conference on supercomputing, 1999, pp 76–83
Jiménez-González D, Navarro JJ, Larriba-Pey J (2001) Fast Parallel in-memory 64-bit Sorting. In: Proceedings of the 15th ACM international conference on supercomputing (ICS), 2001, pp 114–122
Jiménez-González D, Navarro JJ, Larriba-Pey J (2003) CC-Radix: a cache conscious sorting based on radix sort. In: Proceedings of the 11th Euromicro conference on parallel distributed and network-based processing (PDP), 2003, pp 101–108
Knuth D (1997) The art of computer programming, vol 3: sorting and searching, 3rd edn. Addison-Wesley, Reading
LaMarca A, Ladner R (1997) The influence of caches on the performance of sorting. In: Proceeding of the ACM/SIAM symposium on discrete algorithms, 1997, pp 370–379
Larriba-Pey JL, Jimenez D, Navarro J (1997) An analysis of superscalar Sorting Algorithms on an R8000 Processor. In: Proceedings of the 17th international conference of the Chilean Computer Science Society (SCCC), 1997, pp 125–134
Lee S, Jeon M, Kim D, Sohn A (2002) Partition parallel radix sort. J Parallel Distrib Comput 656–668
Marr DT, Binns F, Hill DL, Hinton G, Koufaty DA, Miller JA, Upton M (2002) Hyper-threading technology architecture and microarchitecture. Intel Technol J (Q1):4–15
OpenMP® (2009) URL: http://www.openmp.org/
Rahman N, Raman R (2000) Adapting radix sort to the memory hierarchy. In: Proceedings of the 2nd workshop on algorithm engineering and experiments (ALENEX), 2000, pp 131–146
Rahman N, Raman R (2000) Analysing the cache behaviour of non-uniform distribution sorting algorithms. In: Proceedings of the European symposium on algorithms (ESA), 2000, pp 380–391
Sedgewick R (1978) Implementing quicksort programs. Commun ACM 21:847–857
Sohn A, Kodama Y (1998) Load balanced parallel radix sort. In: Proceeding of the international conference of supercomputing, 1998, pp 305–312
Tsigas P, Zhang Yi (2002) Parallel quicksort seems to outperform sample sort on cache-coherent shared memory multiprocessors: an evaluation on SUN ENTERPRISE 10000. Technical Report 2002-03, Department of Computer Science, Chalmers University of Technology
Tsigas P, Zhang Yi (2003) A Simple, Fast Parallel Implementation of Quicksort and its Performance Evaluation on Sun Enterprise 10000. In: Proceedings of the 11th EUROMICRO conference on parallel distributed and network-based processing (PDP), 2003, pp 372–381
Tullsen D, Eggers S, Levy H (1995) Simultaneous multithreading: maximizing on-chip parallelism. In: Proceedings of the 22nd annual international symposium on computer architecture, (ISCA), 1995
Xiao L, Zhang X, Kubricht SA (2000) Improving memory performance of sorting algorithms. ACM J Exp Algorithm 5(3):1–22
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Rashid, L., Hassanein, W.M. & Hammad, M.A. Analyzing and enhancing the parallel sort operation on multithreaded architectures. J Supercomput 53, 293–312 (2010). https://doi.org/10.1007/s11227-009-0294-5
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-009-0294-5