Abstract
Performance portability is becoming increasingly important as next-generation high-performance computing systems grow ever more diverse and heterogeneous. Several new approaches to parallel programming, such as SYCL and Kokkos, have been developed in recent years to tackle this challenge. While several studies have been published evaluating these new programming models, they have tended to focus on memory-bandwidth-bound applications. In this paper, we analyse the performance of what appear to be the most promising modern parallel programming models on a diverse range of contemporary high-performance hardware, using a compute-bound molecular docking mini-app.
We present miniBUDE, a mini-app for BUDE, the Bristol University Docking Engine, a real application routinely used for drug discovery. We benchmark miniBUDE using real-world inputs from the full-scale application, so that the mini-app closely follows the performance profile of the original. We implement the mini-app in several programming models targeting both CPUs and GPUs, including SYCL and Kokkos, two of the more promising and widely used modern parallel programming models. We then present an analysis of the performance of each implementation, compared against highly optimised baselines written in established programming models such as OpenMP, OpenCL, and CUDA. Our study covers a wide variety of modern hardware platforms, including CPUs based on the x86 and Arm architectures as well as GPUs.
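To illustrate the structure of the computation, the sketch below shows a heavily simplified, hypothetical SYCL version of a BUDE-style pose-scoring kernel: one work-item per candidate pose, each accumulating an interaction energy over every protein-ligand atom pair. The types and names (Atom, score_poses, and so on) are illustrative only and do not correspond to the actual miniBUDE source.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

// Hypothetical, heavily simplified pose-scoring kernel: one work-item per
// candidate pose, each accumulating an interaction energy over all
// protein-ligand atom pairs. Names are illustrative, not the miniBUDE API.
struct Atom { float x, y, z, charge; };

void score_poses(sycl::queue &q,
                 const std::vector<Atom> &protein,
                 const std::vector<Atom> &ligand,
                 std::vector<float> &energies) {
  const size_t nposes = energies.size();
  sycl::buffer<Atom> protein_buf(protein.data(), sycl::range<1>(protein.size()));
  sycl::buffer<Atom> ligand_buf(ligand.data(), sycl::range<1>(ligand.size()));
  sycl::buffer<float> energy_buf(energies.data(), sycl::range<1>(nposes));

  q.submit([&](sycl::handler &cgh) {
    sycl::accessor p{protein_buf, cgh, sycl::read_only};
    sycl::accessor l{ligand_buf, cgh, sycl::read_only};
    sycl::accessor e{energy_buf, cgh, sycl::write_only, sycl::no_init};

    cgh.parallel_for(sycl::range<1>(nposes), [=](sycl::id<1> pose) {
      float energy = 0.0f;
      // The nested atom loops give the kernel its high arithmetic intensity:
      // O(|ligand| * |protein|) floating-point work per pose.
      for (size_t i = 0; i < l.get_range()[0]; ++i) {
        for (size_t j = 0; j < p.get_range()[0]; ++j) {
          const float dx = l[i].x - p[j].x;
          const float dy = l[i].y - p[j].y;
          const float dz = l[i].z - p[j].z;
          const float r2 = dx * dx + dy * dy + dz * dz + 1e-6f;
          energy += l[i].charge * p[j].charge * sycl::rsqrt(r2);
        }
      }
      e[pose] = energy;
    });
  });
  q.wait(); // results are copied back to energies when energy_buf is destroyed
}
```

It is the arithmetic in the nested atom loops, rather than memory traffic, that makes this style of workload compute-bound and therefore a useful complement to bandwidth-bound benchmarks.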
We found that the emerging parallel programming models could achieve performance comparable to that of the established models, and that a higher-level framework such as SYCL can reach OpenMP levels of performance while aiding productivity. We identify a set of key challenges and pitfalls to take into account when adopting these emerging programming models, some of which stem from implementation-specific effects rather than fundamental design flaws that would prevent further adoption. Finally, we discuss our findings in the wider context of performance-portable compute-bound workloads.
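For comparison, a minimal sketch of the kind of directive-based CPU baseline the emerging models are measured against might look as follows. Again, the names are illustrative and assume the same simplified Atom layout as the SYCL sketch above; this is not the real miniBUDE kernel.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical OpenMP counterpart to the SYCL sketch above, illustrating the
// kind of directive-based baseline the SYCL and Kokkos ports are compared
// against on CPUs. All names are illustrative.
struct Atom { float x, y, z, charge; };

void score_poses_omp(const std::vector<Atom> &protein,
                     const std::vector<Atom> &ligand,
                     std::vector<float> &energies) {
  const std::size_t nposes = energies.size();
  // One candidate pose per iteration; the inner arithmetic must vectorise
  // cleanly for CPU builds to reach their peak floating-point throughput.
  #pragma omp parallel for schedule(static)
  for (std::size_t pose = 0; pose < nposes; ++pose) {
    float energy = 0.0f;
    for (std::size_t i = 0; i < ligand.size(); ++i) {
      for (std::size_t j = 0; j < protein.size(); ++j) {
        const float dx = ligand[i].x - protein[j].x;
        const float dy = ligand[i].y - protein[j].y;
        const float dz = ligand[i].z - protein[j].z;
        const float r2 = dx * dx + dy * dy + dz * dz + 1e-6f;
        energy += ligand[i].charge * protein[j].charge / std::sqrt(r2);
      }
    }
    energies[pose] = energy;
  }
}
```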
Acknowledgement
The authors would like to thank Si Hammond at Sandia National Laboratories for providing short-notice results for the A64FX platform. We also thank James Price and Matt Martineau for their original contributions towards the optimised OpenMP, OpenCL, and CUDA implementations of the BUDE kernel. This study would not have been possible without previous work by the developers of the Bristol University Docking Engine: Richard Sessions, Deborah Shoemark, and Amaurys Avila Ibarra.
This work used the Isambard UK National Tier-2 HPC Service (https://gw4.ac.uk/isambard/) operated by GW4 and the UK Met Office, and funded by EPSRC (EP/T022078/1). Access to the Cray XC50 supercomputer Swan was kindly provided through the Cray Marketing Partner Network. Work in this study was carried out using the HPC Zoo, a research cluster run by the University of Bristol HPC Group (https://uob-hpc.github.io/zoo/).