Parallel multiprocessing and scheduling on the heterogeneous Xeon+FPGA platform

Andrés Rodríguez¹,
Angeles Navarro¹,
Rafael Asenjo¹ (ORCID: orcid.org/0000-0002-1570-3863),
Francisco Corbera¹,
Rubén Gran²,
Darío Suárez² &
Jose Nunez-Yanez³

Abstract

Heterogeneous computing that exploits simultaneous co-processing with different device types has been shown to be effective at both increasing performance and reducing energy consumption. In this paper, we extend a scheduling framework, encapsulated in a high-level C++ template and previously developed for heterogeneous chips comprising CPU and GPU cores, to new high-performance platforms for the data center, which include a cache-coherent FPGA fabric and many-core CPU resources. Our goal is to evaluate the suitability of our framework for these new FPGA-based platforms, identifying performance benefits and limitations. We target the state-of-the-art HARP processor, which includes 14 high-end Xeon-class cores tightly coupled to an FPGA device located in the same package. We select eight benchmarks from the high-performance computing domain that have been ported and optimized for this heterogeneous platform. The results show that a dynamic and adaptive scheduler that exploits simultaneous processing among the devices can improve performance by up to a factor of 8× compared to the best alternative solutions that use only the CPU cores or only the FPGA fabric. Moreover, our proposal achieves improvements of up to 15% and 37% compared to the best heterogeneous solutions found with dynamic and static schedulers, respectively.

Notes

A chunk is a block of consecutive iterations that are independent of other iterations or chunks of the parallel loop.
For the FPGA, the registered computation time includes the data transfer (or map/unmap) times.
nEU = clGetDeviceInfo(deviceId, CL_DEVICE_MAX_COMPUTE_UNITS).
It is not guaranteed that the stabilization criterion is always met. If the number of iterations is not large enough to fully utilize the FPGA, we may finish the computation without leaving the exploration phase.
In the OpenCL standard, the NDRange represents the 3D space of parallel iterations.
This is achieved by allocating the region with clCreateBuffer(..., CL_MEM_ALLOC_HOST_PTR, size, ...) and then mapping this region to a CPU accessible pointer using clEnqueueMapBuffer().
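
The notion of a chunk in the first note can be illustrated with a minimal C++ sketch. It is not the authors' scheduler: the names `Chunk`, `make_chunks` and `process_chunk` are hypothetical, the chunk size is fixed here (the framework described in the abstract adapts it dynamically), and the loop body is a placeholder. It only shows that chunks are blocks of consecutive iterations that can be processed independently and in any order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A chunk: the half-open range [begin, end) of consecutive loop iterations.
// (Hypothetical type, introduced only for this illustration.)
struct Chunk {
    std::size_t begin;
    std::size_t end;
};

// Split the iteration space [first, last) into chunks of at most
// chunk_size iterations. The last chunk may be shorter.
// Assumption: a fixed chunk size; an adaptive scheduler would vary it.
std::vector<Chunk> make_chunks(std::size_t first, std::size_t last,
                               std::size_t chunk_size) {
    assert(chunk_size > 0 && "chunk_size must be positive");
    std::vector<Chunk> chunks;
    std::size_t b = first;
    while (b < last) {
        // Clamp the chunk end to the end of the iteration space.
        std::size_t e = (last - b < chunk_size) ? last : b + chunk_size;
        chunks.push_back({b, e});
        b = e;
    }
    return chunks;
}

// Apply a placeholder loop body (out[i] = 2 * in[i]) to one chunk.
// Each iteration touches only index i, so chunks do not depend on
// each other and could be handed to different devices.
void process_chunk(const Chunk& c, const std::vector<int>& in,
                   std::vector<int>& out) {
    for (std::size_t i = c.begin; i < c.end; ++i) {
        out[i] = 2 * in[i];
    }
}
```

For instance, `make_chunks(0, 10, 4)` yields the three chunks [0, 4), [4, 8) and [8, 10). Because the iterations are independent, calling `process_chunk` on them in any order fills `out` with the same values.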

Acknowledgements

The authors would like to thank Intel-Altera for the opportunity to be part of the HARP program. This work was partially supported by the Spanish Projects TIN2016-80920-R, TIN2016-76635-C2-1-R, gaZ: T48 research group, UNIZAR JIUZ-2017-TEC-09, UK EPSRC with the ENPOWER (EP/L00321X/1) and the ENEAC (EP/N002539/1) Projects.

Author information

Authors and Affiliations

Department of Computer Architecture, Universidad de Málaga, Andalucía Tech, Málaga, Spain
Andrés Rodríguez, Angeles Navarro, Rafael Asenjo & Francisco Corbera
Computer Architecture Group, Universidad de Zaragoza, Zaragoza, Spain
Rubén Gran & Darío Suárez
Department of Electrical and Electronic Engineering, University of Bristol, Bristol, UK
Jose Nunez-Yanez

Corresponding author

Correspondence to Rafael Asenjo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Rodríguez, A., Navarro, A., Asenjo, R. et al. Parallel multiprocessing and scheduling on the heterogeneous Xeon+FPGA platform. J Supercomput 76, 4645–4665 (2020). https://doi.org/10.1007/s11227-019-02935-1

Published: 18 June 2019
Issue Date: June 2020
DOI: https://doi.org/10.1007/s11227-019-02935-1