Heterogeneous computing that exploits simultaneous co-processing on different device types has been shown to be effective at both increasing performance and reducing energy consumption. In this paper, we extend a scheduling framework, encapsulated in a high-level C++ template and previously developed for heterogeneous chips comprising CPU and GPU cores, to new high-performance platforms for the data center that include a cache-coherent FPGA fabric and many-core CPU resources. Our goal is to evaluate the suitability of our framework for these new FPGA-based platforms, identifying performance benefits and limitations. We target the state-of-the-art HARP processor, which includes 14 high-end Xeon-class cores tightly coupled to an FPGA device located in the same package. We select eight benchmarks from the high-performance computing domain that have been ported and optimized for this heterogeneous platform. The results show that a dynamic and adaptive scheduler that exploits simultaneous processing among the devices can improve performance by up to a factor of 8× compared to the best alternative solutions that use only the CPU cores or only the FPGA fabric. Moreover, our proposal achieves improvements of up to 15% and 37% compared to the best heterogeneous solutions found with dynamic and static schedulers, respectively.
A chunk is a block of consecutive iterations that are independent of other iterations or chunks of the parallel loop.
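As an illustration of this definition only, the following minimal sketch splits a loop's iteration range into chunks of consecutive iterations. The helper name makeChunks, the fixed chunk size, and the half-open range representation are assumptions made for the example; they are not taken from the framework, whose scheduler chooses chunk sizes per device.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of the paper's framework): split the
// iterations [begin, end) into chunks of at most chunkSize consecutive
// iterations. Each chunk is returned as a half-open range [first, second).
std::vector<std::pair<std::size_t, std::size_t>>
makeChunks(std::size_t begin, std::size_t end, std::size_t chunkSize) {
    assert(chunkSize > 0 && "a chunk must contain at least one iteration");
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    for (std::size_t lo = begin; lo < end; lo += chunkSize) {
        // The last chunk may be shorter when the range is not a multiple
        // of chunkSize.
        chunks.emplace_back(lo, std::min(lo + chunkSize, end));
    }
    return chunks;
}
```

Because the iterations in one chunk do not depend on those in any other, each range produced this way could be handed to a different device and processed concurrently.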
For the FPGA, the recorded computation time includes the data transfer (or map/unmap) times.
nEU is obtained by querying clGetDeviceInfo(deviceId, CL_DEVICE_MAX_COMPUTE_UNITS, ...).
It is not guaranteed that the stabilization criterion is always met. If the number of iterations is not large enough to fully utilize the FPGA, we may finish the computation without leaving the exploration phase.
In the OpenCL standard, the NDRange represents the index space of parallel iterations, which can have one, two, or three dimensions.
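To make the notion of an index space concrete, here is a minimal sketch that does not use the OpenCL API. The struct IndexSpace and the functions totalWorkItems and linearId are hypothetical names introduced for this example; the sketch assumes unused dimensions are given size 1 and that dimension 0 varies fastest.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an OpenCL global work size (illustration only).
// Unused dimensions are assumed to have size 1.
struct IndexSpace {
    std::array<std::size_t, 3> size;
};

// Total number of parallel iterations (work-items) in the index space.
std::size_t totalWorkItems(const IndexSpace& s) {
    return s.size[0] * s.size[1] * s.size[2];
}

// Flatten a 3D index into a single linear id, with dimension 0 varying fastest.
std::size_t linearId(const IndexSpace& s,
                     std::size_t x, std::size_t y, std::size_t z) {
    assert(x < s.size[0] && y < s.size[1] && z < s.size[2]);
    return x + s.size[0] * (y + s.size[1] * z);
}
```

With sizes {4, 3, 2}, for instance, the space holds 24 iterations, and the index (3, 2, 1) is the last one in this ordering.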
This is achieved by allocating the region with clCreateBuffer(..., CL_MEM_ALLOC_HOST_PTR, size, ...) and then mapping this region to a CPU-accessible pointer using clEnqueueMapBuffer().
The authors would like to thank Intel-Altera for the opportunity to be part of the HARP program. This work was partially supported by the Spanish projects TIN2016-80920-R and TIN2016-76635-C2-1-R, gaZ: T48 research group, UNIZAR JIUZ-2017-TEC-09, and the UK EPSRC through the ENPOWER (EP/L00321X/1) and ENEAC (EP/N002539/1) projects.
Rodríguez, A., Navarro, A., Asenjo, R. et al. Parallel multiprocessing and scheduling on the heterogeneous Xeon+FPGA platform. J Supercomput 76, 4645–4665 (2020). https://doi.org/10.1007/s11227-019-02935-1
