Understanding Co-Running Behaviors on Integrated CPU/GPU Architectures

Published: 01 March 2017

Abstract

Architecture designers tend to integrate both CPUs and GPUs on the same chip to deliver energy-efficient designs, but it remains an open problem to effectively leverage the advantages of both CPUs and GPUs on integrated architectures. In this work, we port 42 programs from the Rodinia, Parboil, and Polybench benchmark suites and analyze their co-running behaviors on both AMD and Intel integrated architectures. We find that co-running performance is not always better than running a program on only the CPU or only the GPU. Among these programs, only eight benefit from co-running, while 24 achieve their best performance using only the GPU and seven using only the CPU. The remaining three programs show little performance preference for either device. Through extensive workload characterization, we find that the architectural differences between CPUs and GPUs and the limited shared memory bandwidth are the two main factors limiting current co-running performance. Since not all programs can benefit from integrated architectures, we build an automatic decision-tree-based model to help application developers predict the co-running performance of a given CPU-only or GPU-only program. Results show that our model correctly predicts 14 of the 15 evaluated programs. For a co-run-friendly program, we further propose a profiling-based method to predict the optimal workload partition ratio between the CPU and the GPU. Results show that our model achieves 87.7 percent of the performance of the best partition. The co-running programs obtained with our method outperform the original CPU-only and GPU-only programs by 34.5 and 20.9 percent, respectively.
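The profiling-based partitioning idea can be illustrated with a small sketch. The following Python snippet is only a hedged illustration, not the authors' actual method: it assumes the co-run time of a data-parallel kernel is roughly the maximum of the CPU portion's time and the GPU portion's time, takes separately profiled per-device throughputs (the numbers below are made up), and picks the GPU share at which both devices finish together.

    # Minimal sketch (an assumption, not the paper's exact model): if N work-items
    # are split so the GPU gets a fraction r and the CPU gets 1 - r, the co-run
    # time is roughly max(r*N/gpu_throughput, (1-r)*N/cpu_throughput), which is
    # minimized when both devices finish at the same time.

    def predict_partition_ratio(cpu_throughput: float, gpu_throughput: float) -> float:
        """Fraction of the work-items to assign to the GPU so finish times balance."""
        return gpu_throughput / (gpu_throughput + cpu_throughput)

    def corun_time(n_items: int, ratio: float,
                   cpu_throughput: float, gpu_throughput: float) -> float:
        """Estimated co-run time when the GPU processes `ratio` of the work."""
        gpu_time = ratio * n_items / gpu_throughput
        cpu_time = (1.0 - ratio) * n_items / cpu_throughput
        return max(gpu_time, cpu_time)

    if __name__ == "__main__":
        cpu_tp, gpu_tp = 1200.0, 4800.0          # hypothetical work-items per ms
        r = predict_partition_ratio(cpu_tp, gpu_tp)
        print(f"GPU share: {r:.2f}")
        print(f"co-run:    {corun_time(1_000_000, r, cpu_tp, gpu_tp):.1f} ms")
        print(f"GPU-only:  {1_000_000 / gpu_tp:.1f} ms")

Under these made-up throughputs the balanced split finishes in about 167 ms versus 208 ms for the GPU alone; the real gains reported in the abstract additionally depend on shared memory bandwidth contention between the two devices, which this sketch ignores.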

Reviews

Mohamed Zahran

Graphics processing units (GPUs) are used in many applications that are not necessarily related to graphics, and with the rise of big data and machine learning they are gaining even more importance. GPUs exist in two different "flavors": discrete (where the GPU card is plugged into the board through the peripheral component interconnect express (PCIe) bus) or embedded (on the same chip as the multicore processor). This paper discusses the latter. Both AMD and Intel ship GPUs embedded with the multicore, and the paper uses the AMD Kaveri processor and the Intel Haswell processor for the study at hand. The study tries to answer one question: if I have a parallel application to be executed on such a chip, shall I assign all of the threads to the GPU, all of the threads to the central processing unit (CPU), or use both at the same time?

The authors ran extensive experiments with programs drawn from several benchmark suites. Based on these experiments, they place each program in one of three categories: co-run-friendly programs (those that benefit from using the CPU and GPU simultaneously), GPU-dominant programs (those that get the best performance from the GPU only), and CPU-dominant programs (those that get the best performance from the CPU only). A small fraction of programs are ratio-oblivious; that is, they show similar performance no matter the ratio of threads running on the GPU versus the CPU.

The paper then tries to understand the factors determining each category. Most programs were not co-run friendly, for several reasons. First, the CPU and GPU have totally different architectures. For example, the GPU has local memory that the programmer can use to reduce global memory accesses; the CPU does not, so the programming libraries must emulate it, reducing performance. Second, CPUs require less memory bandwidth and benefit from locality (that is, cache-friendly applications), while GPUs require much higher bandwidth because of the massive parallelism they use to hide global memory access latency. The co-run-friendly programs require limited bandwidth and do not have a very high degree of parallelism.

The paper goes one step further by building a black-box machine-learning prediction tool that predicts whether a program is co-run friendly or not. If the program is co-run friendly, the authors propose an analytical tool that determines the ratio of work run on the CPU versus the GPU. Finally, the paper discusses an important topic: power/energy efficiency. The authors found that not all integrated architectures (that is, multicore plus GPU on one chip) are more energy-efficient than discrete ones. Overall, this is a good paper on an important topic that provides good insights into the behavior of parallel programs and when they benefit from multicores versus GPUs.

Online Computing Reviews Service
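The "black-box" predictor the review mentions is, according to the abstract, a decision-tree model trained on program characteristics. The sketch below only illustrates that general workflow with scikit-learn's DecisionTreeClassifier; the features (bandwidth demand, degree of parallelism, local-memory use) and the toy training data are assumptions for illustration, not the paper's actual feature set or training programs.

    # Illustrative sketch only: label a program "co-run", "GPU-only", or "CPU-only"
    # from profiled characteristics. Features and training data are hypothetical.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Per program: [bandwidth demand (GB/s), parallelism (work-items), uses local memory (0/1)]
    X = [
        [5.0,  1e4, 0],   # moderate bandwidth, modest parallelism
        [8.0,  5e4, 0],
        [24.0, 1e6, 1],   # bandwidth-hungry, massively parallel
        [30.0, 2e6, 1],
        [3.0,  2e3, 0],   # small, latency-bound
        [2.0,  1e3, 0],
    ]
    y = ["co-run", "co-run", "GPU-only", "GPU-only", "CPU-only", "CPU-only"]

    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

    # Classify a new, unseen program and inspect the learned rules.
    print(clf.predict([[6.0, 3e4, 0]]))
    print(export_text(clf, feature_names=["bandwidth", "parallelism", "local_mem"]))

The point is the workflow (profile a program, extract features, classify it before deciding whether to port it to a co-running version), not the toy accuracy; the paper reports 14 of 15 correct predictions with its own features and training set.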

Information

Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 28, Issue 3
March 2017
313 pages

Publisher

IEEE Press

Publication History

Published: 01 March 2017

Qualifiers

  • Research-article

Cited By

  • (2024) POAS: a framework for exploiting accelerator level parallelism in heterogeneous environments. The Journal of Supercomputing 80(10), 14666-14693. DOI: 10.1007/s11227-024-06008-w. Online publication date: 1-Jul-2024.
  • (2024) GPU-based butterfly counting. The VLDB Journal — The International Journal on Very Large Data Bases 33(5), 1543-1567. DOI: 10.1007/s00778-024-00861-0. Online publication date: 1-Sep-2024.
  • (2023) Space-Efficient TREC for Enabling Deep Learning on Microcontrollers. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 644-659. DOI: 10.1145/3582016.3582062. Online publication date: 25-Mar-2023.
  • (2023) Enabling Efficient Random Access to Hierarchically Compressed Text Data on Diverse GPU Platforms. IEEE Transactions on Parallel and Distributed Systems 34(10), 2699-2717. DOI: 10.1109/TPDS.2023.3294341. Online publication date: 1-Oct-2023.
  • (2022) Optimizing random access to hierarchically-compressed data on GPU. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.5555/3571885.3571908. Online publication date: 13-Nov-2022.
  • (2022) Efficient load-balanced butterfly counting on GPU. Proceedings of the VLDB Endowment 15(11), 2450-2462. DOI: 10.14778/3551793.3551806. Online publication date: 29-Sep-2022.
  • (2022) Hyperion. Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, 607-621. DOI: 10.1145/3560905.3568511. Online publication date: 6-Nov-2022.
  • (2022) iGPU-Accelerated Pattern Matching on Event Streams. Proceedings of the 18th International Workshop on Data Management on New Hardware, 1-7. DOI: 10.1145/3533737.3535099. Online publication date: 12-Jun-2022.
  • (2022) AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 359-373. DOI: 10.1145/3503222.3507723. Online publication date: 28-Feb-2022.
  • (2022) Dopia. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 32-45. DOI: 10.1145/3503221.3508421. Online publication date: 2-Apr-2022.