Understanding Co-Running Behaviors on Integrated CPU/GPU Architectures

Published: 01 March 2017

Abstract

Architecture designers tend to integrate both CPUs and GPUs on the same chip to deliver energy-efficient designs, but it remains an open problem to effectively leverage the advantages of both CPUs and GPUs on integrated architectures. In this work, we port 42 programs from the Rodinia, Parboil, and Polybench benchmark suites and analyze their co-running behaviors on both AMD and Intel integrated architectures. We find that co-running performance is not always better than running a program on only the CPU or only the GPU. Among these programs, only eight benefit from co-running, while 24 achieve their best performance using only the GPU and seven using only the CPU. The remaining three programs show little performance preference for either device. Through extensive workload characterization, we find that the architectural differences between CPUs and GPUs and the limited shared memory bandwidth are the two main factors limiting current co-running performance. Since not all programs can benefit from integrated architectures, we build an automatic decision-tree-based model to help application developers predict the co-running performance of a given CPU-only or GPU-only program. Results show that our model correctly predicts 14 of the 15 evaluated programs. For a co-run-friendly program, we further propose a profiling-based method to predict the optimal workload partition ratio between the CPU and the GPU. Results show that our model achieves 87.7 percent of the performance of the best partition. The co-running programs obtained with our method outperform the original CPU-only and GPU-only programs by 34.5 and 20.9 percent, respectively.
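The profiling-based partitioning idea can be illustrated with a small sketch. The following Python snippet is only a hedged illustration, not the authors' actual method: it assumes the co-run time of a data-parallel kernel is roughly the maximum of the CPU portion's time and the GPU portion's time, takes separately profiled per-device throughputs (the numbers below are made up), and picks the GPU share at which both devices finish together.

    # Minimal sketch (an assumption, not the paper's exact model): if N work-items
    # are split so the GPU gets a fraction r and the CPU gets 1 - r, the co-run
    # time is roughly max(r*N/gpu_throughput, (1-r)*N/cpu_throughput), which is
    # minimized when both devices finish at the same time.

    def predict_partition_ratio(cpu_throughput: float, gpu_throughput: float) -> float:
        """Fraction of the work-items to assign to the GPU so finish times balance."""
        return gpu_throughput / (gpu_throughput + cpu_throughput)

    def corun_time(n_items: int, ratio: float,
                   cpu_throughput: float, gpu_throughput: float) -> float:
        """Estimated co-run time when the GPU processes `ratio` of the work."""
        gpu_time = ratio * n_items / gpu_throughput
        cpu_time = (1.0 - ratio) * n_items / cpu_throughput
        return max(gpu_time, cpu_time)

    if __name__ == "__main__":
        cpu_tp, gpu_tp = 1200.0, 4800.0          # hypothetical work-items per ms
        r = predict_partition_ratio(cpu_tp, gpu_tp)
        print(f"GPU share: {r:.2f}")
        print(f"co-run:    {corun_time(1_000_000, r, cpu_tp, gpu_tp):.1f} ms")
        print(f"GPU-only:  {1_000_000 / gpu_tp:.1f} ms")

Under these made-up throughputs the balanced split finishes in about 167 ms versus 208 ms for the GPU alone; the real gains reported in the abstract additionally depend on shared memory bandwidth contention between the two devices, which this sketch ignores.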

Reviews

Mohamed Zahran

Graphics processing units (GPUs) are used in many applications that are not necessarily related to graphics, and with the rise of big data and machine learning they are gaining even more importance. GPUs exist in two different "flavors": discrete (where the GPU card is plugged into the board through the peripheral component interconnect express (PCIe) bus) or embedded (on the same chip as the multicore processor). This paper discusses the latter. Both AMD and Intel ship GPUs embedded with the multicore, and the paper uses the AMD Kaveri processor and the Intel Haswell processor for the study at hand. The study tries to answer one question: if I have a parallel application to be executed on such a chip, shall I assign all of the threads to the GPU, all of the threads to the central processing unit (CPU), or use both at the same time?

The authors ran extensive experiments with programs drawn from several benchmark suites. Based on these experiments, they place each program in one of three categories: co-run-friendly programs (those that benefit from using the CPU and GPU simultaneously), GPU-dominant programs (those that get the best performance from the GPU only), and CPU-dominant programs (those that get the best performance from the CPU only). A small fraction of programs are ratio-oblivious; that is, they show similar performance no matter the ratio of threads running on the GPU versus the CPU.

The paper then tries to understand the factors determining each category. Most programs were not co-run friendly, for several reasons. First, the CPU and GPU have totally different architectures. For example, the GPU has local memory that the programmer can use to reduce global memory accesses; the CPU does not, so the programming libraries must emulate it, reducing performance. Second, CPUs require less memory bandwidth and benefit from locality (that is, cache-friendly applications), while GPUs require much higher bandwidth because of the massive parallelism they use to hide global memory access latency. The co-run-friendly programs require limited bandwidth and do not have a very high degree of parallelism.

The paper goes one step further by building a black-box machine-learning prediction tool that predicts whether a program is co-run friendly or not. If the program is co-run friendly, the authors propose an analytical tool that determines the ratio of work run on the CPU versus the GPU. Finally, the paper discusses an important topic: power/energy efficiency. The authors found that not all integrated architectures (that is, multicore plus GPU on one chip) are more energy-efficient than discrete ones. Overall, this is a good paper on an important topic that provides good insights into the behavior of parallel programs and when they benefit from multicores versus GPUs.

Online Computing Reviews Service
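The "black-box" predictor the review mentions is, according to the abstract, a decision-tree model trained on program characteristics. The sketch below only illustrates that general workflow with scikit-learn's DecisionTreeClassifier; the features (bandwidth demand, degree of parallelism, local-memory use) and the toy training data are assumptions for illustration, not the paper's actual feature set or training programs.

    # Illustrative sketch only: label a program "co-run", "GPU-only", or "CPU-only"
    # from profiled characteristics. Features and training data are hypothetical.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Per program: [bandwidth demand (GB/s), parallelism (work-items), uses local memory (0/1)]
    X = [
        [5.0,  1e4, 0],   # moderate bandwidth, modest parallelism
        [8.0,  5e4, 0],
        [24.0, 1e6, 1],   # bandwidth-hungry, massively parallel
        [30.0, 2e6, 1],
        [3.0,  2e3, 0],   # small, latency-bound
        [2.0,  1e3, 0],
    ]
    y = ["co-run", "co-run", "GPU-only", "GPU-only", "CPU-only", "CPU-only"]

    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

    # Classify a new, unseen program and inspect the learned rules.
    print(clf.predict([[6.0, 3e4, 0]]))
    print(export_text(clf, feature_names=["bandwidth", "parallelism", "local_mem"]))

The point is the workflow (profile a program, extract features, classify it before deciding whether to port it to a co-running version), not the toy accuracy; the paper reports 14 of 15 correct predictions with its own features and training set.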

Information

Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 28, Issue 3
March 2017
313 pages

Publisher

IEEE Press

Publication History

Published: 01 March 2017

Qualifiers

  • Research-article

Cited By

  • (2024) POAS: a framework for exploiting accelerator level parallelism in heterogeneous environments. The Journal of Supercomputing 80(10), 14666-14693. DOI: 10.1007/s11227-024-06008-w. Online publication date: 1-Jul-2024.
  • (2024) GPU-based butterfly counting. The VLDB Journal — The International Journal on Very Large Data Bases 33(5), 1543-1567. DOI: 10.1007/s00778-024-00861-0. Online publication date: 1-Sep-2024.
  • (2023) Space-Efficient TREC for Enabling Deep Learning on Microcontrollers. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 644-659. DOI: 10.1145/3582016.3582062. Online publication date: 25-Mar-2023.
  • (2023) Enabling Efficient Random Access to Hierarchically Compressed Text Data on Diverse GPU Platforms. IEEE Transactions on Parallel and Distributed Systems 34(10), 2699-2717. DOI: 10.1109/TPDS.2023.3294341. Online publication date: 1-Oct-2023.
  • (2022) Optimizing random access to hierarchically-compressed data on GPU. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.5555/3571885.3571908. Online publication date: 13-Nov-2022.
  • (2022) Efficient load-balanced butterfly counting on GPU. Proceedings of the VLDB Endowment 15(11), 2450-2462. DOI: 10.14778/3551793.3551806. Online publication date: 29-Sep-2022.
  • (2022) Hyperion. Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, 607-621. DOI: 10.1145/3560905.3568511. Online publication date: 6-Nov-2022.
  • (2022) iGPU-Accelerated Pattern Matching on Event Streams. Proceedings of the 18th International Workshop on Data Management on New Hardware, 1-7. DOI: 10.1145/3533737.3535099. Online publication date: 12-Jun-2022.
  • (2022) AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 359-373. DOI: 10.1145/3503222.3507723. Online publication date: 28-Feb-2022.
  • (2022) Dopia. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 32-45. DOI: 10.1145/3503221.3508421. Online publication date: 2-Apr-2022.