Abstract
Recent years have witnessed a processor development trend that integrates the central processing unit (CPU) and the graphics processing unit (GPU) into a single chip. The integration eliminates some of the host-device data copying that a discrete GPU usually requires, but it also introduces deep resource sharing and possible interference between the CPU and the GPU. This work investigates the performance implications of independently co-running CPU and GPU programs on these platforms. First, we perform a comprehensive measurement that covers a wide variety of factors, including processor architectures, operating systems, benchmarks, timing mechanisms, inputs, and power management schemes. These measurements reveal a number of surprising observations. We analyze these observations and distill a list of novel insights, including the important roles of operating system (OS) context switching and power management in determining program performance, and the subtle effect of CPU-GPU data copying. Finally, we confirm those insights through case studies, and point out some promising directions for mitigating anomalous performance degradation on integrated heterogeneous processors.
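The core co-run measurement methodology described above can be illustrated with a minimal sketch: time a program running alone, time it again alongside an independent co-runner, and report the normalized degradation. This is not the paper's actual harness; plain Python threads stand in for the CPU and GPU programs, and the contention here is over the interpreter rather than the shared cache, memory bandwidth, and power budget of an integrated processor.

```python
import time
import threading

def cpu_workload(n=1_000_000):
    # Stand-in for a benchmark: a simple arithmetic loop.
    s = 0
    for i in range(n):
        s += i * i
    return s

def timed(fn):
    # Wall-clock timing with a monotonic, high-resolution clock.
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

# Solo run: the "CPU program" executes alone.
solo = timed(cpu_workload)

# Co-run: the same program while an independent co-runner
# (standing in for the GPU program) contends for shared resources.
competitor = threading.Thread(target=cpu_workload, args=(5_000_000,))
competitor.start()
corun = timed(cpu_workload)
competitor.join()

# Normalized co-run degradation: values above 1 mean the
# co-runner slowed the measured program down.
degradation = corun / solo
print(f"co-run degradation: {degradation:.2f}x")
```

A real experiment along the paper's lines would replace the threads with separate CPU and OpenCL/CUDA processes, pin them appropriately, and repeat the runs under different power management schemes.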
Acknowledgements
We thank the anonymous referees for their constructive comments. This material is based upon work supported by a DOE Early Career Award, the National Science Foundation (NSF) (1455404 and 1525609), and an NSF CAREER Award. This work is also supported in part by the NSF (CNS-1217372, CNS-1239423, CCF-1255729, CNS-1319353, and CNS-1319417) and the National Natural Science Foundation of China (NSFC) (Grant Nos. 61272143, 61272144, and 61472431). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DOE, NSF, or NSFC.
Author information
Qi Zhu, a doctoral candidate, received the MS degree in computer science from the National University of Defense Technology, China in 2012. His research interests include compilers and programming systems, heterogeneous computing, and emerging architectures.
Bo Wu, an assistant professor at Colorado School of Mines, USA, earned a PhD in computer science from the College of William and Mary, USA. His research lies in the broad field of compilers and programming systems, with an emphasis on program optimizations for heterogeneous computing and emerging architectures. His current focus is on high-performance graph analytics on GPUs and memory optimization for irregular applications.
Xipeng Shen is an associate professor at the Computer Science Department, North Carolina State University (NCSU), USA. He has been an IBM Canada Center for Advanced Studies (CAS) Research Faculty Fellow since 2010, and is a recipient of the 2011 DOE Early Career Award and the 2010 National Science Foundation (NSF) CAREER Award. His research lies in the broad field of programming systems, with an emphasis on enabling extreme-scale data-intensive computing and intelligent portable computing through innovations in both compilers and runtime systems. He has been particularly interested in capturing large-scale program behavior patterns, in both data accesses and code executions, and exploiting them for scalable and efficient computing in a heterogeneous, massively parallel environment. He leads the North Carolina State University-Compiler and Adaptive Programming Systems (NC-CAPS) research group.
Kai Shen is an associate professor at the Department of Computer Science, University of Rochester, USA. His research interests fall into the broad area of computer systems. Much of his work is driven by the complexity of modern computer systems and the need for principled approaches to understand, characterize, and manage such complexities. He is particularly interested in the cross-layer work of developing software system solutions to support emerging hardware or address hardware issues, including the characterization and management of memory hardware errors, system support for flash-based SSDs and GPUs, and cyber-physical systems.
Li Shen received the BS, MS, and PhD degrees in computer science and technology from the National University of Defense Technology (NUDT), China in 1997, 2000, and 2003, respectively. Currently, he is an associate professor at the School of Computer, NUDT. His research interests include programming models and compiler design, high-performance processor architecture, virtualization technologies, and performance evaluation and workload characterization. He is a member of CCF and ACM.
Zhiying Wang received the PhD degree in electrical engineering from the National University of Defense Technology (NUDT), China in 1998. Currently, he is a professor at the School of Computer, NUDT. He has contributed over 10 invited chapters to book volumes, published 240 papers in archival journals and refereed conference proceedings, and delivered over 30 keynotes. His main research fields include computer architecture, computer security, very large scale integration (VLSI) design, reliable architectures, multicore memory systems, and asynchronous circuits. He is a member of CCF, ACM, and IEEE.
Cite this article
Zhu, Q., Wu, B., Shen, X. et al. Understanding co-run performance on CPU-GPU integrated processors: observations, insights, directions. Front. Comput. Sci. 11, 130–146 (2017). https://doi.org/10.1007/s11704-016-5468-8