
research-article

Enabling preemptive multiprogramming on GPUs

Authors:

Javier Cabezas,

Mateo Valero

ISCA '14: Proceedings of the 41st Annual International Symposium on Computer Architecture

Pages 193 - 204

Published: 14 June 2014

Abstract

GPUs are increasingly being adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems usually run multiple applications, from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to meet key multiprogrammed workload requirements, such as responsiveness, fairness or quality of service.

In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, the average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.
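The abstract does not say how the scheduling policy maps priorities to core counts. As a rough illustration only, the sketch below assumes a proportional-share rule: each running kernel receives a number of cores proportional to a numeric priority weight, with leftover cores going to the largest fractional remainders. The function name `partition_cores`, the weights, and the core count of 15 are hypothetical and are not taken from the paper.

```python
def partition_cores(priorities, num_cores):
    """Split num_cores among kernels in proportion to their priority weights.

    Assumption: proportional sharing stands in for the paper's policy,
    which the abstract does not spell out.

    priorities: dict mapping a kernel name to a positive numeric weight.
    Returns a dict mapping each kernel name to an integer core count.
    """
    if not priorities:
        return {}
    if any(w <= 0 for w in priorities.values()):
        raise ValueError("priority weights must be positive")

    total = sum(priorities.values())
    # Ideal (fractional) share of each kernel.
    ideal = {k: num_cores * w / total for k, w in priorities.items()}
    # Start from the rounded-down share.
    alloc = {k: int(share) for k, share in ideal.items()}
    # Hand the leftover cores to the largest fractional remainders.
    leftover = num_cores - sum(alloc.values())
    by_remainder = sorted(priorities, key=lambda k: ideal[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


# Hypothetical workload: three kernels sharing 15 cores.
shares = partition_cores({"high": 4, "mid": 2, "low": 1}, num_cores=15)
print(shares)  # {'high': 9, 'mid': 4, 'low': 2}
```

In a dynamic setting, a function like this would be re-run whenever a kernel starts or finishes, and cores whose assignment changes would need to be preempted, which is where the preemption mechanisms proposed in the paper would come in.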





Information & Contributors

Information

Published In


ISCA '14: Proceedings of the 41st Annual International Symposium on Computer Architecture

June 2014

566 pages

ISBN:9781479943944

General Chairs: Pen-Chung Yew (University of Minnesota), Antonia Zhai (University of Minnesota)
Program Chair: Steve Keckler (NVIDIA/University of Texas at Austin)

ACM SIGARCH Computer Architecture News Volume 42, Issue 3
ISCA '14
June 2014
552 pages
ISSN:0163-5964
DOI:10.1145/2678373
Editor: Doug DeGroot

Sponsors

IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

IEEE Press


Conference

ISCA'14

Sponsor:

IEEE TCCA
SIGARCH

ISCA'14: The 41st Annual International Symposium on Computer Architecture

June 14 - 18, 2014

Minneapolis, Minnesota, USA

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%


Bibliometrics

Total Citations: 82
Total Downloads: 934
Downloads (Last 12 months): 65
Downloads (Last 6 weeks): 11

Reflects downloads up to 03 Jan 2025

Citations

Cited By

Xu Y, He T, Sun R, Ma Y, Jin Y, Zou A, Mitra T, Young E, Xiong J (2022) SHAPE. Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, pp. 1-9. Online publication date: 30-Oct-2022.
https://dl.acm.org/doi/10.1145/3508352.3549409
Plano T, Buhler J (2021) Enabling Real-Time Irregular Data-Flow Pipelines on SIMD Devices. 50th International Conference on Parallel Processing Workshop, pp. 1-8. Online publication date: 9-Aug-2021.
https://dl.acm.org/doi/10.1145/3458744.3473367
Hunt T, Jia Z, Miller V, Szekely A, Hu Y, Rossbach C, Witchel E, Bhagwan R, Porter G (2020) Telekine. Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation, pp. 817-834. Online publication date: 25-Feb-2020.
https://dl.acm.org/doi/10.5555/3388242.3388301
Yeh T, Sabne A, Sakdhnagool P, Eigenmann R, Rogers T (2019) Pagoda. ACM Transactions on Parallel Computing 6:4, pp. 1-23. Online publication date: 19-Nov-2019.
https://dl.acm.org/doi/10.1145/3365657
Oh Y, Koo G, Annavaram M, Ro W, Manne S, Hunter H, Altman E (2019) Linebacker. Proceedings of the 46th International Symposium on Computer Architecture, pp. 183-196. Online publication date: 22-Jun-2019.
https://dl.acm.org/doi/10.1145/3307650.3322222
Fang Z, Hong D, Gupta R, Zink M, Toni L, Begen A (2019) Serving deep neural networks at the cloud edge for vision applications on mobile platforms. Proceedings of the 10th ACM Multimedia Systems Conference, pp. 36-47. Online publication date: 18-Jun-2019.
https://dl.acm.org/doi/10.1145/3304109.3306221
Volos S, Vaswani K, Bruno R, Arpaci-Dusseau A, Voelker G (2018) Graviton. Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation, pp. 681-696. Online publication date: 8-Oct-2018.
https://dl.acm.org/doi/10.5555/3291168.3291219
Tsukada K, Ino F (2018) A Method for Estimating Task Granularity for Automating GPU Cycle Sharing. Proceedings of the 2018 VII International Conference on Network, Communication and Computing, pp. 133-139. Online publication date: 14-Dec-2018.
https://dl.acm.org/doi/10.1145/3301326.3301386
Ausavarungnirun R, Miller V, Landgraf J, Ghose S, Gandhi J, Jog A, Rossbach C, Mutlu O (2018) MASK. ACM SIGPLAN Notices 53:2, pp. 503-518. Online publication date: 19-Mar-2018.
https://dl.acm.org/doi/10.1145/3296957.3173169
Chiu M, You Y (2018) Enabling OpenCL Preemptive Multitasking Using Software Checkpointing. Workshop Proceedings of the 47th International Conference on Parallel Processing, pp. 1-7. Online publication date: 13-Aug-2018.
https://dl.acm.org/doi/10.1145/3229710.3229725
