DOI: 10.5555/2665671.2665702
research-article

Enabling preemptive multiprogramming on GPUs

Published: 14 June 2014

Abstract

GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems usually run multiple applications from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. As a result, such systems cannot meet key requirements of multiprogrammed workloads, such as responsiveness, fairness, or quality of service.
In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, the average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.
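
As a rough illustration of what "distributing the GPU cores among concurrently running kernels according to their priorities" could look like, the sketch below hands out cores one at a time to whichever kernel is currently furthest below its priority-weighted share. This is not the policy from the paper, which is not specified in the abstract; the `Kernel` fields, the proportional-share rule, the tie-breaking order, and the core count of 12 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    priority: int   # hypothetical: larger value means more important
    max_cores: int  # hypothetical: most cores the kernel can occupy


def partition_cores(kernels, total_cores):
    """Split total_cores among kernels roughly in proportion to priority.

    Each core goes to the uncapped kernel whose (cores + 1) / priority
    ratio is smallest; ties go to the kernel listed first.
    """
    shares = {k.name: 0 for k in kernels}
    for _ in range(total_cores):
        candidates = [k for k in kernels if shares[k.name] < k.max_cores]
        if not candidates:
            break  # every kernel is saturated; remaining cores stay idle
        nxt = min(candidates, key=lambda k: (shares[k.name] + 1) / k.priority)
        shares[nxt.name] += 1
    return shares


def cores_to_preempt(old_shares, new_shares):
    """Cores each running kernel must give up to move to the new partition."""
    return {
        name: old - new_shares.get(name, 0)
        for name, old in old_shares.items()
        if old > new_shares.get(name, 0)
    }


# A low-priority kernel owns the whole (assumed 12-core) GPU, then a
# high-priority kernel arrives and the partition is recomputed.
low = Kernel("low", priority=1, max_cores=12)
high = Kernel("high", priority=3, max_cores=12)
before = partition_cores([low], 12)
after = partition_cores([high, low], 12)
release = cores_to_preempt(before, after)
```

In this toy setting the recomputed partition tells the scheduler how many cores the running kernel would have to vacate; freeing those cores is the job of a preemption mechanism such as the ones the abstract proposes.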



Published In

ISCA '14: Proceedings of the 41st Annual International Symposium on Computer Architecture
June 2014
566 pages
ISBN: 9781479943944

Publisher

IEEE Press

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%


Cited By

  • (2022) SHAPE. Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, pp. 1-9. DOI: 10.1145/3508352.3549409. Online publication date: 30-Oct-2022.
  • (2021) Enabling Real-Time Irregular Data-Flow Pipelines on SIMD Devices. 50th International Conference on Parallel Processing Workshop, pp. 1-8. DOI: 10.1145/3458744.3473367. Online publication date: 9-Aug-2021.
  • (2020) Telekine. Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation, pp. 817-834. DOI: 10.5555/3388242.3388301. Online publication date: 25-Feb-2020.
  • (2019) Pagoda. ACM Transactions on Parallel Computing 6:4, pp. 1-23. DOI: 10.1145/3365657. Online publication date: 19-Nov-2019.
  • (2019) Linebacker. Proceedings of the 46th International Symposium on Computer Architecture, pp. 183-196. DOI: 10.1145/3307650.3322222. Online publication date: 22-Jun-2019.
  • (2019) Serving deep neural networks at the cloud edge for vision applications on mobile platforms. Proceedings of the 10th ACM Multimedia Systems Conference, pp. 36-47. DOI: 10.1145/3304109.3306221. Online publication date: 18-Jun-2019.
  • (2018) Graviton. Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation, pp. 681-696. DOI: 10.5555/3291168.3291219. Online publication date: 8-Oct-2018.
  • (2018) A Method for Estimating Task Granularity for Automating GPU Cycle Sharing. Proceedings of the 2018 VII International Conference on Network, Communication and Computing, pp. 133-139. DOI: 10.1145/3301326.3301386. Online publication date: 14-Dec-2018.
  • (2018) MASK. ACM SIGPLAN Notices 53:2, pp. 503-518. DOI: 10.1145/3296957.3173169. Online publication date: 19-Mar-2018.
  • (2018) Enabling OpenCL Preemptive Multitasking Using Software Checkpointing. Workshop Proceedings of the 47th International Conference on Parallel Processing, pp. 1-7. DOI: 10.1145/3229710.3229725. Online publication date: 13-Aug-2018.
