Abstract
Graphics Processing Units (GPUs) provide highly efficient parallel execution for high-performance computing and embedded system domains. While performance concerns drive the main optimization efforts, power issues become important for energy-efficient GPU execution. Although performance profilers and architectural simulators offer statistics about the target execution, they either present only performance metrics at a coarse kernel-function level or lack the visualization support that enables performance bottleneck analysis or performance-power consumption comparison. Evaluating both performance and power consumption dynamically at runtime and across GPU memory components enables a comprehensive tradeoff analysis for GPU architects and software developers. This paper presents GPPRMon, a novel memory performance and power monitoring tool for GPU programs, which performs a systematic metric collection and offers useful visualization views to track power and performance optimizations. Our simulation-based framework dynamically collects microarchitectural metrics by monitoring individual instructions and reports the achieved performance and power consumption at runtime. Our visualization interface presents spatial and temporal views of the execution: the first demonstrates the performance and power metrics across GPU memory components, while the latter shows the corresponding information at the instruction granularity in a timeline. Our case study reveals potential usages of our tool in bottleneck identification and power consumption analysis for a memory-intensive graph workload.
1 Introduction
Current data processing tasks require high-performance and energy-efficient computer systems with heterogeneous components. Massively parallel GPU architectures play a crucial role in accelerating parallel workloads such as streaming- and ML-based applications. Since recent technological developments have made GPUs more complex systems, increasing execution performance and throughput requires more detailed research effort. The widespread use of GPUs has also raised the energy consumption of GPU-based systems, making power issues important to address.
Although GPUs have large computational power, both performance and energy efficiency may decrease for memory-intensive workloads [2, 3]. Given the performance-improvement and energy-efficiency concerns [4, 7, 10, 11, 14], reasoning about a GPU application's performance bottlenecks and interpreting its power consumption during the execution require more analytical measurements and rigorous evaluations. While both performance and energy improvements contribute to efficient execution of GPU programs, they usually compete with each other, so design decisions become critical and complicated, requiring an evaluation of the tradeoffs between the two factors [1, 8, 12]. However, evaluating performance and relating it to power consumption only at the kernel level hides most of the evidence needed to conduct a baseline analysis from many perspectives. Each instruction belonging to a warp, the smallest execution element in the GPU execution context, should be explicitly investigated throughout its execution on the cores, as is commonly done in multi-core architecture research. Despite that, NVIDIA's GPU profiler (i.e., the Nsight Compute tool), which presents occupancy, IPC, and memory utilization, and the state-of-the-art GPU simulation tools [6, 13] report performance and hardware metrics only at the kernel level. None of these tools directly reports a GPU program's dynamic performance, memory access behavior, and power consumption at runtime. Collecting a microarchitectural metric set requires additional effort, and several in-house, target-specific tools exist for monitoring runtime performance and power consumption. Nevertheless, repeatedly building such instrumentation to inspect GPU execution dynamically causes redundant effort across research studies. While a set of profiling tools reports performance and power consumption information about GPU execution, microarchitecture simulators are significant in that they accurately model the hardware and monitor application execution behavior at both micro and macro scales while accounting for energy consumption.
In this work, we design and build GPPRMon, a runtime performance and power monitoring tool for GPU programs. We target program developers and system architects who aim to optimize GPU programs or the hardware, considering both performance improvement and energy efficiency, based on the dynamic behavior of memory accesses and thread blocks. Our simulation-based framework dynamically collects microarchitectural metrics by monitoring individual instructions and reports performance and power consumption at runtime. Our visualization interface presents both spatial and temporal views of the execution: the first demonstrates the performance and power metrics for each hardware component, including global memory and caches, while the latter shows the corresponding information at the instruction granularity in a timeline. Our tool enables the user to perform a fine-granularity evaluation of the target execution by observing instruction-level microarchitectural features. To the best of our knowledge, this is the first work that monitors a GPU kernel's performance by visualizing the execution of instructions for multiple user-configurable scenarios, relating memory hierarchy utilization with performance, and tracking power dissipation at runtime. Our main contributions are as follows:
- We propose a systematic metric collection, built by extending the GPGPU-Sim framework, that keeps track of instructions per cycle (IPC) per streaming multiprocessor (SM) for performance, instruction execution records for each warp to clearly observe issues and completions, memory usage statistics per sub-unit to interpret their effect on performance, and power dissipation statistics per GPU component at runtime, all with a configurable sampling interval.
- By processing the metrics, we design and build a visualization framework that runs concurrently with the simulator and displays a kernel's execution status from three perspectives: 1) the General View displays the IPC, the access statistics of the L1D and L2 caches, the row buffer utilization of the DRAM partitions, and the power dissipated by the main components over an execution interval; 2) the Temporal View shows instruction details with issue and completion cycles for each thread block at the warp level, and, in addition to power consumption statistics for the sub-components of an SM, it includes L1D cache access statistics to relate them to the thread block's performance in the same execution interval; 3) the Spatial View demonstrates the access information and power consumption of each on-chip L1D cache, the L2 cache in each sub-partition, and the row buffers of the DRAM banks in each memory partition.
- We demonstrate the potential usages of our framework by performing experiments on a memory-intensive graph workload. Our tool enables users to perform detailed performance and power analysis of target GPU executions.
2 Background
Modern GPU architectures employ single-instruction-multiple-thread (SIMT) execution in their streaming multiprocessors (SMs). Each SM includes multiple warp schedulers, an instruction dispatcher, a register file, single- and double-precision ALUs, tensor cores, special function units (SFUs), and load/store units with on-chip memory. An interconnection network connects the SMs to the off-chip memory partitions, which host the DRAM and the last-level caches (LLC). Apart from loads and stores, all instructions utilize on-chip execution units. Load/store instructions, however, may require off-chip accesses whenever the requested data cannot be found in the L1D cache, and data accesses become slower as memory instructions move down the hierarchy.
Among the available simulators, GPU researchers mostly use GPGPU-Sim [6] (hereafter referred to as the simulator) to conduct experimental studies targeting NVIDIA GPUs, as it has evolved over the previous two decades by tracking developments in real hardware, such as covering tensor cores. The simulator provides functional and performance-driven modes: the functional mode enables developers to check a kernel's functional correctness, whereas the performance mode simulates the kernel on the configured GPU in a cycle-accurate manner. It officially supports many architectures, including Volta GV100. The AccelWattch power model [5], which is part of the simulator and supports dynamic voltage-frequency scaling (DVFS), measures energy dissipation for GV100 with above 90% accuracy.
3 Methodology
We design and build the GPPRMon tool, which is available as open source (Footnote 1), to monitor and visualize kernel performance and power consumption at runtime. Figure 1 displays the GPPRMon workflow, which consists of two main stages: (1) Metric Collection and (2) Visualization. GPPRMon systematically calculates IPC rates per SM, records warp instructions with issue/completion cycles per thread block within each SM together with the power dissipated among the components, and collects memory access statistics for each partition of the L1D/L2 caches and the DRAM row buffers during the given execution interval, based on a configuration provided by the user. Figure 1 also includes examples of the collected power and performance metrics and shows GPPRMon's visualizer with three views that present a performance overview, memory access statistics, and the instruction execution timeline. We build our framework on top of the simulator, and both the metric collector and the visualizer are compatible with the official GPU configurations shipped with the simulator.
3.1 Metric Collection
GPPRMon systematically records microarchitectural performance and power metrics during the execution, as depicted in Fig. 1.
3.1.1 Performance Metrics
L1D and L2 Caches: GPPRMon cumulatively counts the accesses to the caches and exports the results at the end of the observation interval. A memory request's access status in a cache is one of the following: i) Hit: the data resides in the cache line; ii) Hit Reserved: the cache line is allocated for the requested data, but the data is not yet valid; iii) Miss: the cache line accumulates enough sector misses that its dirty counter exceeds the line-eviction threshold, and the line is evicted; iv) Reservation Failure: a line allocation, an MSHR entry allocation, or a merge with an existing MSHR entry fails, or the miss queue is full and cannot hold new requests; v) Sector Miss: a memory request cannot find the data in the sector of a cache line (a sector is 32 B, whereas a cache line is 128 B); vi) MSHR Hit: the sector miss of the incoming request has already been recorded, and the request can merge with the existing MSHR entry.
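To make the bookkeeping concrete, the following Python sketch models how such per-interval access counters could be aggregated and exported; the enum mirrors the states above, but the class, method, and field names are illustrative assumptions rather than GPPRMon's actual hooks, which live inside the simulator's C++ cache model.

```python
from collections import Counter
from enum import Enum, auto

class AccessStatus(Enum):
    HIT = auto()
    HIT_RESERVED = auto()
    MISS = auto()
    RESERVATION_FAILURE = auto()
    SECTOR_MISS = auto()
    MSHR_HIT = auto()

class CacheAccessMonitor:
    """Cumulative per-cache access counters, exported at the end of each observation interval."""
    def __init__(self, cache_name):
        self.cache_name = cache_name
        self.counts = Counter()

    def record(self, status: AccessStatus):
        self.counts[status] += 1

    def export_and_reset(self, cycle):
        # Emit one sample for the finished observation interval, then restart the counters.
        sample = {"cache": self.cache_name, "cycle": cycle,
                  **{s.name: self.counts[s] for s in AccessStatus}}
        self.counts.clear()
        return sample

# Example: one L1D monitor sampled at the end of an interval (hypothetical values)
l1d = CacheAccessMonitor("L1D_SM0")
for s in (AccessStatus.MISS, AccessStatus.MSHR_HIT, AccessStatus.HIT):
    l1d.record(s)
print(l1d.export_and_reset(cycle=56000))
```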
Row Buffers of DRAM Banks: GPPRMon collects row buffer utilization metrics at runtime within each memory partition for the L2 cache misses, since among global memory accesses, row buffer hits avoid the additional activation latency of accessing the DRAM cells.
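As an illustration, row buffer utilization can be approximated by remembering the currently open row of each bank; the sketch below is a simplified open-row model with hypothetical bank/row identifiers, not the simulator's DRAM timing code.

```python
class RowBufferMonitor:
    """Track row-buffer hits/misses per DRAM bank under a simple open-row policy."""
    def __init__(self):
        self.open_row = {}   # bank id -> currently open row
        self.hits = 0
        self.misses = 0

    def access(self, bank, row):
        if self.open_row.get(bank) == row:
            self.hits += 1           # row already open: no activation needed
        else:
            self.misses += 1         # precharge and activate the new row
            self.open_row[bank] = row

    def utilization(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

rb = RowBufferMonitor()
for bank, row in [(0, 12), (0, 12), (0, 40), (1, 7)]:
    rb.access(bank, row)
print(f"row-buffer locality: {rb.utilization():.2f}")
```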
IPC: GPPRMon calculates IPC rates for each SM by counting instruction issues through the active warp masks within each configurable sampling interval. IPC rates on SMs oscillate during the execution depending on workload behavior and memory traffic, and comparing IPC rates helps figure out the relation between performance and workload behavior.
Instruction Monitor: GPPRMon records the issue and completion cycles of instructions, together with their opcode, operands, and PC, at the warp level and separately for each thread block.
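A minimal sketch of such a record, and of how a per-SM IPC over a sampling window could be derived from it, is given below; the field names and the helper function are assumptions made for illustration, and the example cycles are taken from the Temporal View example discussed in Sect. 3.2.

```python
from dataclasses import dataclass

@dataclass
class WarpInstRecord:
    sm_id: int
    block_id: int
    warp_id: int
    pc: int
    opcode: str                   # e.g. "ld.global.u64"
    issue_cycle: int
    completion_cycle: int = -1    # filled in at writeback

    @property
    def latency(self):
        return self.completion_cycle - self.issue_cycle

def ipc_per_sm(records, sm_id, start, end):
    """IPC over [start, end): warp instructions issued on this SM divided by interval length."""
    issued = sum(1 for r in records
                 if r.sm_id == sm_id and start <= r.issue_cycle < end)
    return issued / (end - start)

# Two loads of Thread Block 2 on SM 2 (values from the Temporal View example)
records = [
    WarpInstRecord(2, 2, 0, 368, "ld.global.u64", 8071, 8179),
    WarpInstRecord(2, 2, 0, 376, "ld.global.u64", 8072, 8178),
]
print(records[0].latency, ipc_per_sm(records, sm_id=2, start=8000, end=8500))
```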
3.1.2 Power Metrics
We develop GPPRMon to systematically collect the power distribution across the SMs, the memory partitions, and the interconnection network. We implement the power metrics on top of AccelWattch [5], which includes modeling for dynamic voltage-frequency scaling (DVFS). GPPRMon maintains the following measurements at runtime for each component, excluding idle SMs: Peak Dynamic (W), the maximum momentary power within the interval; Sub-threshold Leakage (W) and Gate Leakage (W), the power lost to current leakage; and Runtime Dynamic (W), the total consumed power. Moreover, GPPRMon supports collecting the power metrics either cumulatively from the start of a kernel's execution or separately for each sample.
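The per-component bookkeeping could resemble the following sketch, in which peak dynamic power is the maximum over the interval's samples and runtime dynamic power is accumulated; the class is an illustrative model around AccelWattch-style values, not AccelWattch's actual interface.

```python
class ComponentPowerMonitor:
    """Aggregate AccelWattch-style power samples for one GPU component."""
    def __init__(self, name, cumulative=False):
        self.name = name
        self.cumulative = cumulative   # keep totals across intervals, or reset per sample
        self.reset()

    def reset(self):
        self.peak_dynamic = 0.0
        self.runtime_dynamic = 0.0
        self.subthreshold_leakage = 0.0
        self.gate_leakage = 0.0

    def add_sample(self, dynamic_w, sub_leak_w, gate_leak_w):
        self.peak_dynamic = max(self.peak_dynamic, dynamic_w)  # maximum momentary power
        self.runtime_dynamic += dynamic_w                      # total consumed power
        self.subthreshold_leakage += sub_leak_w
        self.gate_leakage += gate_leak_w

    def export(self):
        out = {"component": self.name,
               "peak_dynamic_w": self.peak_dynamic,
               "runtime_dynamic_w": self.runtime_dynamic,
               "subthreshold_leakage_w": self.subthreshold_leakage,
               "gate_leakage_w": self.gate_leakage}
        if not self.cumulative:
            self.reset()
        return out

sm0 = ComponentPowerMonitor("SM0")
sm0.add_sample(12.4, 1.1, 0.2)   # hypothetical sample values in watts
sm0.add_sample(15.0, 1.1, 0.2)
print(sm0.export())
```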
3.2 Visualization
By processing the collected metrics, GPPRMon depicts performance and power dissipation from three perspectives at runtime, as represented in Fig. 1, and enables pointing out the detailed interaction of an application with the hardware:
- (i) General View presents the average memory access statistics, the overall IPC of the GPU, and the power dissipated by the major components, together with application- and architecture-specific information;
- (ii) Spatial View displays the access statistics of all the memory units in the GPU device memory hierarchy and the overall power dissipated among the memory partitions, enabling monitoring of the entire GPU memory space;
- (iii) Temporal View demonstrates instruction execution statistics with activation intervals at the warp level for user-specified thread blocks, L1D cache access characteristics, and the power distribution among the sub-components of SMs when the execution monitoring feature is activated.
Figure 2, an example of our General View, presents an overview of the execution performance of Kernel 0 for SpMV [15] on the GV100 GPU device. It displays the average memory access statistics among the active L1D caches, L2 caches, and DRAM banks; the average IPC value among the active SMs; and the power dissipated by the major GPU sub-components within the interval [55000, 56000]. The view includes the grid dimension (i.e., 1784 thread blocks) and block dimension (i.e., 256 threads per block), together with the number of actively used SMs, so that users can see how thread blocks are mapped to SMs. For instance, Kernel 0 executes with an IPC rate of 1.08 and uses the memory hierarchy inefficiently due to high L1D miss and reservation failure rates in the corresponding interval. Moreover, the memory partitions consume 75% of the total power dissipation, which confirms that the SMs mostly stay idle for this execution.
Figure 3, an example of our Spatial View, shows the memory access statistics across the GPU memory hierarchy. On the caches, green emphasizes the concentration of hit and hit-reserved accesses, red indicates miss and sector-miss intensity, and blue indicates reservation failures in the miss queues or MSHRs. Similarly, DRAM bank pixels are colored with a mixture of red and blue to indicate row buffer misses and hits, respectively. The Spatial View provides a detailed analysis of memory hierarchy utilization and of the power dissipated in the memory partitions. We zoom in on some memory units in Fig. 3, which presents statistics for Kernel 0 in the cycle interval [51000, 51500]. In that interval, the distinct L1D caches behave similarly: almost all of them turn blue due to the concentration of reservation failures.
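The color encoding can be reproduced with a simple RGB mixing rule; the sketch below, assuming matplotlib and a hypothetical per-cache statistics dictionary, maps hit-like accesses to green, misses to red, and reservation failures to blue. It is an illustration rather than GPPRMon's actual rendering code.

```python
import matplotlib.pyplot as plt
import numpy as np

def cache_color(stats):
    """Mix an RGB color from access fractions: green=hits, red=misses, blue=reservation failures."""
    total = sum(stats.values()) or 1
    red   = (stats.get("MISS", 0) + stats.get("SECTOR_MISS", 0)) / total
    green = (stats.get("HIT", 0) + stats.get("HIT_RESERVED", 0)) / total
    blue  = stats.get("RESERVATION_FAILURE", 0) / total
    return (red, green, blue)

# Hypothetical per-L1D statistics for one observation interval
l1d_stats = [
    {"HIT": 5, "MISS": 10, "RESERVATION_FAILURE": 120},   # mostly blue: reservation failures dominate
    {"HIT": 80, "MISS": 15, "RESERVATION_FAILURE": 5},    # mostly green: hits dominate
]
colors = np.array([cache_color(s) for s in l1d_stats]).reshape(1, -1, 3)
plt.imshow(colors)
plt.title("L1D caches, cycles [51000, 51500] (illustrative)")
plt.axis("off")
plt.show()
```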
Figure 4, an example of our Temporal View, displays a thread block's execution statistics at the warp level, L1D cache statistics, and the power dissipated by core components over configurable execution intervals. It presents each warp's PTX instruction sequence, and the Issue/Completion column indicates the execution start and writeback times of warp instruction segments within a thread block. For instance, Fig. 4 shows the execution monitoring of Thread Block 2 on SM 2 for Kernel 0 in the cycle range [8000, 8500]. The instruction dispatch unit issues two SP global loads, at PC = 368 and PC = 376, at cycles 8071 and 8072, and they complete at cycles 8179 and 8178, respectively. The Temporal View thus allows tracking the execution duration of each instruction.
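Such a timeline can be approximated from the issue/completion records with matplotlib; the sketch below reuses the two loads mentioned above and is an illustrative rendering, not GPPRMon's visualizer implementation.

```python
import matplotlib.pyplot as plt

# (pc, issue cycle, completion cycle) for Thread Block 2 on SM 2, from the example above
insts = [(368, 8071, 8179), (376, 8072, 8178)]

fig, ax = plt.subplots()
for i, (pc, issue, done) in enumerate(insts):
    # One horizontal bar per instruction, spanning issue to completion
    ax.broken_barh([(issue, done - issue)], (i - 0.4, 0.8))
    ax.text(done + 5, i, f"PC={pc} ({done - issue} cycles)", va="center")
ax.set_yticks(range(len(insts)))
ax.set_yticklabels([f"PC={pc}" for pc, _, _ in insts])
ax.set_xlabel("GPU cycle")
ax.set_title("Warp instruction activity, TB2 on SM2 (illustrative)")
plt.show()
```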
GPPRMon's execution overhead varies with the monitoring interval; as the interval increases, the number of I/O operations for exporting results to output files decreases, reducing the impact on simulation time. To illustrate, simulating the SpMV benchmark [15] on the Higgs Twitter Mention data takes 98 minutes when recording both power and performance metrics every 5000 simulation cycles, while the baseline simulation (i.e., without collecting runtime performance and power metrics) completes in 88 minutes on the RTX 2060S configuration in our local infrastructure, roughly an 11% overhead.
4 A Case Study: Performance Bottleneck Analysis and Its Power Impacts for a Memory-Intensive Workload
We run the Page Ranking (PR) CUDA implementation to analyze a memory-bound GPU program with irregular memory accesses using GPPRMon on the Volta-based GV100. The implementation iterates over the Contribution Step (K0), Pull Step (K1), and Linear Normalization (K2) kernels, and the number of iterations varies with the data size. Since the total cycles elapsed by the application indicate that K1 dominates the execution at 99.7%, we focus on that kernel and monitor it with runtime observation intervals of 100, 500, 1000, 2500, 5000, 10000, 25000, 50000, and 100000 cycles.
Figure 5, which is part of the General View, shows the average access statistics of the memory units in the cycle range [5000, 100000]. After the caches warm up (i.e., after 10000 GPU cycles), the average miss rate on the L1D caches oscillates in [0.14, 0.82], while sector misses, which the simulator does not report separately, vary in [0.05, 0.31], with the metrics sampled every 20 cycles without accumulation. We can state that data pollution exists in the L1D caches, which prevents exploiting cache locality. For example, K1 does not utilize locality in the L1D cache, since the MSHR hits among L1D misses oscillate only slightly, in [0.03, 0.08], during the execution. Moreover, although the Web-Stanford [9] graph is five times larger than the L2 cache, the overall hit rate on the L2 caches is quite high, even though the L1D cache utilization reveals the data sparseness. While the aggregate performance metrics hide the L2 statistics, as misses are counted before warm-up at kernel launch, the actual L2 hit rate oscillates in [0.82, 0.95] when sampled every 500 cycles. The row buffer locality varies in [0.2, 0.85] in an unstable manner, which confirms the data sparsity throughout the execution.
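Assuming the interval samples are exported as tabular rows (a hypothetical schema chosen for illustration), the oscillation ranges reported above can be derived with a few lines of pandas:

```python
import pandas as pd

# Hypothetical schema: one row per L1D cache per 20-cycle sample (illustrative values)
df = pd.DataFrame({
    "cycle":       [10000, 10020, 10040, 10060],
    "HIT":         [40, 10, 55, 20],
    "MISS":        [50, 80, 30, 60],
    "SECTOR_MISS": [10, 10, 15, 20],
    "MSHR_HIT":    [3, 5, 2, 4],
})
accesses = df[["HIT", "MISS", "SECTOR_MISS"]].sum(axis=1)
miss_rate = df["MISS"] / accesses
mshr_among_misses = df["MSHR_HIT"] / df["MISS"]
print("L1D miss rate range:", (round(miss_rate.min(), 2), round(miss_rate.max(), 2)))
print("MSHR hits among misses:", (round(mshr_among_misses.min(), 2), round(mshr_among_misses.max(), 2)))
```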
Figure 6 displays the instruction issue/completion timeline of 8 TBs (CTAs) running on SM0; the first and second lines for each TB correspond to the load instructions at PC = 296 (a DP load) and PC = 312 (an SP load), respectively. We merge multiple snapshots of the Temporal View for SM0 in Fig. 6 to evaluate the performance of both load instructions at the same time. Figure 7, a snapshot of the Spatial View, shows the memory access statistics of representative components within the same interval. We follow the access statistics of the memory hierarchy in Fig. 7 and relate the observations to the issue/completion durations of the loads in Fig. 6.
After the kernel launch, each thread collects thread-specific information from parameter memory, which takes 250-450 cycles, to process the target data addressed by the thread's private registers. The warp schedulers dispatch the load instructions at PC = 296 (ld.global.u64), and all eight warps of TB24 (presented as CTA_ID = 24 in the figure) execute the instruction after Cycle 5455. Furthermore, Fig. 6 reveals that SM0 dispatches the load instructions of the remaining TBs in the interval [5470, 5786], after issuing the load instructions of TB24. Figure 7 verifies that no accesses occur on some of the L1D caches, and that none of the L2 caches or DRAM banks are accessed during this preparation time in [5000, 5500]. According to the execution timeline in Fig. 6, none of the data brought to the L1D cache of SM0 by the warps of TB24 after Cycle 6087 (Warp 6) provides an early completion of the instruction at PC = 296 for the remaining thread blocks. We bold some of the long latencies within each TB for each instruction separately. Additionally, the high reservation failure rate and the absence of MSHR hits on the L1D cache of SM0 in Fig. 7 confirm that locality utilization among the TBs on SM0 is significantly low for the first load. If there were locality utilization in the L1D cache, we would observe either larger MSHR hit rates or small latencies for the completions of the same instructions just after Cycle 6087. Furthermore, Fig. 7 reveals that the reservation-failed requests at PC = 296 cause misses on the L2 caches without MSHR merging. Thus, memory requests of the same instruction issued from different SMs cannot benefit from locality in the L2 cache partitions and cause traffic in the lower levels of the memory hierarchy.
Figure 7 also reveals that the access status of the L1D caches mostly turns to hits after Cycle 6000. Unlike the loads at PC = 296, the loads at PC = 312 (ld.global.u32) usually hit. The second line for each TB in Fig. 6 shows that completion takes far fewer cycles for the loads at PC = 312. To illustrate, while TB504 completes the first load instructions within 2133 cycles, the second instructions take 26 cycles, apart from Warp 63, whose requests miss in both the L1D and L2 caches. As a result, the loads at PC = 296 complete their execution in the range of [350, 2250] cycles, whereas the loads at PC = 312 take fewer than 50 cycles for most of the warps due to the increasing hits in the L1D cache. Still, the loads at PC = 296 delay the issue of the second load instructions because of their excessive latency. Consequently, the data locality utilization of different memory instructions executing on the same SM may vary when processing sparse data. One may follow such runtime behavior changes by evaluating the cache behavior with the Spatial View and the memory operation execution statistics with the Temporal View at the same time.
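This comparison can be automated by grouping the instruction records by PC and summarizing their latencies; the sketch below uses illustrative values in the record format assumed earlier.

```python
from collections import defaultdict
from statistics import mean

# (pc, issue cycle, completion cycle) tuples collected by the instruction monitor (illustrative values)
records = [(296, 5455, 7588), (296, 5470, 6900), (312, 7600, 7626), (312, 7610, 7660)]

by_pc = defaultdict(list)
for pc, issue, done in records:
    by_pc[pc].append(done - issue)

for pc, lat in sorted(by_pc.items()):
    print(f"PC={pc}: n={len(lat)}, min={min(lat)}, mean={mean(lat):.0f}, max={max(lat)}")
```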
The dissipated power in Table 1, obtained with GPPRMon, matches the performance observations. Registers load thread-specific data (from parameter memory) during cycles 5000-5500, causing higher power consumption on the SMs. In the following 4500 cycles, the memory partitions dissipate more power than the SMs. In addition, the power consumed by the LD/ST unit is high due to the intense memory operations and the pressure on the L1D cache, while the other units, apart from the register-file portion of the execution unit, consume less after Cycle 5500. The results in Table 1 show that DRAM accounts for most of the power dissipated in the memory partitions due to intense use of its high bandwidth. As a result, when irregular memory accesses increase, causing idle SMs and stalled memory pipelines, the consumed power decreases along with performance. However, some load instructions operate at a quarter of the ideal speed; hence, part of the consumed energy may be wasted because of inefficient memory access behavior.
In this case study, we describe how a memory-bound application occupies the GPU hardware using architectural visualizations that include detailed microarchitectural metrics and the application's instruction execution timeline, which explicitly point out long latencies. In addition, we relate the power consumption patterns to the corresponding runtime execution behavior of the GPU application. In this context, GPPRMon allows us to investigate the runtime execution behavior of GPU applications at the instruction level. During the execution, users can evaluate how each memory component is exploited spatially, and, while tracking both GPU occupancy and workload behavior, they can also track the dissipated power at runtime via GPPRMon.
5 Conclusion
GPPRMon provides a systematic runtime collection of instruction monitoring, performance, memory access, and power consumption metrics, together with a multi-perspective visualizer that displays the workload's performance and execution statistics, the occupancy of the memory hierarchy, and the dissipated power, enabling baseline analyses on GPUs at runtime. GPPRMon will help conduct baseline analyses in the literature on GPU performance and power dissipation and eliminate the need for additional in-house efforts to build runtime monitoring and profiling support.
References
1. Guerreiro, J., Ilic, A., Roma, N., Tomás, P.: DVFS-aware application classification to improve GPGPUs energy efficiency. Parallel Comput. 83, 93-117 (2019)
2. Hong, J., Cho, S., Kim, G.: Overcoming memory capacity wall of GPUs with heterogeneous memory stack. IEEE Comput. Archit. Lett. 21(2), 61-64 (2022)
3. Jain, P., et al.: Checkmate: breaking the memory wall with optimal tensor rematerialization. CoRR abs/1910.02653 (2019). http://arxiv.org/abs/1910.02653
4. Jog, A., et al.: OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance. In: Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 395-406 (2013)
5. Kandiah, V., et al.: AccelWattch: a power modeling framework for modern GPUs. In: International Symposium on Microarchitecture (MICRO), pp. 738-753 (2021)
6. Khairy, M., Shen, Z., Aamodt, T.M., Rogers, T.G.: Accel-Sim: an extensible simulation framework for validated GPU modeling. In: International Symposium on Computer Architecture (ISCA), pp. 473-486 (2020)
7. Koo, G., Oh, Y., Ro, W.W., Annavaram, M.: Access pattern-aware cache management for improving data utilization in GPU. In: International Symposium on Computer Architecture (ISCA), pp. 307-319 (2017)
8. Krzywaniak, A., Czarnul, P., Proficz, J.: GPU power capping for energy-performance trade-offs in training of deep convolutional neural networks for image recognition. In: Computational Science - ICCS 2022, pp. 667-681 (2022)
9. Leskovec, J., Lang, K., Dasgupta, A., Mahoney, M.: Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Math. 6 (2008)
10. Lew, J., et al.: Analyzing machine learning workloads using a detailed GPU simulator. In: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 151-152 (2019)
11. O'Neil, M.A., Burtscher, M.: Microarchitectural performance characterization of irregular GPU kernels. In: IEEE International Symposium on Workload Characterization (IISWC), pp. 130-139 (2014)
12. Sun, Y., et al.: Evaluating performance tradeoffs on the Radeon open compute platform. In: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 209-218 (2018)
13. Ubal, R., Jang, B., Mistry, P., Schaa, D., Kaeli, D.: Multi2Sim: a simulation framework for CPU-GPU computing. In: International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 335-344 (2012)
14. Vijaykumar, N., Ebrahimi, E., Hsieh, K., Gibbons, P.B., Mutlu, O.: The locality descriptor: a holistic cross-layer abstraction to express data locality in GPUs. In: International Symposium on Computer Architecture (ISCA), pp. 829-842 (2018)
15. Xu, Z., Chen, X., Shen, J., Zhang, Y., Chen, C., Yang, C.: GARDENIA: a graph processing benchmark suite for next-generation accelerators. ACM J. Emerg. Technol. Comput. Syst. 15(1), 1-13 (2019)
Acknowledgement
This work was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK), Grant No: 122E395. This work is partially supported by CERCIRAS COST Action CA19135 funded by COST Association.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.