
1 Introduction

Current data processing tasks require high-performance, energy-efficient computer systems with heterogeneous components. Massively parallel GPU architectures play a crucial role in accelerating parallel workloads such as streaming and ML-based applications. As recent technological developments make GPUs increasingly complex, more detailed research effort is required to increase execution performance and throughput. At the same time, the widespread use of GPUs has raised the energy consumption of GPU-based systems and made power a first-order concern.

Although GPUs offer large computational power, both performance and energy efficiency may degrade for memory-intensive workloads [2, 3]. Given performance-improvement and energy-efficiency concerns [4, 7, 10, 11, 14], reasoning about a GPU application's performance bottlenecks and interpreting its power consumption during execution requires analytical measurements and rigorous evaluation. While performance and energy improvements both contribute to efficient execution of GPU programs, they usually compete with each other, so design decisions become critical and complicated, requiring an evaluation of the tradeoffs between the two [1, 8, 12]. However, evaluating performance and relating power consumption only at kernel granularity hides much of the evidence needed for a thorough baseline analysis. Each instruction belonging to a warp, the smallest execution element in the GPU execution context, should be explicitly tracked throughout its execution on the cores, as is commonly done in multi-core architecture research. Despite that, NVIDIA's GPU profiler (i.e., the Nsight Compute tool), which presents occupancy, IPC, and memory utilization, and the state-of-the-art GPU simulation tools [6, 13] report performance and hardware metrics only at the kernel level. None of these tools directly reports a GPU program's dynamic performance, memory access behavior, and power consumption at runtime. Collecting a microarchitectural metric set requires additional effort, and several in-house, target-specific tools exist for monitoring runtime performance and power consumption. Nevertheless, repeatedly building such infrastructure to inspect GPU execution dynamically causes redundant effort across research studies.
While a set of profiling tools reports performance and power consumption information about GPU execution, microarchitecture simulators are essential for accurately modeling the hardware and for monitoring application execution behavior at both micro and macro scales while accounting for energy consumption.

In this work, we design and build GPPRMon, a runtime performance and power monitoring tool for GPU programs. We target program developers and system architects who aim to optimize GPU programs or the underlying hardware for both performance and energy efficiency, based on the dynamic behavior of memory accesses and thread blocks. Our simulation-based framework dynamically collects microarchitectural metrics by monitoring individual instructions and reports performance and power consumption at runtime. Our visualization interface presents both spatial and temporal views of the execution: the former demonstrates the performance and power metrics for each hardware component, including global memory and caches, while the latter shows the corresponding information at instruction granularity on a timeline. Our tool enables the user to perform a fine-granularity evaluation of the target execution by observing instruction-level microarchitectural features. To the best of our knowledge, this is the first work that monitors a GPU kernel's performance by visualizing the execution of instructions for multiple user-configurable scenarios, relating memory hierarchy utilization to performance, and tracking power dissipation at runtime. Our main contributions are as follows:

  • We propose a systematic metric collection that, by extending the GPGPU-Sim framework with a configurable sampling cycle, keeps track of instructions per cycle (IPC) per streaming multiprocessor (SM) for performance; instruction execution records for each warp, to clearly observe issues and completions; memory usage statistics per sub-unit, to interpret their effect on performance; and power dissipation statistics per GPU component at runtime.

  • By processing the collected metrics, we design and build a visualization framework that runs concurrently with the simulator and displays a kernel's execution status from three perspectives: 1) General View displays the IPC, access statistics of the L1D and L2 caches, row buffer utilization of the DRAM partitions, and the power dissipated by the main components over an execution interval. 2) Temporal View shows instruction details with issue and completion cycles for each thread block at warp level. In addition to power consumption statistics for the sub-components of an SM, we include L1D cache access statistics to relate them to the thread block's performance in the same execution interval. 3) Spatial View demonstrates the access information and power consumption for each on-chip L1D cache, the L2 cache in each sub-partition, and the row buffers of the DRAM banks in each memory partition.

  • We demonstrate the potential uses of our framework by performing experiments on a memory-intensive graph workload. Our tool enables users to perform detailed performance and power analyses of target GPU executions.

2 Background

Modern GPU architectures employ single-instruction-multiple-thread (SIMT) execution in their streaming multiprocessors (SMs). Each SM includes multiple warp schedulers, an instruction dispatcher, a register file, single- and double-precision ALUs, tensor cores, special function units (SFUs), and load/store units with on-chip memory. An interconnection network connects the SMs to off-chip memory partitions, on which the DRAM and last-level caches (LLC) are placed. Apart from loads and stores, all instructions use on-chip execution units. Load/store instructions, however, may require off-chip accesses whenever the requested data cannot be found in the L1D cache, and data access becomes slower as memory instructions move down the hierarchy.

Among the available simulators, GPU researchers mostly rely on GPGPU-Sim [6] (hereafter referred to as the simulator) for experimental studies targeting NVIDIA GPUs, as it has tracked developments in real hardware over the past two decades, such as adding support for tensor cores. The simulator provides functional and performance-driven modes: the functional mode enables developers to check a kernel's functional correctness, whereas the performance mode simulates the kernel for the configured GPU in a cycle-accurate manner. It officially supports many architectures, including Volta GV100. The AccelWattch power model [5], part of the simulator and supporting dynamic voltage-frequency scaling (DVFS), measures energy dissipation for GV100 with above 90% accuracy.

3 Methodology

We design and build the GPPRMon tool, which is available as open source, to monitor and visualize kernel performance and power consumption at runtime. Figure 1 displays the GPPRMon workflow, which consists of two main stages: metric collection and visualization. GPPRMon systematically calculates IPC rates per SM; records warp instructions with issue/completion cycles per thread block within each SM and the power dissipated among the components; and collects memory access statistics for each partition of the L1D/L2 caches and for the DRAM row buffers during a given execution interval, based on a user-provided configuration. Figure 1 also demonstrates examples of the power and performance metrics, as well as GPPRMon's visualizer with three views showing the performance overview, memory access statistics, and instruction execution timeline. We build our framework on top of the simulator, and both the metric collector and the visualizer are compatible with the official GPU configurations shipped with the simulator.

Fig. 1. A general workflow overview of the GPPRMon framework.

3.1 Metric Collection

GPPRMon systematically records microarchitectural performance and power metrics during the execution, as depicted in Fig. 1.

3.1.1 Performance Metrics

L1D and L2 Caches: GPPRMon cumulatively counts the accesses to the caches and exports the results at the end of the observation interval. A memory request's access status on a cache is one of the following: i) Hit: the data resides in the cache line; ii) Hit Reserved: the cache line is allocated for the requested data, but the data is not yet valid; iii) Miss: repeated sector misses on a cache line raise its dirty counter above the eviction threshold, so the line is evicted; iv) Reservation Failure: a line allocation or MSHR entry allocation fails, a request cannot be merged with an existing MSHR entry, or the miss queue is too full to hold new requests; v) Sector Miss: a memory request cannot find the data in the sector of a cache line (a sector is 32 B, whereas a cache line is 128 B); vi) MSHR Hit: the upcoming request's sector miss has already been recorded, and the request can merge with the existing entry.
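GPPRMon's collector is implemented in C++ inside the simulator; as a minimal, language-neutral sketch of the tally-and-export behavior described above, the following Python fragment keeps one counter per outcome and resets it at each interval boundary (the class and enum names are illustrative, not GPPRMon's actual identifiers):

```python
from collections import Counter
from enum import Enum

class AccessStatus(Enum):
    HIT = "hit"
    HIT_RESERVED = "hit_reserved"
    MISS = "miss"
    RESERVATION_FAILURE = "reservation_failure"
    SECTOR_MISS = "sector_miss"
    MSHR_HIT = "mshr_hit"

class CacheAccessCounter:
    """Cumulatively tallies access outcomes for one cache unit and exports
    (then resets) the counts at the end of each observation interval."""

    def __init__(self):
        self.counts = Counter()

    def record(self, status):
        # Called once per memory request probing this cache.
        self.counts[status] += 1

    def export_interval(self):
        # Snapshot this interval's counts and start the next one clean.
        snapshot = dict(self.counts)
        self.counts.clear()
        return snapshot
```

An exported snapshot of this kind is the per-unit record a view can later render for one sampling interval.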

Row Buffers of DRAM Banks: GPPRMon collects row buffer utilization metrics at runtime within each memory partition for L2 cache misses, since among global memory accesses, row buffer hits avoid the additional activation latency of accessing the DRAM cells.
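A row buffer hit simply means a request targets the row that is already open in its bank. Under an open-row policy, the bookkeeping can be sketched as follows (a deliberately simplified model; the simulator's actual DRAM timing is far more detailed):

```python
class RowBufferTracker:
    """Tracks the open row per DRAM bank under an open-row policy.
    A request to the currently open row is a row-buffer hit; anything
    else counts as a miss and activates the new row."""

    def __init__(self):
        self.open_row = {}  # bank id -> currently open row
        self.hits = 0
        self.misses = 0

    def access(self, bank, row):
        hit = self.open_row.get(bank) == row
        if hit:
            self.hits += 1
        else:
            self.misses += 1
            self.open_row[bank] = row  # activate the new row
        return hit

    def locality(self):
        # Fraction of accesses served from an already-open row.
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```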

IPC: GPPRMon calculates IPC rates for each SM by counting instruction issues through the active warp masks within each configurable sampling interval. IPC rates on the SMs oscillate during execution depending on workload behavior and memory traffic, and one can compare IPC values over time to figure out the relation between performance and behavior.
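The per-interval computation amounts to dividing the number of issued warp instructions by the window length. A hedged sketch, assuming an issue log of (cycle, active_mask) pairs collected by simulator hooks (the log format is our assumption, not GPPRMon's internal representation):

```python
def interval_ipc(issue_log, start, end):
    """IPC of one SM over the sampling window [start, end): the number of
    warp instructions issued in the window (those with a non-empty active
    mask) divided by the window length in cycles."""
    issued = sum(1 for cycle, active_mask in issue_log
                 if start <= cycle < end and active_mask)
    return issued / (end - start)
```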

Instruction Monitor: GPPRMon records the issue and completion cycles of instructions, together with the opcode, operands, and PC, at warp level within each thread block separately.
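Conceptually, each record pairs the static instruction fields with its dynamic issue and completion cycles; a sketch of such a record follows (field names are ours, not the tool's):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WarpInstructionRecord:
    """One monitored warp instruction within a thread block."""
    block_id: int
    warp_id: int
    pc: int
    opcode: str
    operands: str
    issue_cycle: int
    completion_cycle: Optional[int] = None  # None while still in flight

    def latency(self):
        """Cycles from issue to writeback, or None if not yet completed."""
        if self.completion_cycle is None:
            return None
        return self.completion_cycle - self.issue_cycle
```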

3.1.2 Power Metrics

We develop GPPRMon to systematically collect the power distribution across the SMs, memory partitions, and interconnection network. We implement the power metrics on top of AccelWattch [5], which models dynamic voltage-frequency scaling (DVFS). GPPRMon maintains the following measurements at runtime for each component (excluding idle SMs): Peak Dynamic (W), the maximum momentary power within the interval; Sub-threshold Leakage (W) and Gate Leakage (W), the power lost to current leakage; and Runtime Dynamic (W), the total consumed power. Moreover, GPPRMon can collect the power metrics either cumulatively from the start of a kernel's execution or separately for each sample.
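The aggregation rules reduce to taking a maximum for peak dynamic and a running sum for runtime dynamic, with an optional reset per sample window. A minimal sketch of this bookkeeping for a single component (our own naming, not AccelWattch's API):

```python
class PowerAggregator:
    """Aggregates momentary dynamic-power readings for one component.
    Peak dynamic is the maximum reading seen in the window; runtime
    dynamic accumulates the readings. Per-sample mode resets on export,
    cumulative mode keeps accumulating from kernel start."""

    def __init__(self, cumulative=False):
        self.cumulative = cumulative
        self._reset()

    def _reset(self):
        self.peak_dynamic = 0.0
        self.runtime_dynamic = 0.0

    def add_sample(self, dynamic_watts):
        self.peak_dynamic = max(self.peak_dynamic, dynamic_watts)
        self.runtime_dynamic += dynamic_watts

    def export(self):
        out = (self.peak_dynamic, self.runtime_dynamic)
        if not self.cumulative:
            self._reset()  # start the next sample window clean
        return out
```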

3.2 Visualization

By processing the collected metrics, GPPRMon depicts performance and power dissipation from three perspectives at runtime, as represented in Fig. 1, and enables a detailed view of an application's interaction with the hardware.

  1. General View presents the average memory access statistics, the overall IPC of the GPU, and the power dissipated among the major components, with application- and architecture-specific information;

  2. Spatial View displays the access statistics of all the memory units in the GPU device memory hierarchy and the overall power dissipated among the memory partitions, enabling monitoring of the entire GPU memory space;

  3. Temporal View demonstrates instruction execution statistics with activation intervals at warp level for user-specified thread blocks, L1D cache access characteristics, and the power distribution among the sub-components of the SMs, via the execution monitoring feature.

Fig. 2. General View displaying average cache, DRAM row buffer, and power statistics, along with kernel-specific information, for a specific cycle interval.

Figure 2, an example of our General View, presents an overview of the execution performance of Kernel 0 of SpMV [15] on a GV100 GPU. It displays the average memory access statistics across the active L1D caches, L2 caches, and DRAM banks; the average IPC across the active SMs; and the power dissipated by the major GPU sub-components within the interval [55000, 56000]. The view includes the grid dimension (i.e., 1784 thread blocks) and block dimension (i.e., 256 threads per block) along with the number of actively used SMs, so that users can see how thread blocks are mapped to the SMs. For instance, Kernel 0 executes with an IPC of 1.08 and uses the memory hierarchy inefficiently, due to high L1 miss and reservation failure rates in the corresponding interval. Moreover, the memory partitions account for 75% of the total power dissipation, which confirms that the SMs mostly stay idle in this execution.

Fig. 3. Spatial View explicitly displays memory access statistics for each component of the hierarchy within cycles at runtime. (Color figure online)

Figure 3, an example of our Spatial View, shows the memory access statistics across the GPU memory hierarchy. For the caches, green emphasizes the concentration of hit and hit-reserved accesses, red indicates the intensity of misses and sector misses, and blue indicates reservation failures in the miss queues or MSHRs. Similarly, DRAM bank pixels are colored with a mixture of red and blue to indicate row buffer misses and hits, respectively. The Spatial View provides a detailed analysis of memory hierarchy utilization and of the power dissipated in the memory partitions. We zoom in on some memory units in Fig. 3, which presents statistics for Kernel 0 in the cycle range [51000, 51500]. In that interval, the distinct L1D caches behave similarly: almost all of them turn blue due to a concentration of reservation failures.
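One plausible way to derive such a pixel color is to scale each RGB channel by the corresponding outcome's share of the interval's accesses; GPPRMon's exact mapping may differ, so treat this as an illustrative sketch:

```python
def cache_pixel_color(hits, misses, failures):
    """Map a cache unit's interval counts to an (R, G, B) pixel:
    green for hits (incl. hit-reserved), red for misses (incl. sector
    misses), blue for reservation failures, each channel scaled by that
    outcome's share of all accesses in the interval."""
    total = hits + misses + failures
    if total == 0:
        return (0, 0, 0)  # no accesses: leave the pixel dark
    return (
        round(255 * misses / total),    # R: miss intensity
        round(255 * hits / total),      # G: hit concentration
        round(255 * failures / total),  # B: reservation failures
    )
```

A unit dominated by reservation failures thus renders almost pure blue, matching the behavior described for the L1D caches above.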

Fig. 4. Temporal View monitors the instruction execution timeline together with on-chip cache performance and power consumption of the corresponding SM.

Figure 4, an example of our Temporal View, displays a thread block's execution statistics at warp level, L1D cache statistics, and the power dissipated by the core components over configurable execution intervals. It presents each warp's PTX instruction sequence. The Issue/Completion column indicates the execution start and writeback times of warp instructions within a thread block. For instance, Fig. 4 shows the execution monitoring of Thread Block 2 on SM 2 for Kernel 0 in the cycle range [8000, 8500]. The instruction dispatch unit issues two SP global loads, at PC = 368 and PC = 376, at cycles 8071 and 8072, and they complete at cycles 8179 and 8178, respectively. The Temporal View thus allows tracking the execution duration of each instruction.

GPPRMon's execution overhead varies with the monitoring interval: as the interval increases, the number of I/O operations for exporting results to output files decreases, reducing the impact on simulation time. To illustrate, simulating the SpMV benchmark [15] on the Higgs Twitter Mention dataset takes 98 min when recording both power and performance metrics every 5000 simulation cycles, while the baseline simulation (i.e., without collecting runtime performance and power metrics) completes in 88 min on the RTX 2060S configuration in our local infrastructure.
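From the reported numbers, the monitoring cost of this configuration is roughly (98 - 88)/88, about 11% extra simulation time; the arithmetic is simply:

```python
def monitoring_overhead(instrumented_min, baseline_min):
    """Relative slowdown of the instrumented run over the baseline run."""
    return (instrumented_min - baseline_min) / baseline_min

# For the SpMV run above: (98 - 88) / 88 is roughly 0.11, i.e. ~11% overhead.
```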

4 A Case Study: Performance Bottleneck Analysis and Its Power Impacts for a Memory-Intensive Workload

We run a CUDA implementation of PageRank (PR) on the Volta-based GV100 to analyze a memory-bound GPU program and its irregular memory access behavior with GPPRMon. The implementation iterates over the Contribution Step (K0), Pull Step (K1), and Linear Normalization (K2) kernels, and the number of iterations varies with the data size. Since the total elapsed cycles indicate that K1 dominates the execution at 99.7%, we focus on that kernel, monitoring it with runtime observation intervals of 100, 500, 1000, 2500, 5000, 10000, 25000, 50000, and 100000 cycles.

Fig. 5. Detailed average memory access statistics for L1D caches, L2 caches, and DRAM row buffers in the cycle range of [5000, 100000].

Figure 5, part of the General View, shows the average access statistics of the memory units over cycles [5000, 100000]. After the caches warm up (i.e., after 10000 GPU cycles), the average miss rate of the L1D caches oscillates within [0.14, 0.82], while sector misses, which the simulator does not report separately, vary within [0.05, 0.31], with the metrics sampled every 20 cycles without accumulation. We can state that data pollution exists in the L1D caches, which prevents exploiting cache locality. For example, K1 does not exploit locality in the L1D cache, since MSHR hits among L1D misses oscillate only slightly, within [0.03, 0.08], during the execution. Moreover, the Web-Stanford [9] graph is five times larger than the L2 cache; even so, the overall hit rate of the L2 caches is quite high, although the L1D cache utilization reveals the sparseness of the data. While the aggregate performance metrics hide the L2 statistics, since misses are counted before warm-up at kernel launch, the actual L2 hit rate oscillates within [0.82, 0.95] when sampling every 500 cycles. The row buffer locality varies within [0.2, 0.85] in an unstable manner, which confirms data sparsity throughout the execution.

Fig. 6. Instruction execution timeline for two load instructions on SM0 in the cycle range of [5000, 30000].

Figure 6 displays the instruction issue/completion timeline of the 8 TBs (CTAs) running on SM0; the first and second lines correspond to the load instructions at PC = 296 (a DP load) and PC = 312 (an SP load), respectively. We merge multiple snapshots of the Temporal View for SM0 in Fig. 6 to evaluate the performance of both load instructions at once. Figure 7, a snapshot of the Spatial View, shows the memory access statistics of representative components within the same interval. We follow the access statistics across the memory hierarchy in Fig. 7 and relate the observations to the issue/completion durations of the loads in Fig. 6.

After the kernel launch, each thread collects thread-specific information from parameter memory, which takes 250–450 cycles, to process the target data addressed by the thread's private registers. The warp schedulers dispatch the load instructions at PC = 296 (ld.global.u64), and all eight warps of TB24 (shown as CTA_ID = 24 in the figure) execute the instruction after Cycle 5455. Furthermore, Fig. 6 reveals that SM0 dispatches the load instructions of the remaining TBs in the interval [5470, 5786], after issuing the load instructions of TB24. Figure 7 verifies that no accesses occur on some of the L1D caches, and that none of the L2 caches and DRAM banks are accessed during the preparation time in [5000, 5500]. According to the execution timeline in Fig. 6, none of the data brought into the L1D cache of SM0 by the warps of TB24 after Cycle 6087 (Warp 6) enables an early completion of the instruction at PC = 296 for the remaining thread blocks. We highlight some of the long latencies within each TB for each instruction separately. Additionally, the high reservation failure rate and the absence of MSHR hits in the L1D cache of SM0 confirm that locality utilization among the TBs on SM0 is significantly low for the first load. If there were locality utilization in the L1D cache, we would observe either larger MSHR hit rates or short access latencies for the completions of the same instructions just after Cycle 6087. Furthermore, Fig. 7 reveals that the reservation-failed requests at PC = 296 cause misses on the L2 caches without MSHR merging. Thus, memory requests of the same instruction from different SMs cannot benefit from locality in the L2 cache partitions and cause traffic in the lower levels of the memory hierarchy.

Figure 7 also reveals that the L1D access status mostly turns into hits after Cycle 6000. Unlike the load at PC = 296, the load at PC = 312 (ld.global.u32) usually hits. The second line of each TB in Fig. 6 shows that completion takes far fewer cycles for the loads at PC = 312. To illustrate, while TB504 completes the first load within 2133 cycles, the second takes 26 cycles, except for Warp 63, whose requests miss in both the L1D and L2 caches. As a result, the loads at PC = 296 complete in the range of [350, 2250] cycles, whereas the loads at PC = 312 take less than 50 cycles for most warps, owing to the increasing hits in the L1D cache. Still, the loads at PC = 296 delay the issue of the second loads due to their excessive latency. Consequently, the data locality utilization of different memory instructions executing on the same SM may differ for sparse data processing. One may follow such runtime behavior changes by evaluating cache behavior with the Spatial View and memory operation execution statistics with the Temporal View at the same time.

Fig. 7. Sequential memory access statistic snapshots for some of the L1D caches, L2 caches, and a DRAM row buffer in the cycle range of [5000, 9500].

Table 1. Average dissipated power, in milliwatts, during K1's execution on GV100 in the cycle range [5000, 10000].

The power dissipation in Table 1, obtained with GPPRMon, matches the performance observations. The registers load thread-specific data (from parameter memory) during cycles 5000–5500, causing higher power consumption in the SMs. Over the following 4500 cycles, the memory partitions' power dissipation exceeds that of the SMs. In addition, the power consumed by the LD/ST unit is high due to the intense memory operations and the pressure on the L1D cache, while the other units, apart from the register file portion of the execution unit, consume less after Cycle 5500. The results in Table 1 show that DRAM accounts for most of the power dissipated in the memory partitions, owing to intense use of the high-bandwidth interface. As a result, when irregular memory accesses increase, causing idle SMs and stalled memory pipelines, the consumed power decreases along with performance. However, some load instructions operate at a quarter of the ideal speed; hence, energy may be wasted because of the inefficient memory access behavior.

In this case study, we describe how a memory-bound application occupies the GPU hardware, using architectural visualizations that include detailed microarchitectural metrics and the application's instruction execution timeline to explicitly point out long latencies. In addition, we relate the power consumption patterns to the corresponding runtime execution behavior of the GPU application. In this context, GPPRMon allows us to investigate the runtime execution behavior of GPU applications at the instruction level. During the execution, users can evaluate how each memory component is exploited spatially, and while tracking both GPU occupancy and workload behavior, they can also track the dissipated power at runtime.

5 Conclusion

GPPRMon provides a systematic runtime collection of instruction monitoring, performance, memory access, and power consumption metrics, together with a multi-perspective visualizer that displays the workload's performance and execution statistics, the occupancy of the memory hierarchy, and the power dissipation results, enabling baseline analyses on GPUs at runtime. GPPRMon will help the community conduct baseline analyses of GPU performance and power dissipation and eliminate the need for additional in-house efforts to build runtime monitoring and profiling support.