Collecting sufficient microarchitecture performance data is essential for performance evaluation and workload characterization. A modern processor exposes many events to be monitored but only a few hardware performance monitoring counters (PMCs), so multiplexing is commonly adopted. However, state-of-the-art profiling tools commonly group events inefficiently when multiplexing PMCs, which risks inaccurate measurement and misleading analysis. Commercial tools can fully leverage PMCs, but they are closed source and support only their specified platforms. To this end, we propose an approach for efficient cross-platform microarchitecture performance measurement via adaptive grouping, aiming to improve the metrics' sampling ratios. The approach generates event groups based on the number of available PMCs detected on arbitrary machines while avoiding the scheduling pitfall of the Linux perf_event subsystem. We evaluate our approach with SPEC CPU 2017 on four mainstream x86-64 and AArch64 processors and conduct comparative analyses of efficiency with two other state-of-the-art tools, LIKWID and ARM Top-down Tool. The experimental results indicate that our approach gains around 50% improvement in the average sampling ratio of metrics without compromising correctness or reliability.
1 Introduction
Modern processors commonly provide performance monitoring units (PMUs), which are widely applied for workload characterization, performance evaluation, troubleshooting, and bottleneck identification. A PMU is typically implemented as a set of hardware performance monitoring counters (PMCs) that monitor various microarchitecture performance events, such as clock cycles, retired instructions, cache misses, and branch mispredictions. The counts of events are usually used to derive informative performance metrics, such as Cycles Per Instruction (CPI), Cache Misses Per Kilo Instructions (MPKI), and branch misprediction rate.
The number of PMCs is relatively limited compared to the number of events that need to be measured. The key challenge is how to measure sufficient events or metrics with limited PMCs. A widely used solution is multiplexing, i.e., grouping events, scheduling each group to take turns using PMCs [2, 5], and estimating results based on the actual count and the sampling ratio [34, 37].
Currently, the Linux kernel integrates the perf_event subsystem to support PMC access [47, 48]. It also implements a round-robin-style event scheduling policy for multiplexing [18]. Various user-mode interfaces and tools are developed on top of Linux perf_event, such as PAPI [5], LIKWID [43], and Linux perf [18].
Event groups are critical for multiplexing and for the reliability of the derived metrics. Usually, the events used to calculate a metric are required to be in the same event group so that they are measured simultaneously. Existing profiling tools either implement predefined event groups, such as LIKWID [43], or dynamically generate event groups based on the required metrics, such as ARM Top-down Tool [31, 36]. However, existing methods and tools still have a few limitations:
(1)
Inefficient grouping. Tools mainly focus on the reliability of metrics while neglecting the efficiency of multiplexing PMCs, which results in the underutilization of PMCs. For example, the predefined event groups in LIKWID [43] contain only two to four events, usually fewer than the number of available PMCs. Commercial tools, such as Intel VTune/EMON [8, 11], can fully leverage PMCs; however, they do not disclose their grouping strategies and support only their specified platforms.
(2)
Event scheduling pitfall in Linux perf_event. When scheduling events to multiplex PMCs, Linux perf_event does not distinguish between fixed PMC events and generic PMC events. Current tools and interfaces are implemented directly or indirectly on top of Linux perf_event [16, 26], so this scheduling pitfall potentially results in the underutilization of PMCs and unfair scheduling.
(3)
Unawareness of available PMCs. The root cause of inefficient measurement is unawareness of the number of available PMCs. Tools therefore design event groups conservatively and cannot generate adaptively efficient event groups across different platforms [31, 36]. The unawareness may stem from the difficulty of obtaining the number of available PMCs. (a) Although the number of existing PMCs can be dumped through the CPUID machine instruction on x86-64 platforms,1 the number of available PMCs is not necessarily equal to it, because some processes may occupy PMCs, such as Non-Maskable Interrupt (NMI) watchdogs [18] or other profilers. (b) Hardware vendors may customize the architecture for application-specific requirements, especially in the latest ARM-based processors, which means that the number of PMCs cannot always be obtained accurately from the documentation.
The underutilization of PMCs results in low sampling ratios for metrics, which ultimately affects the accuracy of measurement [4] and risks misleading analysis results. In addition, underutilization necessitates more event groups. In cloud or serverless scenarios, applications are usually short-lived and latency-sensitive [44]. More event groups entail longer measurement periods and higher performance overhead, which may be unacceptable for troubleshooting or performance evaluation in these scenarios.
To this end, we propose an approach for efficient cross-platform microarchitecture performance measurement by multiplexing PMCs via adaptive grouping. The contributions of this article are as follows:
(1)
We propose a purely data-driven and easy-to-use method for detecting the number of available PMCs across mainstream x86-64 and AArch64 platforms. On that basis, we propose an efficient method of multiplexing the detected PMCs to generate adaptive event groups without compromising the reliability of performance measurement. To the best of our knowledge, we are the first to propose adaptive grouping to trade off the many performance events against the few PMCs for efficient microarchitecture performance measurement across different platforms.
(2)
We disclose the detailed event scheduling policy for multiplexing PMCs in the Linux perf_event subsystem based on a thorough analysis of the Linux kernel source code. We reveal the scheduling pitfall that the kernel scheduler does not distinguish between fixed PMCs and generic PMCs. Moreover, we mitigate the pitfall in user mode to guarantee the efficiency of performance measurement.
(3)
We implement our approach and develop a cross-platform microarchitecture performance data collector, hperf.2 The tool is developed in Python on top of Linux perf. It is open source and currently validated on mainstream x86-64 and AArch64 processors.
(4)
We verify the correctness of our approach using SPEC CPU 2017. In addition, we evaluate its efficiency in performance evaluation scenarios. Compared to state-of-the-art tools, including LIKWID and ARM Top-down Tool, the experiments demonstrate that our tool gains around 50% improvement in the average sampling ratio of performance metrics.
The remainder of this article is organized as follows: Section 2 introduces the preliminaries of microarchitecture performance data collection. Section 3 presents two motivating examples to illustrate the existing problems. Section 4 gives an overview of the approach. Our approach contains three methods, which are described in detail in Sections 5–7, respectively. In Section 8, we present the experiments that evaluate the correctness and efficiency of our approach. Section 9 summarizes related work on microarchitecture performance measurement, and Section 10 concludes this article.
2 Preliminaries
Collecting microarchitecture performance data requires the hardware, the Operating System (OS), and user-mode profiling tools to collaborate. In Linux, the collection typically relies on the perf_event subsystem, which supports PMC access and provides other useful features for profiling, such as process attaching, generalized events, and PMC multiplexing [47, 48]. Figure 1 illustrates the control flow and the data flow based on Linux perf_event.
Fig. 1.
2.1 Performance Monitoring Units
A PMU is a hardware component inside a modern processor that monitors specified microarchitecture events and counts their occurrences. It supports collecting events covering various aspects, such as the instruction pipeline, cache, memory, storage, and IO, and the number of supported events has reached into the hundreds.
A PMU usually comprises a set of PMCs and their control registers. PMCs can be subdivided into fixed PMCs and generic PMCs. Fixed PMCs monitor only designated events, usually the most frequently used ones, such as clock cycles and retired instructions, while generic PMCs monitor the other supported events. The control registers are responsible for enabling or disabling the counters and specifying which events to monitor.
The number of PMCs is small compared to the number of supported events [3, 32]. In modern processors, the number of fixed PMCs is usually between 1 and 4, while the number of generic PMCs is usually between 4 and 8 [1, 9, 29].3 Since a PMC can measure only one event at a time, the number of events that can be measured simultaneously is limited by the number of available PMCs. Therefore, resolving the conflict between the many events to be monitored and the few PMCs available is a key challenge when using PMUs for measurement.
2.2 Linux perf_event Subsystem
PMCs and their control registers are usually model-specific registers. Generally, these registers cannot be accessed directly in user mode,4 so support from the operating system is necessary. Since Linux 2.6, the perf_event subsystem has been officially introduced and included in the Linux kernel. It provides OS-native support for PMU access and greatly facilitates the development of user-mode profiling tools and interfaces.
The notable features of Linux perf_event are listed as follows [47]:
(1)
Providing platform-independent aliases for commonly used hardware events (e.g., cycles, instructions, branch-misses), termed generalized events.5 Users can specify these events by generalized event names rather than event encoding.
(2)
Adopting multiplexing automatically, under a round-robin-style scheduling policy, when the events to be measured cannot all be assigned to the underlying available PMCs simultaneously, so that every event has the opportunity to be monitored by PMCs.
(3)
Supporting event groups to schedule all events in an event group simultaneously when multiplexing PMCs for measurement.
The subsystem provides interfaces in the form of system calls for profiling tools in user mode. Through this, the raw counts of events can be obtained.
2.3 Performance Metrics
Through the interfaces of Linux perf_event, profiling tools can obtain raw event counts, which require further processing in user mode. Since events are not continuously monitored when multiplexing, the raw counts are not the actual counts, and estimation is required. Profiling tools are typically responsible for processing raw counts, including estimating actual counts and calculating and reporting informative performance metrics for subsequent analysis.
A workable estimation approach is linear scaling, adopted by Linux perf, i.e., scaling the raw count of each event according to the time it occupies PMCs, as shown in Equation (1) [18, 34]:
\[ \hat{c} = c_{\text{raw}} \times \frac{t_{\text{total}}}{t_{\text{enabled}}}, \tag{1} \]
where \(\hat{c}\), \(c_{\text{raw}}\), \(t_{\text{enabled}}\), and \(t_{\text{total}}\) represent the estimated count, the raw count, the enabled time (the time the event actually occupies a PMC), and the total measurement time, respectively. The ratio \(t_{\text{enabled}} / t_{\text{total}}\) represents the sampling ratio of an event. Usually, higher sampling ratios contribute to higher accuracy of measurements [4].
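To make the estimation concrete, the following minimal Python sketch applies Equation (1); the function and its inputs are illustrative rather than part of any particular tool.

```python
# A minimal sketch of linear scaling (Equation (1)), assuming the raw
# count and times have been parsed from perf's counting-mode output.
def estimate_count(c_raw: int, t_enabled: float, t_total: float) -> float:
    """Scale the raw count by the inverse of the sampling ratio."""
    if t_enabled == 0:
        return 0.0  # the event never occupied a PMC
    sampling_ratio = t_enabled / t_total
    return c_raw / sampling_ratio  # c_hat = c_raw * t_total / t_enabled

# Example: 1,000,000 occurrences counted while the event occupied a PMC
# for 40% of the measurement yield an estimate of 2,500,000.
print(estimate_count(1_000_000, t_enabled=0.4, t_total=1.0))
```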
The counts are usually used to derive informative performance metrics for subsequent analysis. For example, CPI, a fundamental metric for evaluating the effectiveness of instruction pipelines [22], can be derived from the counts of the events cycles and instructions [52].
2.4 Event Groups
To ensure the reliability of derived metrics, all events related to a metric should be monitored simultaneously. Therefore, we usually set these events in the same event group, and different event groups take turns using PMCs for measurement. Otherwise, the reliability of the derived performance metrics cannot be guaranteed.
In this article, for the sake of precision, an event list is denoted as an ordered \(n\)-tuple \(L\), where \(L\) contains \(n\) event groups (\(n \ge 1\)), i.e., \(L = (G_1, G_2, \ldots , G_n)\). For each event group \(G_i (i = 1,2,\ldots , n)\), it is denoted as a set that contains \(m\) events (\(m \ge 1\)), i.e., \(G_i = \lbrace e_{i_1}, e_{i_2}, \ldots , e_{i_m} \rbrace\). The number of events contained in \(G_i\) is denoted as \(| G_i |\).
Here is an example illustrating the importance of event groups for reliable metrics. To evaluate the performance of the branch predictor in the instruction pipeline, we calculate the branch misprediction rate (denoted as BrMispRate). The metric is derived from the counts of two events: branch-misses (retired mispredicted branches, denoted as \(e_4\)) and branches (retired branches, denoted as \(e_5\)). The formula is \(\text{BrMispRate} = e_4 / e_5\). Suppose there are three other independent events \(e_1, e_2, e_3\) to be monitored and only four available generic PMCs. If we do not declare an event group for \(e_4\) and \(e_5\), then the event list is \((e_1, e_2, e_3, e_4, e_5)\); if we do, then it is \((e_1, e_2, e_3, \lbrace e_4, e_5\rbrace)\). The scheduling of the two different event lists is illustrated in Figure 2.
Fig. 2.
Without declaring event groups, using either the raw or the estimated counts to calculate the metric BrMispRate is unreliable, because branch-misses and branches are not counted simultaneously. For example, if a large number of branch mispredictions occur in the fifth time slice, where branches is counted while branch-misses is not counted, then the metric BrMispRate will be underestimated. The scheduling policy of Linux perf_event is elaborated in Section 6.
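For illustration, such grouping can be expressed with the brace syntax of perf stat. The Python sketch below builds both command lines; the three independent events are stand-ins chosen from perf's generalized events, and ./workload is a hypothetical program.

```python
import subprocess

INDEPENDENT = "cache-misses,cache-references,bus-cycles"  # e1, e2, e3

# Ungrouped: branch-misses and branches may occupy PMCs in different
# time slices once multiplexing is triggered.
ungrouped = ["perf", "stat", "-e",
             f"{INDEPENDENT},branch-misses,branches", "--", "./workload"]

# Grouped: the braces declare {branch-misses,branches} as one event
# group, so both events are scheduled in the same time slices and
# BrMispRate = branch-misses / branches stays reliable.
grouped = ["perf", "stat", "-e",
           f"{INDEPENDENT},{{branch-misses,branches}}", "--", "./workload"]

subprocess.run(grouped, check=True)
```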
Therefore, declaring event groups is necessary for the reliability of derived metrics. Open source profiling tools expose the definition or generating strategy of event groups, such as LIKWID [43] and ARM Top-down Tool [31, 36]. They typically have predefined event groups or dynamically generate them as needed.
3 Motivating Examples
We present two motivating examples to illustrate the inefficiency that commonly exists in state-of-the-art profiling tools.
3.1 Irregular Sampling Ratios
We introduce a phenomenon encountered in measurement as a motivating example to reveal the scheduling pitfall in Linux perf_event. Linux perf is a component of Linux tools and provides various command-line options for profiling [18, 19]. It is implemented based on Linux perf_event. For counting-mode profiling, it reports the raw or estimated counts of the specified events and their sampling ratios. If multiplexing is not triggered, then all reported sampling ratios should be 100.00%.
In a system with an Intel Cascade Lake processor, we used Linux perf to count several selected microarchitecture events and observed the reported sampling ratio of each event. The results are listed in Table 1. When passing the first seven events to Linux perf, we confirmed that multiplexing was not triggered. After passing one more event, however, the reported sampling ratios became confusing, for the following reasons:
Table 1. Irregular Sampling Ratios (Intel Cascade Lake, Linux 5.15)

| # | Event\(^{1}\) | PMC type | Sampling ratio\(^{2}\) (7 events) | Sampling ratio\(^{2}\) (8 events) |
|---|---|---|---|---|
| \(e_1\) | inst_retired.any | Fixed | 100.00% | 62.56% |
| \(e_2\) | cpu_clk_unhalted.thread | Fixed | 100.00% | 75.15% |
| \(e_3\) | cpu_clk_unhalted.ref_tsc | Fixed | 100.00% | 87.57% |
| \(e_4\) | mem_inst_retired.all_loads | Generic | 100.00% | 87.59% |
| \(e_5\) | mem_inst_retired.all_stores | Generic | 100.00% | 87.57% |
| \(e_6\) | l1d.replacement | Generic | 100.00% | 87.58% |
| \(e_7\) | l2_lines_in.all | Generic | 100.00% | 87.57% |
| \(e_8\) | itlb_misses.stlb_hit | Generic | —\(^{3}\) | 49.41% |
| | Sum of sampling ratios | | 700.00% | 625.00% |

\(^1\)These events are architecture-defined and provided by hardware vendors.
\(^2\)The output of Linux perf retains two decimal places by default.
\(^3\)The symbol “—” indicates that the corresponding event is not included in the event list in this iteration, hereafter inclusive.
(1)
The first three events are monitored by fixed PMCs, yet their sampling ratios are not 100.00%.
(2)
Although the round-robin algorithm aims to give all events an equal opportunity to occupy PMCs, there is an unexpectedly large variation in the sampling ratios across different events, up to about 37.50%. Ideally, all events should have approximately equal sampling ratios when multiplexing is triggered.
(3)
Given that there are three fixed PMCs and four generic PMCs configured in this processor, the sum of the sampling ratios decreased from 700.00% to 625.00% when the number of monitored events increased from 7 to 8. This indicates that some underlying PMCs were left unused during the measurement; otherwise, the sum should remain close to the number of available PMCs.
The phenomenon originates from the pitfall of conflating fixed PMCs and generic PMCs in Linux perf_event. We explain it in detail in Section 6.
3.2 Inefficient Event Groups
We present two scheduling examples under Linux perf_event to discuss the shortcomings of inefficient event groups. Since event groups are set to ensure reliable metrics, one possible method is to group all events for a performance metric into one event group, which is adopted by ARM Top-down Tool. For example, if we want to measure the metrics L1 MPKI, L2 MPKI, and last-level cache (LLC) MPKI, then three event groups will be generated. Another possible method is to group a category of events into one event group, where these events are related to a specific hardware component and metrics can be derived from them. For example, LIKWID has various predefined event groups named L2CACHE, L3CACHE, BRANCH, and so on.
Usually, one metric requires two or three events to be monitored simultaneously, so the number of events in each group does not exceed the number of available PMCs, and both methods are feasible for scheduling.
However, these event groups may be sub-optimal, since they are not generated adaptively according to the available PMCs, resulting in underutilization. Besides, under the event scheduling policy of Linux perf_event (detailed in Section 6), these event groups may lead to unbalanced sampling ratios for each metric.
For example, suppose there are four available generic PMCs and seven events monitored by generic PMCs (denoted \(e_1\) to \(e_7\)), which are used to calculate three metrics. According to the method above, the event groups are \((\lbrace e_1, e_2 \rbrace , \lbrace e_3, e_4, e_5 \rbrace , \lbrace e_6, e_7 \rbrace)\). The resulting scheduling is illustrated in Figure 3.
Fig. 3.
Note that underutilization occurs in some time slices of the scheduling. Besides, the sampling ratios are unbalanced: the first event group has a distinctly higher sampling ratio (2/3) than the other two event groups (1/3). Although the round-robin algorithm is designed to ensure fair scheduling, such event groups contradict this purpose.
If the number of available PMCs is known in advance, then the event group \(G_1\) and \(G_3\) can be merged intuitively, i.e., the event groups will be \((\lbrace e_1, e_2, e_6, e_7 \rbrace , \lbrace e_3, e_4, e_5 \rbrace)\). Comparing the two different results, the latter has fairer scheduling and better PMC utilization (improved from 75.00% to 87.50%). In addition, the average sampling ratio is improved from 4/9 to 1/2, because the scheduling period is reduced.
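These numbers can be reproduced with a small simulation of the round-robin policy described later in Section 6; the sketch below tracks group sizes only and assumes a pass ends at the first group that does not fit in the remaining PMCs.

```python
# A sketch simulating round-robin multiplexing of event groups over
# generic PMCs; a PMC assignment failure terminates the current pass.
def simulate(group_sizes, n_pmcs, passes):
    queue = list(range(len(group_sizes)))
    scheduled = [0] * len(group_sizes)  # passes in which each group ran
    used = 0                            # PMC slots actually occupied
    for _ in range(passes):
        free = n_pmcs
        for g in queue:
            if group_sizes[g] > free:
                break                   # failure ends this pass early
            free -= group_sizes[g]
            scheduled[g] += 1
            used += group_sizes[g]
        queue.append(queue.pop(0))      # rotate the group queue
    ratios = [s / passes for s in scheduled]
    return ratios, used / (n_pmcs * passes)

print(simulate([2, 3, 2], n_pmcs=4, passes=3))  # ratios 2/3, 1/3, 1/3; utilization 75.00%
print(simulate([4, 3], n_pmcs=4, passes=2))     # ratios 1/2, 1/2; utilization 87.50%
```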
From this example, as long as the number of available PMCs is known in advance, optimizing event groups for better adaption to the scheduling policy and leveraging underlying PMCs is possible.
4 Approach Overview
To address the problems mentioned above, we propose an approach for adaptively multiplexing PMCs, aiming at efficient cross-platform microarchitecture performance measurement, as shown in Figure 4. The approach spans the whole process of collecting microarchitecture performance data. From the bottom up, its components are as follows:
Fig. 4.
(1)
For hardware, detecting the number of available PMCs provided by Linux perf_event on arbitrary machines via a purely data-driven method. (Section 5)
(2)
For operating system, mitigating the existing pitfall of the scheduling policy for multiplexing PMCs, based on a thorough analysis of Linux perf_event. (Section 6)
(3)
For applications (profiling tools), adaptively generating efficient event groups for multiplexing PMCs based on the detected number of available PMCs, without losing reliability. (Section 7)
5 Detection of Available PMCs
The number of available PMCs, i.e., the PMCs provided by Linux perf_event that can actually be used for measurement, is vital for better PMC utilization and efficient microarchitecture performance measurement. Under the scheduling policy of Linux perf_event, we need to set appropriate event groups according to the number of available PMCs. Therefore, the detection is the basis for generating adaptive event groups.
Although the number of existing PMCs in a processor can be obtained from the documentation, and x86-64 processors support the CPUID instruction to dump this information, the number of available PMCs that can be used for measurement is not necessarily equal to the number of existing PMCs. There may be processes occupying PMCs, such as NMI watchdogs or other profilers, which means that not all existing PMCs can be used for measurement. In addition, we found an example where the actual number of existing PMCs did not match the documentation.
Of course, we could analyze the kernel behavior by debugging or tracing to get the number of available PMCs. However, such an approach would be overly complicated and difficult to apply in the development of measurement tools. Hence, we believe that a user-mode method is preferable. In this section, we describe an easy-to-use and purely data-driven method for detecting the number of available PMCs provided by Linux perf_event.
5.1 Method Ideas
The key idea is to find the maximum number of events that can be monitored simultaneously in Linux perf_event by extending an event list until the events are enough to occupy all available PMCs. At that stage, multiplexing is triggered and the sampling ratios can be easily used for inferring the number of available PMCs. Two factors may interfere with the PMC detection: fixed PMCs and event constraints.
First, for excluding the effects caused by fixed PMCs, we pin the events possibly monitored by fixed PMCs to avoid the pitfall of the scheduling policy. Hence, the available PMCs will be fully utilized and thus the number of available PMCs can be easily inferred. Second, for excluding the effects caused by event constraints, we apply a trial-and-error approach when adding an event to the event list to ensure the last added event does not have conflicts with the former events.
We handle the two factors properly in our method. The flowchart of the method is shown in Figure 5. We will elaborate on each step of the method in the following subsections.
Fig. 5.
5.2 Steps of the Detection
In the method, we conduct detection by initializing an event list and gradually increasing the number of events included until covering all available PMCs. After adding an event to the event list, we decide which action to take depending on the reported sampling ratios from Linux perf_event. The action may be (a) adding another event, (b) replacing the last added event, or (c) terminating the detection and outputting the result.
5.2.1 Initialization.
The only prerequisite is knowledge of the events monitored by fixed PMCs on a specific processor. In the motivating example of irregular sampling ratios (Section 3.1), we noted that events monitored by fixed PMCs may have a negative impact on PMC utilization under the current scheduling policy of Linux perf_event. Only when all available PMCs are fully utilized can the number of available PMCs be easily inferred from the sampling ratios of events. Hence, these events should be excluded from scheduling: we use the "pinned" mechanism of Linux perf_event to make them escape from scheduling.6
The initial event lists for mainstream processors are summarized in Table 2 [9, 16, 29, 30]. This information is sourced from official documents, so the initial event list covers all events monitored by fixed PMCs. Consequently, the subsequent events added to the event list are monitored by generic PMCs, and only these events are involved in scheduling.
Table 2. The Initial Event Lists for Detecting Available PMCs on Mainstream Platforms

| Platform | Initial event list |
|---|---|
| Intel Cascade Lake | (instructions*, cycles*, ref-cycles*) |
| Intel Ice Lake | (instructions*, cycles*, ref-cycles*, slots*) |
| AArch64 | (cycles*) |
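As a sketch of how a tool might drive this step, the snippet below encodes the initial event lists of Table 2 and shells out to perf stat; the ":D" event modifier requests pinning in perf's event syntax, mirroring the pinned attribute of perf_event_open, and the helper names are our own.

```python
import subprocess

# Initial event lists from Table 2; "*" in the article denotes a pinned
# event, expressed on the perf command line with the ":D" modifier.
INITIAL_EVENTS = {
    "cascadelake": ["instructions:D", "cycles:D", "ref-cycles:D"],
    "icelake":     ["instructions:D", "cycles:D", "ref-cycles:D", "slots:D"],
    "aarch64":     ["cycles:D"],
}

def run_perf(events, workload=("sleep", "1")):
    # Counting mode; when multiplexing, perf appends each event's
    # sampling ratio to its output lines (e.g., "(79.81%)").
    cmd = ["perf", "stat", "-e", ",".join(events), "--", *workload]
    return subprocess.run(cmd, capture_output=True, text=True).stderr

print(run_perf(INITIAL_EVENTS["aarch64"]))
```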
5.2.2 Extension.
The initial event list usually does not contain enough events to cover all available PMCs, in which case all sampling ratios are 100.00%. Therefore, we need to add more events to utilize all available PMCs fully. The added events can be selected from the output of perf list, as long as they do not duplicate events already in the list; these added events are not pinned and are monitored by generic PMCs.
For example, we select an event branch-misses in a system with an Intel Cascade Lake processor, and thus the extended event list for this iteration will be \((\texttt{instructions}^*, \texttt{cycles}^*, \texttt{ref-cycles}^*, \texttt{branch-misses})\). Then we pass the event list to Linux perf_event to conduct a measurement.
Extending the event list aims to reach the stage where all events except the pinned ones have approximately equal sampling ratios below 100.00%, as the "Pinned" column of Table 3 illustrates. The former condition determines whether there is a conflict between events due to event constraints, while the latter determines whether all available PMCs are utilized. At that stage, multiplexing is triggered and the maximum number of events is assigned to the available PMCs in each pass of scheduling, i.e., no PMC is left unused during the measurement. Therefore, the sum of the sampling ratios of all events should equal the number of available PMCs.
Table 3. Mitigation of the Scheduling Pitfall in Linux perf_event (Intel Cascade Lake, Linux 5.15)

| # | Event | PMC type | Sampling ratio\(^{1}\) (Default) | Sampling ratio\(^{1}\) (Pinned) |
|---|---|---|---|---|
| \(e_1\) | inst_retired.any | Fixed | 62.56% (5/8) | 100.00% (5/5) |
| \(e_2\) | cpu_clk_unhalted.thread | Fixed | 75.15% (6/8) | 100.00% (5/5) |
| \(e_3\) | cpu_clk_unhalted.ref_tsc | Fixed | 87.57% (7/8) | 100.00% (5/5) |
| \(e_4\) | mem_inst_retired.all_loads | Generic | 87.59% (7/8) | 79.81% (4/5) |
| \(e_5\) | mem_inst_retired.all_stores | Generic | 87.57% (7/8) | 79.81% (4/5) |
| \(e_6\) | l1d.replacement | Generic | 87.58% (7/8) | 79.88% (4/5) |
| \(e_7\) | l2_lines_in.all | Generic | 87.57% (7/8) | 80.19% (4/5) |
| \(e_8\) | itlb_misses.stlb_hit | Generic | 49.41% (4/8) | 80.25% (4/5) |
| | Sum of sampling ratios | | 625.00% (50/8) | 700.00% (35/5) |

\(^1\)The fractions in parentheses are the theoretical sampling ratios extrapolated from the scheduling policy of Linux perf_event.
Note that we add events to the event list one by one, instead of selecting enough events at once, because of the existence of event constraints. We discuss this in detail in the next subsection.
5.3 Handling Event Constraints
Most events monitored by generic PMCs can be monitored by any available PMC. For a minority, however, there are special restrictions on PMC assignment. One type of event constraint specifies that some events can only be monitored by one specific generic PMC or a subset of PMCs [16]. Some of these constraints are accessible through public documentation [12], while others are not publicly available.
Figure 6 illustrates the impact of event constraints on scheduling and sampling ratios. Without event constraints, the events scheduled in each pass are smoothly assigned to available PMCs; PMC assignment failures happen only when all PMCs are occupied. In this case, the sampling ratios of the events should be approximately equal according to the round-robin scheduling algorithm, so we can easily infer the number of available PMCs once multiplexing is triggered. However, if some events have constraints, then they will interfere with PMC assignment during scheduling. For example, the specific PMC required by an event may already be occupied by a previously assigned event, which causes an assignment failure. Under the scheduling policy of Linux perf_event (detailed in Section 6.1), a PMC assignment failure terminates the pass of scheduling and leaves some available PMCs unused. This reduces the sampling ratios of the affected events, making the sampling ratios appear irregular (not approximately equal), similar to the phenomenon described in Table 1. In this case, it is difficult to determine the number of available PMCs from the sampling ratios.
Fig. 6.
We need to ensure the events in the event list are non-conflicting for detection. Since obtaining exhaustive event constraints in advance is difficult, picking a sufficient number of non-conflicting events at once is challenging. To this end, we use a trial-and-error approach as a workaround, i.e., adding events one by one. From the discussion above, it is easy to determine from the sampling ratios whether the event list is conflict-free. If, after adding an event, the sampling ratios are irregular (not approximately equal), then the added event conflicts with the former event list, and we need to replace it. In the flowchart shown in Figure 5, the step labeled "Remove the last added event" ensures that the event list remains non-conflicting.
In summary, when we select an event and add it to the list, we first determine whether the event leads to conflicts, depending on whether the sampling ratios of the events other than the pinned ones are approximately equal. Then we determine whether the number of events is enough to trigger multiplexing, depending on whether the sampling ratios are all 100.00%. We extend the event list until the sampling ratios reach the expected stage, and then the detection is complete.
Because of scheduling overhead, the sampling ratios of these events are not precisely equal. In practice, we assume them to be approximately equal if the absolute error is within 1.00%. In Section 8.2.3, we give an example from an experiment to verify the feasibility of the trial-and-error approach in the presence of event constraints.
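The whole procedure can be condensed into the following Python sketch; measure() and parse_ratios() are assumed helpers (run perf on an event list and return each event's sampling ratio as a fraction), and the candidate pool stands for generic-PMC events taken from perf list.

```python
TOLERANCE = 0.01  # ratios within 1.00% absolute error count as equal

def detect_available_pmcs(pinned, candidates, measure, parse_ratios):
    events = list(pinned)           # start from the initial event list
    pool = list(candidates)
    while pool:
        events.append(pool.pop(0))  # add one more candidate event
        ratios = parse_ratios(measure(events))
        unpinned = ratios[len(pinned):]
        if max(unpinned) - min(unpinned) > TOLERANCE:
            events.pop()            # conflict: replace the added event
        elif max(unpinned) < 1.0 - TOLERANCE:
            # Multiplexing triggered with equal ratios: every available
            # PMC is covered, so the sum of ratios is the PMC count.
            return round(sum(ratios))
        # else: all ratios are still 100.00%, keep extending the list
    raise RuntimeError("candidate pool exhausted before detection")
```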
6 Scheduling Policy of Linux Perf_event
In the motivating examples, we presented the irregular sampling ratios that arise when multiplexing PMCs for measurement in Linux perf_event. In this section, we discuss the scheduling policy in detail by describing its algorithm, explaining the scheduling pitfall, and proposing a method to mitigate the pitfall in user mode.
6.1 Hierarchical Scheduling Policy
In Linux perf_event, multiplexing is automatically triggered when the events to be monitored cannot all be assigned to the available PMCs at the same time. In this case, the scheduler adopts a round-robin-style scheduling policy to multiplex PMCs for measurement. The round-robin algorithm is commonly applied for load balancing, so the scheduling policy aims at equal opportunities for each event group to be measured by PMCs.
Designed to be cross-platform, the scheduling policy is hierarchical and includes two layers: the platform-independent layer and the platform-dependent layer [15]. The platform-independent layer is consistent across all supported platforms, whereas the platform-dependent layer handles platform-specific PMC assignment and thus varies across platforms.
Based on a thorough analysis of the source code of the perf_event subsystem in the latest stable Linux kernel (version 5.15), we summarize the scheduling policy in Algorithm 1. Its features are listed below:
•
The unit of scheduling is an event group (line 2). All events in the same event group are scheduled simultaneously.
•
In the platform-independent layer (all lines except 3–16), the scheduling is conducted in passes. Each pass is triggered at a fixed time interval, 4 ms by default (line 22). The order of event groups changes at each pass: the scheduler maintains a queue of event groups in the order it originally received them, and at the end of each pass it shifts the first event group to the end of the queue (line 21).
•
In each pass, the platform-independent layer tries to send each event group to the platform-dependent layer (lines 3–16) one by one in the order of the queue for PMC assignment. For each event in the received event group, the platform-dependent layer checks all possible PMCs and assigns the first available PMC to the event (lines 11–16). If no available PMC is found, then a PMC assignment failure is sent to the platform-independent layer (line 16).
•
When the platform-independent layer receives the failure report, it terminates scheduling for this pass (line 19), i.e., the subsequent event groups are not passed to the platform-dependent layer for PMC assignment.
It is noteworthy that the platform-dependent layer is implemented slightly differently across platforms; lines 3–16 in Algorithm 1 give a generalized policy for PMC assignment. In the current ARM implementation, \(k\) is statically set to 30 (line 11) as the maximum number of PMCs, although not that many PMCs actually exist. When the scheduler assigns an event monitored by generic PMCs, it searches from the first generic PMC to the last for an available one. In the x86-64 implementation, the scheduler queries the number of configured PMCs beforehand via the CPUID instruction and sets \(k\) to that number. In addition, PMC assignment may also be affected by event constraints.
The phenomenon of irregular sampling ratios (Section 3.1) can be explained by this scheduling policy. Figure 7 illustrates the scheduling records of each pass for these eight events. In the second pass, \(e_1\) is moved to the end of the queue while \(e_2\) is advanced to the beginning, and PMC assignment is conducted in queue order. Although the fixed PMC required by \(e_1\) is unused, \(e_1\) still cannot use it in this pass, because the scheduling is terminated by the PMC assignment failure for \(e_8\). The fixed PMCs remain unused in subsequent passes for the same reason.
Fig. 7.
Note that the scheduling results are periodic, with a period equal to the number of event groups. There are eight events in total, so the scheduling result of the ninth pass is the same as that of the first pass. Thus, we can extrapolate the theoretical sampling ratio of each event. For example, \(e_1\) is scheduled in five of eight passes, so its theoretical sampling ratio is 62.50% (5/8).
For all monitored events, the actual sampling ratios are very close to the theoretical ones, so the scheduling policy explains the irregular sampling ratios. The remaining minor gap originates mainly from scheduling and context-switching overhead.
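The theoretical ratios in Figure 7 follow mechanically from this policy. As an illustration, the sketch below extends the earlier group-size simulation to distinguish fixed and generic PMCs, and it reproduces the 5/8, 6/8, 7/8, and 4/8 ratios of Table 3's default case.

```python
# A sketch of Algorithm 1 for eight single-event groups on 3 fixed +
# 4 generic PMCs; the first assignment failure terminates the pass.
def simulate(events, n_generic, passes):
    queue = list(events)            # entries: (name, fixed_pmc or None)
    counts = {name: 0 for name, _ in events}
    for _ in range(passes):
        fixed_used, generic_used = set(), 0
        for name, fixed in queue:
            if fixed is not None:   # event bound to its own fixed PMC
                if fixed in fixed_used:
                    break           # failure ends the pass early
                fixed_used.add(fixed)
            else:                   # event takes any free generic PMC
                if generic_used == n_generic:
                    break
                generic_used += 1
            counts[name] += 1
        queue.append(queue.pop(0))  # rotate the event-group queue
    return counts

events = [(f"e{i}", i if i <= 3 else None) for i in range(1, 9)]
for name, c in simulate(events, n_generic=4, passes=8).items():
    print(f"{name}: {c}/8")         # e1: 5/8, e2: 6/8, e3..e7: 7/8, e8: 4/8
```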
6.2 The Scheduling Pitfall
The motivating example reveals the pitfall of the scheduling policy in Linux perf_event: conflating events monitored by fixed PMCs with events monitored by generic PMCs. In the platform-independent layer, all events have equal status, regardless of whether they are monitored by fixed or generic PMCs. The platform-independent layer focuses only on giving all monitored events an equal opportunity to use PMCs, which it implements by adjusting the order of event groups at each pass of scheduling.
Such a policy is fair and reasonable when the events are all monitored by generic PMCs. Nevertheless, the pitfall arises when it comes to fixed PMCs and their corresponding events. When a PMC assignment failure is reported from the platform-dependent layer, the scheduler is unable to determine whether there are still available PMCs for the subsequent events, because it does not know the status of the underlying PMCs. In that case, the scheduler assumes that there are no available PMCs and does not attempt to assign PMCs to the subsequent event groups, possibly leaving some PMCs unused, which ultimately decreases the sampling ratios of the affected events.
The scheduling pitfall may result in irregular sampling ratios and underutilization of PMCs. For users and developers who are unfamiliar with the underlying PMU architecture and the scheduling policy, the pitfall may affect the accuracy and reliability of microarchitecture performance measurement.
6.3 The Mitigation
As the motivating example shows, events monitored by fixed PMCs may interfere with the scheduling policy. Fixed PMCs are designated to monitor these commonly used events, so such events should always occupy their corresponding fixed PMCs and escape from scheduling when multiplexing.
The pitfall can be mitigated in user mode. The solution is to manually declare events monitored by fixed PMCs as "pinned" when multiplexing PMCs for measurement; the interface of Linux perf_event supports this feature, which we already applied in detecting available PMCs (Section 5). The purpose is to make these events escape from scheduling and occupy their corresponding PMCs consistently. The prerequisite of the solution is knowing which events are monitored by fixed PMCs.
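Concretely, with the perf command line this amounts to adding the pin modifier to the fixed-PMC events. The sketch below applies it to the event list of Table 1; ":D" is perf's pin-to-PMU event modifier, and the comments map the events back to Table 1.

```python
# The eight events from Table 1, with the three fixed-PMC events
# pinned so that only the generic-PMC events are multiplexed.
events = ",".join([
    "inst_retired.any:D",           # e1*, pinned to its fixed PMC
    "cpu_clk_unhalted.thread:D",    # e2*
    "cpu_clk_unhalted.ref_tsc:D",   # e3*
    "mem_inst_retired.all_loads",   # e4 .. e8 take turns on the
    "mem_inst_retired.all_stores",  # four generic PMCs fairly
    "l1d.replacement",
    "l2_lines_in.all",
    "itlb_misses.stlb_hit",
])
# Equivalent invocation: perf stat -e <events> -- ./workload
```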
For the same event list, if we pin the first three events to their corresponding fixed PMCs, then the event list is changed from \((e_1, e_2, e_3, e_4, \ldots , e_8)\) to \((e_1^*, e_2^*, e_3^*, e_4, \ldots , e_8)\). Compared to the default case without pinned events, the sampling ratios of each event are shown in Table 3. From this experimental result, it is clearly shown that when the first three events are pinned, the latter five events that are monitored by generic PMCs have almost equal sampling ratios of around 80.00%, indicating fair scheduling. In addition, the sum of sampling ratios increases, indicating better utilization of PMCs.
Note that the pitfall could also be mitigated by modifying the Linux kernel. The pitfall stems from the implementation of the platform-independent layer, which cannot automatically recognize and handle the events monitored by fixed PMCs; modifying the platform-independent scheduling so that these events always occupy their corresponding fixed PMCs is a possible solution. In this article, our mitigation is more straightforward: since the scheduler cannot handle events monitored by fixed PMCs properly, removing these events from scheduling is the simplest solution that requires no kernel modification. It is easy to implement and to apply to other tools or interfaces based on Linux perf_event. Therefore, we chose a user-mode workaround to mitigate the scheduling pitfall.
7 Adaptive Event Grouping
Having detected the number of available PMCs and mitigated the scheduling pitfall of the Linux kernel, in this section we illustrate the method of generating adaptive event groups based on the number of available PMCs.
As the motivating example shows (Section 3.2), under the scheduling policy of Linux perf_event, inefficient event groups may result in underutilization of PMCs and unfair scheduling of event groups, which can degrade the quality of the derived performance metrics. Current state-of-the-art tools do a good job of grouping events for reliability, but they neglect the information about available PMCs when generating event groups.
To this end, we propose a method of adaptive grouping to increase the efficiency of multiplexing PMCs for measurement without compromising the reliability of performance metrics. The purpose is to make the scheduling fair for all event groups and to leverage as many underlying PMCs as possible.
The method contains two steps:
(1)
Based on the performance metrics to be measured, generate original event groups to ensure that events for the same metric are in the same event group.
(2)
Based on the detected number of available PMCs, merge the original event groups as much as possible to obtain adaptive event groups.
Step (1) guarantees the reliability of derived metrics, and step (2) aims at fair event scheduling and efficient PMC multiplexing. Given a set of original event groups \(S = \lbrace G_1, G_2, \ldots , G_n \rbrace\), the process of generating adaptive event groups is described in Algorithm 2.
The algorithm is designed based on a greedy strategy, and its worst-case time complexity is \(O(n^2)\). Since the algorithm runs only once, before the measurement starts, its overhead is tolerable.
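As Algorithm 2 is not reproduced here, the following Python sketch gives one plausible first-fit-decreasing realization of the merging step, under the constraint that a merged group never exceeds the number of available generic PMCs.

```python
# A greedy sketch of event-group merging: try to fold each group into
# an already merged group; open a new group only when nothing fits.
def merge_groups(groups: list[set], n_pmcs: int) -> list[set]:
    merged: list[set] = []
    # Largest groups first, a common heuristic for bin packing.
    for group in sorted(groups, key=len, reverse=True):
        for target in merged:
            # Set union handles events shared between groups.
            if len(target | group) <= n_pmcs:
                target |= group
                break
        else:
            merged.append(set(group))
    return merged

# The example from Section 3.2: four generic PMCs, three groups.
groups = [{"e1", "e2"}, {"e3", "e4", "e5"}, {"e6", "e7"}]
print(merge_groups(groups, n_pmcs=4))
# [{'e3', 'e4', 'e5'}, {'e1', 'e2', 'e6', 'e7'}] (element order may vary)
```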
During the merging process, the reliability of the metrics is not sacrificed, because the events for the same metric remain in the same event group. The purpose is to improve the overall sampling ratios of performance metrics, thereby ultimately improving the quality of measurement. Moreover, the method can easily be adopted by other tools to optimize their predefined event groups or grouping strategies, and it is instructive for users of PMC interfaces, such as PAPI.
8 Evaluation
We conducted a series of case studies to evaluate our approach. The evaluation aims at answering the following questions:
(1)
Is the method feasible to detect the number of available PMCs for mainstream processors? (Section 8.2)
(2)
Are the performance metrics measured by the tool correct? (Section 8.3)
(3)
How efficient is the tool’s measurement compared to other state-of-the-art tools? (Section 8.4)
8.1 Experimental Setups
In this article, we selected four different representative processors, covering mainstream x86-64 and AArch64 ISAs. They are Intel Xeon Gold 5218R (Cascade Lake), Intel Xeon Platinum 8352Y (Ice Lake), Hisilicon Kunpeng 920, and Ampere Altra, denoted as Machine A, B, C, and D, respectively. For Intel processors, Hyper-Threading is enabled.
Our methods and tool are implemented on top of Linux perf, which is in turn based on the perf_event subsystem. We conducted the experiments on the latest stable version of the Linux kernel (version 5.15).
We selected 10 benchmarks from the SPECrate 2017 Integer suite (a part of SPEC CPU 2017 [13]) as the workload for evaluation. The benchmarks were built with the default configuration and GNU compiler version 11.3.0, and run in base tuning mode with the train input size. We chose this workload for the following reasons:
(1)
For evaluating the available PMC detection, considering the default scheduling interval of Linux perf_event is 4 ms, the running time of the workload is sufficient for enough passes of event scheduling.
(2)
For evaluating the correctness of measurement, the workload is computationally intensive and has stable performance characteristics [28, 39] across multiple runs. In addition, the workload runs for a relatively long time to ensure that the tools are capable of collecting sufficient samples while multiplexing PMCs for measurement.
8.2 Experiments on Available PMC Detection
We conducted a series of experiments on detecting the number of available PMCs across different platforms via the trial-and-error approach. Due to the limited length of the article, we present three representative experiments. The purposes of the selected experiments are as follows:
(1)
The first experiment (Section 8.2.1) is to validate the feasibility of detecting available PMCs on AArch64 platforms.
(2)
The second experiment (Section 8.2.2) is to demonstrate the effect of NMI watchdog on the number of available PMCs on x86-64 platforms.
(3)
The third experiment (Section 8.2.3) is to evaluate the capability of handling event constraints in the process of the detection on x86-64 platforms.
8.2.1 Detection for AArch64 Processors.
For AArch64 processors (machines C and D), we started with the initial event list \((\texttt{cpu_cycles}^*)\), where the event cpu_cycles is pinned. On machine C, we gradually extended the event list until we reached the expected stage according to the trial-and-error detection method. The added events were selected arbitrarily from the output of perf list on the machine. Table 4 shows the process of extending the event list and the sampling ratios reported by Linux perf at each iteration.
Table 4. Detection of Available PMCs (Hisilicon Kunpeng 920)

| # | Event | Iter. 1 | Iter. 2 | ... | Iter. 13 | Iter. 14 |
|---|---|---|---|---|---|---|
| 1 | cpu_cycles* | 100.00% | 100.00% | ... | 100.00% | 100.00% (13/13) |
| 2 | br_mis_pred | — | 100.00% | ... | 100.00% | 92.84% (12/13) |
| 3 | br_mis_pred_retired | — | — | ... | 100.00% | 92.84% (12/13) |
| 4 | br_pred | — | — | ... | 100.00% | 92.84% (12/13) |
| 5 | br_retired | — | — | ... | 100.00% | 92.87% (12/13) |
| 6 | br_return_retired | — | — | ... | 100.00% | 92.86% (12/13) |
| 7 | bus_access | — | — | ... | 100.00% | 92.86% (12/13) |
| 8 | bus_cycles | — | — | ... | 100.00% | 92.86% (12/13) |
| 9 | cid_write_retired | — | — | ... | 100.00% | 92.86% (12/13) |
| 10 | dtlb_walk | — | — | ... | 100.00% | 92.86% (12/13) |
| 11 | exc_return | — | — | ... | 100.00% | 92.86% (12/13) |
| 12 | exc_taken | — | — | ... | 100.00% | 92.86% (12/13) |
| 13 | inst_retired | — | — | ... | 100.00% | 92.86% (12/13) |
| 14 | inst_spec | — | — | ... | — | 92.84% (12/13) |
| | Sum of sampling ratios | | | | | 1307.11% |
During the extension, we did not encounter event constraints, as the sampling ratios never appeared irregular in any iteration. At the 14th iteration, we observed that the sampling ratios of the events other than the pinned event were almost equal (within 1.00% absolute error), indicating that we had reached the expected stage. The sum of the sampling ratios is 1307.11%. Therefore, the number of available PMCs on machine C is 13 (we picked the nearest integer), one of which is a fixed PMC designated for cpu_cycles.
8.2.2 Detection for x86-64 Processors with NMI Watchdog Enabled.
The Linux kernel supports an NMI watchdog on x86-64 platforms for detecting hard and soft lockups [23, 25]. By default, the NMI watchdog is enabled. However, for microarchitecture measurement, an enabled NMI watchdog may cause system crashes [24] or affect the accuracy of performance data [10]. Hence, the NMI watchdog is typically disabled during measurement [6, 10, 51].
To demonstrate the effect of an enabled NMI watchdog on the number of available PMCs, we ran the detection with the watchdog enabled. The detection process is shown in Table 5. The example illustrated in Table 3 (Section 6.3) already indicated that three fixed PMCs and four generic PMCs are available on Intel Cascade Lake processors when the NMI watchdog is disabled. This result indicates that only three generic PMCs are available when the NMI watchdog is enabled.
Table 5. Detection of Available PMCs (Intel Cascade Lake, NMI Watchdog Enabled)

| # | Event | Iter. 1 | Iter. 2 | Iter. 3 | Iter. 4 | Iter. 5 |
|---|---|---|---|---|---|---|
| 1 | inst_retired.any* | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% (4/4) |
| 2 | cpu_clk_unhalted.thread* | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% (4/4) |
| 3 | cpu_clk_unhalted.ref_tsc* | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% (4/4) |
| 4 | mem_inst_retired.all_loads | — | 100.00% | 100.00% | 100.00% | 74.99% (3/4) |
| 5 | mem_inst_retired.all_stores | — | — | 100.00% | 100.00% | 75.01% (3/4) |
| 6 | br_inst_retired.all_branches | — | — | — | 100.00% | 75.01% (3/4) |
| 7 | br_misp_retired.all_branches | — | — | — | — | 74.99% (3/4) |
| | Sum of sampling ratios | | | | | 600.00% |
From the result, we validated that an enabled NMI watchdog occupies a generic PMC on x86-64 platforms. Thus, in addition to affecting measurement quality, an enabled NMI watchdog exacerbates the conflict between the number of available PMCs and the number of events to be monitored. In this article, experiments on x86-64 platforms were conducted with the NMI watchdog disabled unless otherwise specified.
8.2.3 Detection for x86-64 Processors with Event Constraints.
The experiment for x86-64 processors mainly verifies the capability of detecting available PMCs when event constraints exist.
In fact, most of the events selected from the output of perf list do not have event constraints; in most cases, we do not encounter them, and the detection proceeds as in the motivating examples. Therefore, we deliberately selected a constrained event to conduct the detection. The process of extending the event list when encountering event constraints on machine A is illustrated in Table 6.
Table 6. Detection of Available PMCs (Intel Cascade Lake, with Event Constraints, NMI Watchdog Disabled)

| # | Event | Iter. 1 | Iter. 2 | Iter. 3 | Iter. 4 | Iter. 5 | Iter. 6 | Iter. 7 |
|---|---|---|---|---|---|---|---|---|
| 1 | inst_retired.any* | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% (5/5) |
| 2 | cpu_clk_unhalted.thread* | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% (5/5) |
| 3 | cpu_clk_unhalted.ref_tsc* | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% (5/5) |
| 4 | mem_inst_retired.all_loads | — | 100.00% | 100.00% | 33.34% | 100.00% | 100.00% | 79.90% (4/5) |
| 5 | mem_inst_retired.all_stores | — | — | 100.00% | 66.71% | 100.00% | 100.00% | 79.94% (4/5) |
| 6 | inst_retired.prec_dist\(^{1}\) | — | — | — | 0.00% | — | — | — |
| 7 | l1d.replacement | — | — | — | — | 100.00% | 100.00% | 79.93% (4/5) |
| 8 | l2_lines_in.all | — | — | — | — | — | 100.00% | 80.14% (4/5) |
| 9 | itlb_misses.stlb_hit | — | — | — | — | — | — | 80.17% (4/5) |
| | Sum of sampling ratios | | | | | | | 700.08% |

\(^1\)This event was detected to have conflicts with the former events and was replaced.
At the fourth iteration, we selected the event inst_retired.prec_dist and observed the phenomenon of irregular sampling ratios. As mentioned above, this stems from event constraints, so we needed to replace the event. At the fifth iteration, we substituted another event, and the sampling ratios were all 100.00%, so we could continue adding events. Finally, the event list triggered multiplexing, and we successfully detected the number of available PMCs while avoiding the effects of event constraints. The detected number was as expected.
8.2.4 Discussions.
The three experiments above illustrate in detail how the data-driven detection of available PMCs works on machines A and C. We conducted similar experiments on machines B and D. In summary, the numbers of detected PMCs for the four processors are shown in Table 7. To verify the correctness of the detection, we compared the results against the documents provided by the hardware vendors.
Table 7. Results of the Detection of Available PMCs for Mainstream Processors
\(^1\)NMI watchdogs are disabled.
Bolded numbers indicate that the results obtained by the detection do not match the documentation.
From the experimental results, we discovered some interesting facts. Hisilicon Kunpeng 920 and Ampere Altra are both designed on the ARMv8-A architecture, whose ISA document [29] defines six generic PMCs per core. Yet the vendor of Kunpeng 920 increased the number of generic PMCs to 12 and, as far as we know, does not disclose this feature in any public document. We have verified this result with engineers who are familiar with the PMU design of Kunpeng 920.
The detection has the following limitations:
(1)
The method needs to exclude the interference of the events monitored by fixed PMCs beforehand, so it requires prior knowledge of the events possibly monitored by fixed PMCs. If we do not cover all events monitored by fixed PMCs, then it is impossible to detect all available PMCs.
(2)
The method only covers core PMCs, which are deployed inside cores. In multi-core processors, some hardware components, called uncore units, are located outside the cores and shared by multiple cores, such as the LLC and integrated memory controllers. These uncore units have their own PMCs for monitoring events. So far, the method is unable to detect those PMCs.
8.3 Evaluation for Correctness
For profiling tools, we need first to ensure that the performance metrics obtained from measurement are correct and reliable. Otherwise, the subsequent analysis will be meaningless. For this purpose, we need to verify the correctness of the measurements of the tool implemented based on our approach.
8.3.1 Baseline.
Hardware measurements are usually subject to random errors [14], so it is hard to obtain absolutely accurate measurement results as a criterion for accuracy. A workaround is to choose a recognized, well-established profiling tool as the baseline: as long as the results obtained by our tool are similar to those of the well-established tool for the same workload, we assume the measurement is correct.
We chose Intel VTune/EMON as the baseline for x86-64 processors. It can automatically generate measurement plans and event groups based on the events to be measured and guarantee the reliability of the derived metrics at the same time. Engineers generally acknowledge the reliability of Intel VTune/EMON, since the hardware vendor provides it. We assumed that Intel’s self-developed tools are capable of obtaining the most accurate performance data on the processors they produce.
To the best of our knowledge, there is no recognized reliable measurement tool for evaluating correctness on AArch64 platforms. For the moment, we chose ARM Top-down Tool, the official tool provided by Arm, as the baseline.
8.3.2 Experimental Results.
For each benchmark, we conducted three individual measurements and calculated the average of each metric. We use Mean Absolute Error (MAE) to evaluate the measurement error between our tool and the baseline.
Table 8 lists the results. The largest measurement error occurred for 505.mcf_r on Machine B. Table 9 compares the performance metrics measured by our tool and the baseline where the maximum measurement error occurs. Even in this worst case, the metrics measured by our tool remain highly similar to the baseline. Therefore, we confirmed that the measured performance metrics are correct and reliable for subsequent analysis.
\(^1\)Let \((x_1, x_2, \ldots , x_n)\) be the metrics collected by hperf and \((y_1, y_2, \ldots , y_n)\) be the metrics collected by the baseline. The MAE is calculated by \(\frac{1}{n} {\sum _{i=1}^{n} \vert x_i - y_i \vert }\).
Table 9. Correctness Evaluation: Comparison of Performance Metrics (505.mcf_r in SPECrate 2017 Integer, Machine B)

| Metric | Our Tool (hperf) | Baseline (Intel VTune) |
|---|---|---|
| CPI | 1.058327 | 1.049746 |
| L1 Cache MPKI | 33.083516 | 32.883894 |
| L1 Cache Miss Rate | 0.104154 | 0.104192 |
| L2 Cache MPKI | 15.559695 | 15.338713 |
| L2 Cache Miss Rate | 0.470315 | 0.469157 |
| LL Cache MPKI | 0.523275 | 0.512238 |
| LL Cache Miss Rate | 0.092237 | 0.093632 |
| Branch MPKI | 14.075051 | 14.001427 |
| Branch Miss Rate | 0.064389 | 0.064051 |
| ITLB MPKI | 0.001021 | 0.000392 |
| DTLB MPKI | 0.485641 | 0.484570 |
| DTLB Walk Rate | 0.001082 | 0.001087 |
| MAE | 0.043206 | |
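As a sanity check, the MAE can be recomputed directly from the two columns of Table 9 with the formula from the footnote of Table 8; a short Python sketch:

```python
hperf = [1.058327, 33.083516, 0.104154, 15.559695, 0.470315, 0.523275,
         0.092237, 14.075051, 0.064389, 0.001021, 0.485641, 0.001082]
vtune = [1.049746, 32.883894, 0.104192, 15.338713, 0.469157, 0.512238,
         0.093632, 14.001427, 0.064051, 0.000392, 0.484570, 0.001087]

# MAE = (1/n) * sum(|x_i - y_i|)
mae = sum(abs(x - y) for x, y in zip(hperf, vtune)) / len(hperf)
print(f"{mae:.6f}")  # 0.043207 from the rounded table values, matching
                     # the reported 0.043206 up to rounding
```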
8.4 Comparative Analyses of Efficiency
8.4.1 Tools for Comparison.
In this subsection, we selected two profiling tools, LIKWID [43] and ARM Top-down Tool [31, 36], to compare efficiency with our tool on x86-64 and AArch64 machines, respectively. LIKWID is a suite of command-line profiling utilities widely used for collecting performance counter metrics on x86-64 systems. It is open source, so the definitions of its event groups can be obtained; it has predefined event groups for time-based group switching. Intel VTune/EMON is closed source, and it is impossible to obtain its event-group definitions or generation strategy, so it cannot be used for comparison.
There is no universally recognized microarchitecture performance measurement tool on AArch64 systems. Therefore, we chose ARM Top-down Tool for comparison, which is officially developed by ARM and dynamically generates event groups based on the metrics to be measured.
Note that many state-of-the-art works are implemented on top of PAPI [5, 42], a user-mode performance monitoring interface. Users can define various event groups (called EventSets in PAPI) based on the metrics to be measured and program these event groups to take turns using PMCs. However, PAPI does not have predefined event groups for metrics, unlike LIKWID or ARM Top-down Tool.7 Since we aimed to compare different event groups for evaluating efficiency, we did not choose PAPI for comparison.
8.4.2 Methodology.
We selected typical and commonly used event groups in these tools to conduct measurements, then optimized them for efficiency using our proposed method of adaptive grouping. We tried to align the event groups of the two tools, but could not do so perfectly because different platforms support different events. To evaluate the efficiency of the measurements, we focused on two aspects: sampling ratios of performance metrics and PMC utilization.
8.4.3 Optimization over LIKWID.
We selected six commonly used predefined event groups in LIKWID and conducted comparative experiments on Machine A. The selected event groups and their events are listed in Table 10. Each event group defines its own performance metrics, which are derived only from the events within that group.
\(^1\)These three events, monitored by fixed PMCs, are included in every predefined event group in LIKWID.
LIKWID implements its own scheduling policy. In each scheduling interval, only one event group occupies underlying PMCs. Therefore, the average sampling ratio of the six event groups and their metrics is 16.67% (1/6) and the PMC utilization for measurement is 78.57% (33/42).
According to the method of adaptive event grouping, because the number of available generic PMCs detected on Machine A is four, Group DATA (2 events) and Group BRANCH (2 events) can be merged; Group TLB_INSTR (2 events) and Group L2CACHE (2 events) can also be merged. The process of optimizing event groups is shown in Figure 8.
Fig. 8. The process of optimizing event groups.
After the optimization, the number of event groups was reduced from six to four. The experimental results indicated that the sampling ratios for each event group and each metric were all improved to 25.00% (1/4). In addition, the PMC utilization in this measurement was improved to 96.43% (27/28).
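For illustration, the following minimal C sketch applies the merging criterion described above: two groups may be merged only if their combined number of events does not exceed the number of available generic PMCs (cf. line 3 in Algorithm 2 and the constraint assumption in Section 8.4.6). It is our own first-fit sketch with illustrative group sizes, not hperf's actual implementation.

```c
/* Sketch of adaptive group merging under the capacity criterion:
 * merge two groups only if their combined event count fits within
 * the available generic PMCs. Not hperf's actual code; assumes no
 * event constraints. */
#include <stdio.h>

#define NUM_GENERIC_PMCS 4   /* generic PMCs detected, as on Machine A */

int main(void) {
    /* e.g., Groups DATA, BRANCH, TLB_INSTR, L2CACHE: two events each */
    int sizes[] = {2, 2, 2, 2};
    int n = (int)(sizeof(sizes) / sizeof(sizes[0]));
    int merged[16];           /* event counts of the merged groups */
    int m = 0;

    for (int i = 0; i < n; i++) {
        int placed = 0;
        /* first-fit: add the group to the first merged group with room */
        for (int j = 0; j < m && !placed; j++) {
            if (merged[j] + sizes[i] <= NUM_GENERIC_PMCS) {
                merged[j] += sizes[i];
                placed = 1;
            }
        }
        if (!placed)
            merged[m++] = sizes[i];   /* start a new merged group */
    }

    /* Under a one-group-per-interval policy, each metric's sampling
     * ratio becomes 1/m after merging. */
    printf("%d groups -> %d merged groups\n", n, m);
    return 0;
}
```

Run on the four two-event groups above, the sketch merges DATA with BRANCH and TLB_INSTR with L2CACHE, printing "4 groups -> 2 merged groups", consistent with the optimization shown in Figure 8.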
8.4.4 Optimization over ARM Top-down Tool.
Unlike LIKWID, ARM Top-down Tool generates event groups according to metrics, i.e., a metric corresponds to an event group. We selected seven commonly used categories of metrics in the tool and conducted comparative experiments on Machine D. There are 14 event groups and 15 distinct events, which are listed in Table 11.
Table 11. Selected Predefined Event Groups in ARM Top-down Tool (each metric corresponds to one event group; an asterisk * denotes a pinned event)

Category                  Metric                      Events
Cycle Accounting          Frontend Stalled Cycles     cpu_cycles*, stall_frontend
                          Backend Stalled Cycles      cpu_cycles*, stall_backend
General                   Instructions Per Cycle      cpu_cycles*, inst_retired
Branch Effectiveness      Branch Miss Ratio           br_retired, br_mis_pred_retired
                          Branch MPKI                 br_mis_pred_retired, inst_retired
L1I Cache Effectiveness   L1I Cache Miss Ratio        l1i_cache, l1i_cache_refill
                          L1I Cache MPKI              l1i_cache_refill, inst_retired
L1D Cache Effectiveness   L1D Cache Miss Ratio        l1d_cache, l1d_cache_refill
                          L1D Cache MPKI              l1d_cache_refill, inst_retired
L2 Cache Effectiveness    L2 Cache Miss Ratio         l2d_cache, l2d_cache_refill
                          L2 Cache MPKI               l2d_cache_refill, inst_retired
LL Cache Effectiveness    LL Cache Read Miss Ratio    ll_cache_rd, ll_cache_miss_rd
                          LL Cache Read Hit Ratio     ll_cache_miss_rd, ll_cache_rd
                          LL Cache Read MPKI          ll_cache_miss_rd, inst_retired
ARM Top-down Tool is implemented on top of Linux perf, so it follows the scheduling policy of Linux perf_event when multiplexing PMCs. The experimental results indicated that the sampling ratio for each metric is 21.42% (3/14) on average and the PMC utilization is 85.71% (84/98).
It is worth noting that there is considerable overlap of events across these event groups, and every event group contains only two events, leaving plenty of room for efficiency optimization. Given that the number of available generic PMCs detected on Machine D is six, we merged as many event groups as possible. The optimization process is also shown in Figure 8. After the optimization, only three event groups were left. The experimental results indicated that the sampling ratios for each event group and performance metric improved from 21.42% (3/14) to 33.33% (1/3). However, the PMC utilization is the same as with the original predefined event groups: both are 85.71% (18/21).
8.4.5 Discussions.
The improvement in efficiency through the proposed optimization is summarized in Figure 9. The comparison shows that the optimization improves the sampling ratios of metrics by around 50%, which leads to a significant improvement in the quality of performance metrics when multiplexing PMCs.
Fig. 9. Summary of the efficiency improvement brought by the optimization.
For LIKWID, the optimization improves PMC utilization by around 23%, while for ARM Top-down Tool there is no improvement. The results indicate that the optimization does not necessarily improve PMC utilization. AArch64 processors have only one fixed PMC, so the impact of the scheduling pitfall of Linux perf_event is insignificant, and PMCs can be leveraged in most passes of scheduling.
Nevertheless, the optimization reduces the number of event groups and improves the sampling ratios of metrics. This is meaningful, especially for cloud service profiling. In cloud or serverless scenarios, the profiling overhead must be tolerable, and the scheduling interval cannot be too short, so as not to interfere with other normally running applications. At the same time, sufficient performance data needs to be collected in as short a time as possible. Fewer event groups are beneficial for performance measurement in these scenarios.
For example, given a scheduling interval of 1 second and a measurement time of 1 minute, using LIKWID to collect the same performance metrics as in the previous experiments yields 10 samples per metric, because there are six event groups. Using the optimized event groups instead yields 15 samples per metric, since the number of event groups is reduced to four. Therefore, the improvement significantly benefits the efficiency of measurement and the quality of subsequent analysis.
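In general, under a one-group-per-interval scheduling policy with measurement time \(T\), scheduling interval \(\Delta t\), and \(G\) event groups, each metric receives roughly
\[
\text{samples per metric} \approx \frac{T}{\Delta t \cdot G},
\]
which gives \(60 / (1 \times 6) = 10\) samples with LIKWID's six predefined groups and \(60 / (1 \times 4) = 15\) with the optimized four.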
8.4.6 Limitations.
Note that our method of adaptive grouping is based on the assumption that event constraints do not exist. In practice, event constraints may exist within a merged event group, in which case the event groups cannot actually be merged. So far, the criterion for determining whether two event groups can be merged is that the number of merged events does not exceed the number of available PMCs (line 3 in Algorithm 2). Therefore, our approach can be enhanced by further exploring the flexibility to handle event constraints.
In addition, we did not analyze the scheduling policies and event grouping strategies of closed source measurement tools, such as Intel VTune/EMON, for efficiency comparisons. However, it is possible to analyze them through indirect approaches. For example, one approach worth attempting is to periodically read the PMU's control registers to see which events are configured and to infer the scheduling policy.
9 Related Work
To resolve the tension between too many events and too few PMCs, there are mainly two types of workarounds: multiple measurements and multiplexing.
For multiple measurements, we need to conduct several independent measurements to cover all required events. This approach obtains the most accurate representation of workload characteristics, since the events are continuously monitored while the workload runs and no estimation is involved [32]. The approach is applicable to reproducible workloads with stable characteristics.
The key problem to be solved for this approach is trace alignment. M. Hauswirth et al. [20] proposed an algorithm for aligning time series of performance metrics collected through different measurements. R. Neill et al. [38] proposed a technique to reconcile and merge multiple profiles by analyzing task creation graphs for parallel programs.
This approach is inefficient when the number of events far exceeds the number of available PMCs [53]. In addition, it is inapplicable to real-life applications, such as cloud services and micro-service applications. Unlike benchmarks, these applications are mostly difficult to reproduce. Besides, performance degradation or failures may occur within a short time, so comprehensive profiles need to be collected with minimal overhead to locate problems. Therefore, we believe multiplexing is more widely applicable than multiple measurements in such scenarios. This article mainly focuses on multiplexing and proposes a series of methods to improve measurement efficiency.
For multiplexing, researchers are primarily concerned with improving the accuracy of performance measurements. It is related to two topics: the scheduling policy and the estimation.
The scheduling policy. The scheduling policies for multiplexing PMCs are usually designed for fairness, i.e., guaranteeing equal opportunity to occupy PMCs for each event or event group. Hence, most profiling tools and interfaces adopt round-robin style scheduling policies for multiplexing PMCs.
However, the specific scheduling implementations differ slightly, depending on whether multiple event groups are allowed to occupy PMCs in one pass of scheduling. For example, PAPI [5], MPX [34], and Perfmon2 [16] are performance monitoring interfaces for Linux that support user-defined event groups for multiplexing PMCs. In each pass of scheduling, event groups take turns occupying the PMCs. Unlike Linux perf_event, only one event group can be monitored by PMCs in a pass of scheduling, which means that even if there are still enough available PMCs for other event groups, they will not be scheduled. Therefore, efficient event groups are critical for the users of these interfaces. LIKWID also adopts this scheduling policy and thus suffers from the limitation of inefficient predefined event groups.
Although the scheduling policy of Linux perf_event is also round-robin style, it assigns as many event groups as possible to the available PMCs in each pass of scheduling, which means there may be multiple event groups in one pass. To some extent, this improvement mitigates the problem of PMC underutilization due to inefficient event groups.
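To make the grouping semantics of Linux perf_event concrete, the following self-contained C sketch (ours, using only documented perf_event interfaces; the event choice and workload loop are illustrative) creates a two-event group. perf_event guarantees that a sibling event is scheduled onto PMCs together with its group leader or not at all, which is why metrics derived from a group's events are measured simultaneously.

```c
/* Sketch: creating a two-event group with Linux perf_event.
 * May require a suitable perf_event_paranoid setting to run. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;                       /* enable the whole group later */

    /* Group leader: group_fd = -1. */
    int leader = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (leader < 0) { perror("perf_event_open(leader)"); return 1; }

    /* Sibling: passing the leader's fd puts it in the same group, so
     * both events are co-scheduled onto PMCs. */
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 0;
    int sibling = (int)perf_event_open(&attr, 0, -1, leader, 0);
    if (sibling < 0) { perror("perf_event_open(sibling)"); return 1; }

    ioctl(leader, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
    for (volatile int i = 0; i < 10000000; i++) ;    /* workload stand-in */
    ioctl(leader, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

    uint64_t cycles = 0, instructions = 0;
    if (read(leader, &cycles, sizeof(cycles)) != (ssize_t)sizeof(cycles) ||
        read(sibling, &instructions, sizeof(instructions)) != (ssize_t)sizeof(instructions)) {
        perror("read"); return 1;
    }
    /* Both counts come from the same scheduling passes, so the derived
     * metric is internally consistent. */
    printf("CPI = %.3f\n", instructions ? (double)cycles / instructions : 0.0);
    return 0;
}
```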
Besides round-robin style, there are other scheduling policies. To avoid blind spots due to the coincidence of the scheduling period and a loop iteration, Azimi et al. [2] randomize the order of event groups in each scheduling period. In addition, PAPI provides various supports for accessing PMCs and makes user-mode-implemented scheduling policies possible. Lim et al. [27] implemented a scheduling policy for efficiency on top of PAPI, setting priorities for events based on the rate of change of their counts.
The scheduling for multiplexing PMCs can be implemented in user or kernel mode. For example, PAPI and MPX implement the scheduling in user mode. In contrast, Azimi et al. [2] implemented a scheduling policy in kernel mode to alleviate the excessive scheduling overhead under fine-grained sampling. Perfmon2 also implements scheduling in kernel mode for similar reasons.
After the perf_event subsystem was officially introduced into the Linux kernel, many user-mode-implemented interfaces and tools, such as PAPI, switched to using Linux perf_event interfaces [47, 48]. As a result, Perfmon2 was largely superseded by Linux perf_event and is now deprecated. Part of its functionality has been preserved as libpfm4, which is still in use. The library helps convert a platform-independent event name into a platform-dependent event encoding, which facilitates the development of cross-platform profiling tools.
Our approach is designed and implemented based on the Linux perf_event subsystem. Dimakopoulou et al. [15] summarized the scheduling policy of Linux perf_event on x86-64 systems and fixed the problem of corrupted events with Hyper-Threading enabled on Intel processors by introducing cross-hyper-thread dynamic event scheduling. Based on their work, we went further to investigate the scheduling policy by considering fixed PMCs and extending to ARM-based systems. In addition, we revealed and mitigated the pitfall of no distinction between fixed PMC events and generic PMC events.
The estimation. May [34] proposed the linear scaling method to estimate actual counts when multiplexing PMCs, which is adopted by Linux perf. Banerjee et al. [4] demonstrated that higher sampling ratios contribute to more accurate estimation. Based on that, our methods are designed to improve the sampling ratios of metrics.
Based on the linear scaling method, researchers have proposed several estimation methods for improving accuracy. For example, Mathur et al. [33] proposed various time-interpolation-based methods, and Wang et al. [46] improved them by adopting non-linear models. Mytkowicz et al. [37] proposed an approach for evaluating the accuracy of these methods. Since our approach does not mainly focus on estimation, we chose linear scaling to evaluate it. We believe that improvements in measurement efficiency can benefit the estimation.
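For concreteness, the following minimal C sketch (ours; error handling reduced to the essentials) shows the linear scaling [34] as exposed by Linux perf_event: with the documented read format, the kernel reports time_enabled and time_running alongside the raw count, and the estimate is count × time_enabled / time_running. The ratio time_running / time_enabled is exactly the sampling ratio our approach seeks to raise.

```c
/* Sketch: linear scaling estimation with Linux perf_event. When an
 * event is multiplexed, time_running < time_enabled, and the actual
 * count is estimated as value * time_enabled / time_running. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

/* Layout returned by read() for a single event with the read_format
 * flags set below: { value, time_enabled, time_running }. */
struct read_fmt { uint64_t value, time_enabled, time_running; };

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
                       PERF_FORMAT_TOTAL_TIME_RUNNING;

    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    for (volatile int i = 0; i < 100000000; i++) ;   /* workload stand-in */

    struct read_fmt r;
    if (read(fd, &r, sizeof(r)) != (ssize_t)sizeof(r)) { perror("read"); return 1; }

    double estimate = r.time_running
        ? (double)r.value * r.time_enabled / r.time_running : 0.0;
    printf("raw=%llu estimated=%.0f sampling_ratio=%.2f\n",
           (unsigned long long)r.value, estimate,
           r.time_enabled ? r.time_running / (double)r.time_enabled : 0.0);
    return 0;
}
```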
Microarchitecture performance data is applied in a wide range of applications, including workload characterization [17, 40, 45], performance analysis [49, 50], and performance optimization of applications [21, 41, 54] and compilers [7, 35]. In recent years, PMCs have also been applied to system security, such as malware detection and side-channel attacks [14]. In this article, we do not address the specific applications of microarchitecture performance data but concentrate on measurement.
10 Conclusion
In this article, we proposed an approach for cross-platform microarchitecture performance measurement via adaptive grouping. The approach generates event groups based on the number of available PMCs detected while avoiding the scheduling pitfall of the Linux perf_event subsystem. We demonstrated the feasibility and effectiveness of our methods on current mainstream x86-64 and AArch64 processors. The evaluation indicated that our approach improves the average sampling ratio by around 50% compared to other state-of-the-art tools. Our approach has the potential to be applied in cloud and serverless profiling, especially for short-lived and latency-sensitive applications. In addition, it can be easily adopted by other profiling tools and interfaces.
Our approach still has several limitations and requires more in-depth study in the future. First, we plan to mitigate the scheduling pitfall of Linux perf_event in kernel mode by modifying the kernel implementation. Second, we will investigate the scheduling policies and grouping strategies of closed source measurement tools, such as Intel VTune/EMON, for multiplexing PMCs. Third, we will enhance the algorithm for generating adaptive event groups by further exploring the flexibility to handle event constraints.
Footnotes
\(^1\)In this article, x86-64 refers specifically to the instruction set architecture of Intel processors.
For Intel processors, Hyper-Threading affects the number of PMCs per processor. For example, Intel Cascade Lake processors have four generic PMCs per processor when Hyper-Threading is enabled and eight generic PMCs when Hyper-Threading is disabled. Since enabling Hyper-Threading is the default configuration, all discussions and experiments in this article assume Hyper-Threading is enabled on Intel processors.
\(^4\)There is an exception: x86-64 processors (including Intel and AMD) support the rdpmc instruction [1, 9], which makes it possible to read data from PMCs directly in user mode [47].
\(^5\)For example, cycles is the alias for cpu_clk_unhalted.thread (event=0x3c, umask=0x00) on Intel platforms and cpu_cycles (0x11) on ARM platforms; instructions is the alias for inst_retired.any (event=0xc0, umask=0x00) on Intel platforms and inst_retired (0x08) on ARM platforms. In this article, we sometimes use generalized event names for simplicity.
\(^6\)This feature can be used in Linux perf when specifying the events to be monitored with the -e option: the :D modifier pins an event to a PMC. For convenience of description, we use an asterisk (*) to denote pinned events in this article. This mechanism is also essential to mitigate the pitfall of the event scheduling policy in Linux perf_event, which we discuss in detail in Section 6.
\(^7\)PAPI defines various platform-independent derived events, each of which may be derived from multiple platform-dependent events. However, they are different from the event groups discussed in this article. For example, Total L1 Cache Misses (PAPI_L1_TCM) is a derived event defined on AArch64 platforms as the sum of L1 Data Misses (l1d_cache_refill) and L1 Instruction Misses (l1i_cache_refill), so two PMCs are required to monitor it. In this article, events are not derived, each requires only one PMC to be monitored, and event groups are designed for derived metrics.
References
Reza Azimi, Michael Stumm, and Robert W. Wisniewski. 2005. Online performance analysis by statistical sampling of microprocessor performance counters. In Proceedings of the 19th Annual International Conference on Supercomputing. Association for Computing Machinery, New York, NY, 101–110.
Subho S. Banerjee, Saurabh Jha, Zbigniew Kalbarczyk, and Ravishankar K. Iyer. 2021. BayesPerf: Minimizing performance monitoring errors using Bayesian statistics. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. Association for Computing Machinery, New York, NY, 832–844.
S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci. 2000. A portable programming interface for performance evaluation on modern processors. Int. J. High Perf. Comput. Appl. 14, 3 (2000), 189–204.
Mohak Chadha, Abhishek Srivastava, and Santonu Sarkar. 2016. Unified power and energy measurement API for HPC co-processors. In Proceedings of the IEEE 35th International Performance Computing and Communications Conference (IPCCC’16). IEEE Computer Society, Los Alamitos, CA, 1–8.
Dehao Chen, Neil Vachharajani, Robert Hundt, Shih-wei Liao, Vinodha Ramasamy, Paul Yuan, Wenguang Chen, and Weimin Zheng. 2010. Taming hardware event samples for FDO compilation. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’10). Association for Computing Machinery, New York, NY, 42–52.
Sanjeev Das, Jan Werner, Manos Antonakakis, Michalis Polychronakis, and Fabian Monrose. 2019. SoK: The challenges, pitfalls, and perils of using hardware performance counters for security. In Proceedings of the IEEE Symposium on Security and Privacy (SP’19). IEEE Computer Society, Los Alamitos, CA, 20–38.
Maria Dimakopoulou, Stéphane Eranian, Nectarios Koziris, and Nicholas Bambos. 2016. Reliable and efficient performance monitoring in Linux. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’16). IEEE Computer Society, Los Alamitos, CA, 396–408.
Stéphane Eranian. 2006. Perfmon2: A flexible performance monitoring interface for Linux. In Proceedings of the Ottawa Linux Symposium. Citeseer, USA, 269–288.
Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. 2012. Clearing the clouds: A study of emerging scale-out workloads on modern hardware. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVII). Association for Computing Machinery, New York, NY, 37–48.
Matthias Hauswirth, Amer Diwan, Peter F. Sweeney, and Michael C. Mozer. 2005. Automating vertical profiling. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications. Association for Computing Machinery, New York, NY, 281–296.
Lulu He, Zhibin Yu, and Hai Jin. 2012. FractalMRC: Online cache miss rate curve prediction on commodity systems. In Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium. IEEE Computer Society, Los Alamitos, CA, 1341–1351.
Mikael Hirki, Zhonghong Ou, Kashif Nizam Khan, Jukka K. Nurminen, and Tapio Niemi. 2016. Empirical study of the power consumption of the x86-64 instruction decoder. In USENIX Workshop on Cool Topics on Sustainable Data Centers (CoolDC’16). USENIX Association, Berkeley, CA. https://www.usenix.org/conference/cooldc16/workshop-program/presentation/hirki
Sanghyun Hong, Alina Nicolae, Abhinav Srivastava, and Tudor Dumitraş. 2018. Peek-a-Boo: Inferring program behaviors in a virtualized infrastructure without introspection. Comput. Secur. 79, C (Nov. 2018), 190–207.
Robert V. Lim, David Carrillo-Cisneros, Wail Y. Alkowaileet, and Isaac D. Scherson. 2014. Computationally efficient multiplexing of events on hardware counters. In Proceedings of Ottawa Linux Symposium. Citeseer, USA, 101–110.
Ankur Limaye and Tosiron Adegbija. 2018. A workload characterization of the SPEC CPU2017 benchmark suite. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’18). IEEE Computer Society, Los Alamitos, CA, 149–158.
Yirong Lv, Bin Sun, Qinyi Luo, Jing Wang, Zhibin Yu, and Xuehai Qian. 2018. CounterMiner: Mining big performance data from hardware counters. In Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’18). IEEE Computer Society, Los Alamitos, CA, 613–626.
Wiplove Mathur and Jeanine Cook. 2005. Improved estimation for software multiplexing of performance counters. In Proceedings of the 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems. IEEE Computer Society, Los Alamitos, CA, 23–32.
John M. May. 2001. MPX: Software for multiplexing hardware performance counters in multithreaded programs. In Proceedings of the 15th International Parallel and Distributed Processing Symposium (IPDPS’01). IEEE Computer Society, Los Alamitos, CA.
Tipp Moseley, Dirk Grunwald, and Ramesh Peri. 2009. OptiScope: Performance accountability for optimizing compilers. In Proceedings of the 7th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’09). IEEE Computer Society, Los Alamitos, CA, 254–264.
Todd Mytkowicz, Peter F. Sweeney, Matthias Hauswirth, and Amer Diwan. 2007. Time interpolation: So many metrics, so few registers. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’07). IEEE Computer Society, Los Alamitos, CA, 286–300.
Reena Panda, Shuang Song, Joseph Dean, and Lizy K. John. 2018. Wait of a decade: Did SPEC CPU 2017 broaden the performance horizon? In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA’18). IEEE Computer Society, Los Alamitos, CA, 271–282.
Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt. 2010. Google-wide profiling: A continuous profiling infrastructure for data centers. IEEE Micro 30, 4 (Jul. 2010), 65–79.
David K. Tam, Reza Azimi, Livio B. Soares, and Michael Stumm. 2009. RapidMRC: Approximating L2 miss rate curves on commodity systems for online optimizations. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XIV). Association for Computing Machinery, New York, NY, 121–132.
Dan Terpstra, Heike Jagode, Haihang You, and Jack Dongarra. 2010. Collecting performance data with PAPI-C. In Tools for High Performance Computing. Springer, Berlin, 157–173.
Jan Treibig, Georg Hager, and Gerhard Wellein. 2010. LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. In Proceedings of the 39th International Conference on Parallel Processing Workshops. IEEE Computer Society, Los Alamitos, CA, 207–216.
Bin Wang, Ahmed Ali-Eldin, and Prashant Shenoy. 2021. LaSS: Running latency sensitive serverless computations at the edge. In Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing (HPDC’21). Association for Computing Machinery, New York, NY, 239–251.
Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Chen Zheng, Gang Lu, Kent Zhan, Xiaona Li, and Bizhu Qiu. 2014. BigDataBench: A big data benchmark suite from internet services. In Proceedings of the IEEE 20th International Symposium on High Performance Computer Architecture (HPCA’14). IEEE Computer Society, Los Alamitos, CA, 488–499.
Yi-Chao Wang, Jie Wang, Jin-Kun Chen, Si-Cheng Zuo, Xiao-Ming Su, and James Lin. 2020. NeoMPX: Characterizing and improving estimation of multiplexing hardware counters for PAPI. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER’20). IEEE Computer Society, Los Alamitos, CA, 47–56.
Vincent M. Weaver. 2013. Linux perf_event features and overhead. In Proceedings of the 2nd International Workshop on Performance Analysis of Workload Optimized Systems, FastPath, Vol. 13. IEEE Computer Society, Los Alamitos, CA.
Vincent M. Weaver. 2015. Self-monitoring overhead of the Linux perf_event performance counter interface. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’15). IEEE Computer Society, Los Alamitos, CA, 102–111.
Ahmad Yasin. 2014. A top-down method for performance analysis and counters architecture. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’14). IEEE Computer Society, Los Alamitos, CA, 35–44.
Ahmad Yasin, Yosi Ben-Asher, and Avi Mendelson. 2014. Deep-dive analysis of the data analytics workload in CloudSuite. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC’14). IEEE Computer Society, Los Alamitos, CA, 202–211.
Ahmad Yasin, Jawad Haj-Yahya, Yosi Ben-Asher, and Avi Mendelson. 2019. A metric-guided method for discovering impactful features and architectural insights for Skylake-based processors. ACM Trans. Archit. Code Optim. 16, 4, Article 46 (Dec. 2019), 25 pages.
Li Yi, Cong Li, and Jianmei Guo. 2020. CPI for runtime performance measurement: The good, the bad, and the ugly. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC’20). IEEE Computer Society, Los Alamitos, CA, 106–113.
Gerd Zellweger, Denny Lin, and Timothy Roscoe. 2016. So many performance events, so little time. In Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys’16). Association for Computing Machinery, New York, NY, Article 14, 9 pages.
Xiao Zhang, Eric Tune, Robert Hagmann, Rohit Jnagal, Vrigo Gokhale, and John Wilkes. 2013. CPI2: CPU performance isolation for shared compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys’13). Association for Computing Machinery, New York, NY, 379–391.