1. Introduction
Graph data structures have been used in a wide range of scientific applications, and engineers and scientists analyze graph data to discover knowledge and insights by using various graph algorithms. Thus, it is important to select appropriate graph algorithms for graph analytic processes and facilitate the process. In many complex graph algorithms, breadth-first search (BFS) is a key building block of them. The BFS is an iterative algorithm, which is initiated on a source vertex and visits all its neighbors. In the following iteration, all vertices visited at the previous iteration become new sources of that iteration, and all neighbors of them are visited. The algorithm is terminated when all reachable vertices are visited.
Figure 1 presents how BFS works. This iterative procedure consists of the basic operations such as inspecting neighbors of a vertex and expanding a set of frontiers (i.e., vertices visited at the previous iteration) of each iteration. These two operations are also applied to algorithms such as label propagation, which can be used to detect fraud in commercial transactions [
1], and PageRank, which can be used to measure the objective reputation of a certain website [
2]. Thus, it is crucial to understand the BFS execution mechanism and characteristics to enhance the performance of BFS itself as well as expand its use to other graph algorithms.
The performance of a BFS in GPU environments can be enhanced further by exploiting parallelism [
3,
4,
5,
6,
7,
8,
9]. For example, the workload used to inspect the neighbors is proportional to the number of edges to be checked at each iteration, and the operation can be parallelized. Thus, using thousands of GPU cores for parallelism helps the graph traversal inspect the neighbors with a reduced execution time, achieving a high level of performance.
Even if the parallelism of the GPU helps the overall performance of a BFS, such performance could still be degraded by the huge workload at some parts of the entire traversal. The workload of a BFS is determined by the number of frontiers and can therefore fluctuate. In their well-known publication [
10], Beamer et al. observed that the conventional BFS algorithm used to inspect the neighbors of the frontiers is ineffective when the number of frontiers is large. To address this problem, the authors proposed a direction-optimizing BFS that adopts two variations (i.e., directions) of a BFS algorithm: a push (conventional BFS) and a pull. In the pull phase, the active vertices are those not visited until the current iteration is reached, unlike in the push phase, and they are checked to determine whether their neighbors are frontiers. Consequently, an inspection can succeed with a high probability if the number of frontiers is large. Because it showed a performance enhancement in terms of the BFS execution, the direction-optimizing scheme has become popular with BFS schemes.
However, the determination of the direction of a BFS at each iteration or criteria of the selecting directions has not been discussed thoroughly. Many state-of-the-art graph processing frameworks [
4,
8,
10] include a direction-optimizing BFS. However, their methods for determining the direction are naive and have drawbacks in their direction selecting decisions. Because these methods do not handle the varying workload efficiently, they make inappropriate directional decisions at the dataset level (i.e., different datasets) and phase level (i.e., within a single BFS execution). Consequently, the performance can easily degrade. Moreover, the methods yield a high computational overhead, and the overall execution time of the BFS is increased.
In this study, we introduce four workload state metrics and analyze the effect of each metric on the performance of a BFS. Based on our study on new metrics, we propose a novel direction-optimizing method, called Selecting the direction Upon Recent workload of Frontiers (SURF), that handles the workload of each iteration of execution with consideration of new metrics that represent the workload states of the frontiers. The proposed SURF utilizes the metrics as features of the MLP model to predict the label of the direction. The contributions of our study are as follows:
We propose new workload state metrics and analyze their impact based on theoretical proofs.
We propose a novel direction-optimizing scheme, i.e., SURF, based on the observation of the new metrics and provide the source code of our proposed scheme, which is publicly available at
https://github.com/kljp/SURF/ (accessed on 18 March 2022).
To measure the effectiveness of the proposed SURF, we provide thorough experiment results using public datasets collected from various perspectives.
The remainder of this paper is organized as follows.
Section 2 introduces the background of a direction-optimizing BFS and previous studies related to this topic.
Section 3 introduces the workload state metrics and provides theoretical proof to observe their impact on the performance of BFS.
Section 4 presents the details of the proposed SURF implementation.
Section 5 presents the experiment results, focusing mainly on comparisons between SURF and existing approaches. Finally, we provide some concluding remarks in
Section 6.
3. Can “Workload State” Help Achieve Efficient Direction Selections in BFS?
The workload changes at each iteration of the BFS algorithm execution, and we hypothesize that such changes can be used to achieve efficient direction selections in a BFS. In this section, we define four metrics that describe workload state and give the physical meaning of them. Using these metrics, we describe our hypotheses and provide a logical proof.
The changes defined as the “workload state” in this study are used to represent the fluctuation level of the number of frontiers (). The workload state is significantly affected by the increase in , and the sharpness of changes. Because the performance of a push and pull phase depends largely on the change in in the direction-optimizing BFS, it is important to understand the behavior of the changes in the workload state for selecting the traversal direction. In this study, we introduce four metrics derived from to illustrate the relationship between the workload state and the traversal direction.
Let
be the number of frontiers at iteration
x. We then define
as the variation of
; in particular,
denotes the variation of
from iteration
to
x (
), as follows:
We determine whether
increases by observing the sign of
(i.e., a positive value indicates an increase, and a negative value indicates a decrease).
Figure 3a shows a plot after iteration
k. A positive
implies that
has increased, whereas a negative
implies that
has decreased. Based on this observation, we found that the metric
can be related to the performance of the push phase (i.e., using
in the push phase to achieve a higher performance).
Observation 1. As increases, the cost of the push phase also increases.
Let
be the maximum number of threads assigned to the physical cores of the GPU. It is assumed that each thread takes a frontier from the frontier set. In the CUDA [
11] architecture, 32 threads are executed as a warp unit. To execute the next instruction, all 32 threads must complete the current instruction. Thus, each warp must complete work on the current 32 frontiers before taking the next 32 frontiers (synchronization barrier). Based on this principle, we define
as the cost of the synchronization barrier at iteration
x because each warp requires
times of synchronization barrier to finish the job at iteration
x [
9]. We then define the increased cost from iteration
to
x as follows:
Cost increases as increases. Therefore, the cost of the push phase increases as increases.
Nevertheless,
only helps to determine whether
increases. To understand in detail how the workload changes, we should determine how sharply
changes or whether the sign of
has changed. Thus, we also define
as the variation of
, where
denotes the variation of
from iteration
to
x (
), as follows:
Metric
represents the convexity of the graph for
on a Cartesian plane. Similar to
, we determine the direction in which the graph for
is convex by observing the sign of
(i.e., positive is convex downward and negative is convex upward).
Figure 3b shows a plot of the traversals in which the convexity of the graph changes in a row after iteration
k. The graph shows that
decreases after iteration
. Thus, the shape of the graph appears to be convex upward. However,
increases after iteration
, and thus the graph has a convex downward shape. Based on this observation, we found that the metric
can be related to the performance of the push phase, such as
.
Observation 2. As increases, the cost of the push phase also increases.
From the definition of
, we deduce another equation as follows:
In this equation, is constant because the value of has already been determined at iteration . Thus, cost increases as increases, i.e., the cost of the push phase increases as increases.
To determine the ratio of the current frontiers of the entire traversal, we define as the workload occupancy at each iteration, where is the number of vertices in a graph. The metric is related to the performance of the pull phase.
Observation 3. As increases, the performance of the pull phase also increases.
We define the probability that the inspected neighbor is the frontier as follows:
Let
be the degree of the
kth vertex among unvisited vertices. We then define the probability that the parent of that vertex is discovered at the current iteration as follows:
We then define the probability that the parents of all unvisited vertices are discovered at the current iteration, where
is the number of unvisited vertices at iteration
x (the current iteration), as follows:
Thus, increases with . Because the pull phase is effective when as many parents as possible are discovered, based on the definition of , the performance of the pull phase increases as increases. Furthermore, is proportional to . Therefore, the pull phase becomes more effective when increases.
In addition, we define
as the number of unvisited vertices, and
denotes the value of
at iteration
x; thus,
is defined as follows:
We also define to indicate how many vertices remain to be explored at the following iterations as ratio, particularly at iteration x. Thus, we can inspect where the current iteration is located using , for example, whether the current iteration is located in the early, middle, or late stages of the entire traversal. Based on this observation, we found that metric is related to the performance of the pull phase.
Observation 4. As decreases, the performance of the pull phase increases.
Let
t be the time required to inspect one neighbor. Using
t, we define
, the expected time to inspect the neighbors to find a parent of the
kth vertex among unvisited vertices at the current iteration as follows:
Thus, the higher
is, the higher
becomes. It is assumed that each thread takes a vertex from the unvisited vertices. We then define
as the number of vertices that each vertex must process. Using
, we also define
T, the time expected to process all unvisited vertices at iteration
x, where
is the integer interval
, as follows:
Regardless of each fraction of the sum, the smaller is, the smaller T becomes. Therefore, the running time of the pull phase decreases as decreases ().
In summary, we classified the four workload state metrics into two categories based on their relationship with the traversal direction. First, the fluctuations in and are directly related to the performance of the push phase, as presented in Observations 1 and 2. They are used to describe the variation in during the entire traversal process and are sensitively affected by the degree of change in the workload, such as explosive increases or drastic decreases in . Thus, it is important to determine whether the push phase is sufficiently effective at handling the current workload using and . Second, and can be applied as important factors for measuring the performance of the pull phase. The metric may be crucial for determining the effectiveness of the pull phase in terms of the success of the parent inspection, as presented in Observation 3. By contrast, metric is crucial for determining the effectiveness in terms of the length of the running time of the pull phase, as presented in Observation 4. We can therefore conclude that and are crucial parameters for measuring the expected performance during the pull phase.
By observing the workload state with , , , and , the BFS algorithm can determine which traversal direction is more effective for the performance at the current iteration. Our proposed direction-optimizing method, SURF, adopts these workload state metrics as features for predicting the BFS direction. In the next section, we describe the details of SURF implementation.