Article

Efficient Graph Representation Learning by Non-Local Information Exchange †

Departments of Computer Science and Psychiatry, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
*
Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in Wei, Z.; Wu, G. Non-local Exchange: Introduce Non-locality via Graph Re-wiring to Graph Neural Networks. In Proceedings of the NeurIPS 2024 Workshop on Behavioral Machine Learning, Vancouver, BC, Canada, 14 December 2024.
Electronics 2025, 14(5), 1047; https://doi.org/10.3390/electronics14051047
Submission received: 4 February 2025 / Revised: 28 February 2025 / Accepted: 2 March 2025 / Published: 6 March 2025
(This article belongs to the Special Issue Artificial Intelligence in Graphics and Images)

Abstract

Graphs are an effective data structure for characterizing ubiquitous connections as well as evolving behaviors that emerge in intertwined systems. Limited by the stereotype of node-to-node connections, learning node representations is often confined to a graph diffusion process in which local information is excessively aggregated as the random walk of graph neural networks (GNNs) explores far-reaching neighborhoods layer by layer. In this regard, tremendous efforts have been made to alleviate feature over-smoothing so that current backbones can be used in deep network architectures. However, compared to designing new GNNs, less attention has been paid to the underlying topology and to graph re-wiring, which mitigates not only the flaws of the random walk but also the over-smoothing risk by reducing unnecessary diffusion in deep layers. Inspired by the notion of non-local means in image processing, we propose a non-local information exchange mechanism that establishes an express connection to a distant node, instead of propagating information along the (possibly very long) original pathway node after node. Since seeking express connections throughout a graph can be computationally expensive in real-world applications, we propose a re-wiring framework (coined the express messenger wrapper) to progressively incorporate express links in a non-local manner, which allows us to capture multi-scale features without using a very deep model; our approach is thus free of the over-smoothing challenge. We integrate our express messenger wrapper with existing GNN backbones (using either graph convolution or tokenized transformers) and achieve a new record on the Roman-empire dataset as well as SOTA performance on both homophilous and heterophilous datasets.

1. Introduction

The graph is a universal data structure for modeling complex relations in the real world, from social network connections to protein interactions. In the past decade, remarkable strides have been made in the development of graph neural networks (GNNs) for node classification [1], link prediction [2], and graph recognition [3]. Following the trends in computer vision, the research focus has shifted to transformer-based backbones after the widespread adoption of graph convolution techniques [4,5], where global attention mechanisms [6] demonstrate great potential for alleviating longstanding issues in GNNs such as over-smoothing and poor long-range feature dependencies [7]. To address the computational challenges of large-scale graphs, tokenized graph transformer models [8] have recently been introduced, where the graph token is localized either at the node level [9] or within a sub-graph [10].
Despite various deep models for graph learning, current GNNs are de facto bound together under the overarching umbrella of a topological message-passing paradigm [11,12]. Given the observed graph feature representations, the driving factor of GNNs is to learn intrinsic feature representations by alternately (1) seeking an optimal feature subspace and (2) aggregating information within a certain graph neighborhood, such that the learned feature representations align better with the existing labels (supervised manner [13]) or exhibit a more structured behavior (unsupervised manner [14]). Since node-to-node information exchange is fundamentally underpinned by the graph topology, the wiring pattern of the graph becomes a pivotal factor steering the potential of a graph to be expressed, which we call the expressibility of nodes. Ref. [15] found that this potential for node classification in a graph is determined by the homophily ratio, which measures how likely an edge is to connect nodes with the same label; a graph with a high homophily ratio is considered homophilous and otherwise heterophilous. Setting graph re-wiring aside, methods have been developed for homophilous and heterophilous graphs based on information aggregation and gating, respectively [16].
Take the latest tokenized graph transformer NAGphormer [9] as an example. As shown in Figure 1, we fit the relationship between node expressibility and node prediction accuracy on nine popular graph datasets (Roman, Texas, Wisconsin, Squirrel, Chameleon, Amazon, CiteSeer, PubMed, and Cora). Specifically, we sort these nine graph datasets by their homophily ratio h. To demonstrate the pivotal role of topology in GNNs, we employ a proxy measurement of expressibility for the underlying graph structure (explained in Section 2.2). In Figure 1, we use two existing graph re-wiring methods: DropEdge [17] and GDC [18]. As explained in Section 2.1, DropEdge seeks to prevent overfitting by removing graph edges, while GDC resorts to adding edges. In spite of their different re-wiring strategies, the fitted relationship between node expressibility and node classification accuracy indicates a strong correlation: the higher the node expressibility, the greater the potential for improved node classification performance. Following this clue, we speculate that an alternative way to improve GNN performance is to seek a more informative graph topology. As summarized in Table 1, we propose a non-local exchange (NLE) manner of graph re-wiring that addresses over-smoothing and over-squashing by increasing node expressibility prior to model training. In contrast to previous graph re-wiring methods, NLE performs re-wiring while incorporating the original topology, analogous to non-local means for images.
Returning to the common wisdom of GNNs, there is a converging consensus that the message-passing mechanism captures global graph feature representations in a layer-by-layer fashion by progressively aggregating feature representations from nearby nodes to distant nodes. To go beyond the pairwise links in a graph, high-order network models such as hyper-graph techniques use hyperlinks to model relationships among multiple nodes; however, most graph applications are still conducted on one-simplex graphs [19]. The cost of such node-after-node message-passing is over-smoothed feature representations, due to an excessive number of feature aggregations along the pathway between two distant nodes on the graph [20,21].
Inspired by the non-local means technique in patch-based image processing [22], as shown in Figure 2, we introduce a non-local exchange (NLE) mechanism to the field of GNNs by re-wiring the original graph with express connections for distant node pairs. Conventional message-passing struggles with over-smoothing, as demonstrated in the top-right panel, where local features (centered at nodes #0 and #1) have to be aggregated multiple times until the message-passing process reaches node #3. In this context, combining local and global information by a conventional node-after-node message-passing mechanism often yields over-smoothed feature representations that mix all the information spanning the entire graph pathway. An analogous smoothing happens in image processing without non-local techniques, as shown in the bottom left of Figure 2. Thus, we migrate the same non-local idea from images to graphs. In the bottom right of Figure 2, the selected express connection not only allows us to effectively capture global information but also offers a new window to maintain the distinctive power of the underlying feature representations.
It is worth noting that our idea of express connections is simple yet effective. The conjecture is that the graph re-wiring step enhances the expressibility of the topology, thus enabling GNNs to learn informative global features before the incoming features undergo significant smoothing by the conventional message-passing routine. As a result, a deep network architecture is no longer the only option for capturing global feature representations, which reduces the over-smoothing risk in GNNs. In this context, we further devise a hierarchical re-wiring wrapper, called the express messenger (ExM), which naturally fits the layer-by-layer network architecture of GNN models.
The SOTA performance on various graph applications demonstrates that the novel NLE mechanism unleashes the power of current GNN models, including graph convolution networks (GCNs) and graph transformer models. Recall the empirical evaluation of graph expressibility versus node classification accuracy for DropEdge and GDC in Figure 1. Our NLE produces a more informative graph topology with the highest expressibility (blue line) over the counterpart re-wiring methods (pink and green lines) and the original graph (black dashed line), yielding a higher node classification accuracy.
Our main contributions are summarized below:
  • We propose a non-local information exchange mechanism to efficiently integrate feature representations from non-local neighborhoods with a collection of express connections.
  • Current works focus on the optimization of feature representation with a deep GNN model by overcoming the over-smoothing issue. We address this challenge from the novel perspective of expressibility—i.e., the insight of our NLE mechanism is to directly combine local and global information through express links before the over-smoothed features undermine the discriminative power of feature representation.
  • We devise our ExM wrapper as a pre-processing step, which generates new topologies in advance and facilitates various GNN models to retain SOTA performance on both homophilous and heterophilous graph data.

2. Preliminaries

Let graph G = {V, E}, where V is the node set and E is the edge set. The strength of node-to-node connectivity is encoded in a weighted adjacency matrix A. Specifically, we use N_i^k to denote the set of nodes whose topological distance from the i-th node is k hops on the graph G. Note that N_i^0 indicates the nodes directly connected to the i-th node, without any hops. Following the notation in the GNN literature, the latent graph feature representations at the l-th layer of the neural network are denoted by H^(l).

2.1. Related Works

In general, graph feature representation learning can be formulated as H^(l+1) = F_W(A, H^(l)), where F is a model-specific learnable function parameterized by W. Taking a vanilla GCN [1] as an example, F_W(A, H) = A(HW) alternately (1) projects the current feature representations to a new subspace by HW and (2) aggregates the updated features over the graph neighborhoods {N_i^0}_{i=1}^{|V|} underlying the graph topology steered by A. It is apparent that node features eventually become identical after a sufficient number of graph diffusion steps [23], which is the smoking gun of over-squashing and over-smoothing in GNN models [20,21].
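For illustration, the following is a minimal PyTorch sketch of one such vanilla GCN layer, F_W(A, H) = A(HW); the names (gcn_layer, A_hat) are ours and do not come from any released implementation. Iterating the diffusion shows how node features collapse toward identical values.

```python
import torch

# Minimal sketch of one vanilla GCN layer following the formulation F_W(A, H) = A(HW):
# (1) project features by HW, then (2) aggregate over neighbors by A.
def gcn_layer(A_hat: torch.Tensor, H: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """A_hat: (n, n) normalized adjacency; H: (n, d_in); W: (d_in, d_out)."""
    return A_hat @ (H @ W)

# Toy usage: many diffusion steps (i.e., a very deep stack of layers) over-smooth the features.
n, d = 5, 8
A = torch.eye(n)
A[:-1, 1:] += torch.eye(n - 1)              # path graph edges 0-1, 1-2, ...
A = ((A + A.T) > 0).float()                 # symmetrize (self-loops included)
deg_inv_sqrt = A.sum(1).pow(-0.5)
A_hat = deg_inv_sqrt[:, None] * A * deg_inv_sqrt[None, :]   # symmetric normalization
H = torch.rand(n, d)
for _ in range(50):
    H = gcn_layer(A_hat, H, torch.eye(d))
print(H)                                     # rows become nearly identical (over-smoothing)
```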
In this regard, striking efforts have been made to preserve local structures. Since all neural networks are driven by gradients, the gradient gating technique [24] has been proposed to alleviate over-smoothed graph features at each node by penalizing message-passing between two nodes bearing distinct features. Alternatively, several interesting works address the same problem by re-wiring the graph topology. For instance, DropEdge [17] randomly removes a certain number of edges from the input graph at each training epoch in order to avoid excessive message-passing. On the contrary, a fully adjacent (FA) layer is introduced in [20], where every pair of nodes is connected by an (immediate) edge. Although the FA layer is empirically effective in preventing over-smoothing by easing the bottleneck of information flow, connecting all pairs of nodes without any selection significantly contradicts the inductive bias inherent in graph data learning [25]. In between these two extremes, a sparsification technique is used in graph diffusion convolution (GDC) [18] to obtain more localized graph convolution. A more advanced edge-based combinatorial curvature is introduced in [21], which formulates message-passing in graph learning as a stochastic discrete Ricci flow (SDRF). As summarized at the bottom of Figure 2, our proposed non-local information exchange mechanism inherits the benefits of the aforementioned topology re-wiring techniques while being more general and model-agnostic: it improves current GNN models instantly as a seamless pre-processing step.
Alongside the streamlined evolution of transformers in computer vision [26,27], the focus of research has shifted to enhancing long-range feature representations in graph-based transformers by capitalizing on the self-attention mechanism in the transformer backbone [6]. Specifically, the instance of a graph token can be defined at multiple levels (node, edge, or sub-graph [8,9,10,28,29]), which allows us to characterize the interactions between multi-scale graph feature representations across a wide range of neighborhood sizes. Even though local and global features exhibit distinct characteristics, current graph transformer models tend to mix these multi-scale features within the self-attention module. Notably, while the topology re-wiring technique has proven effective in vanilla GNN models, it remains unexplored in the context of graph transformer models.

2.2. Proxy Measurement for Expressibility

The one-dimensional Weisfeiler–Lehman (1-WL) algorithm, also known as node-wise color refinement, is a classic graph embedding approach for testing whether two graphs are isomorphic [30]. In [31], it is observed that color-coded labels obtained by high-order WL algorithms (histograms of colors) often yield higher graph prediction accuracy. We borrow the node labels of the 1-WL algorithm as a proxy for expressibility, in order to show how node expressibility changes after graph re-wiring. A step-by-step description of this proxy measurement can be found in Appendix A.
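As a rough illustration, the sketch below runs 1-WL color refinement on a toy graph; how the final colors are normalized into the expressibility score reported in Table 2 is our assumption, since the exact proxy is deferred to Appendix A.

```python
# Minimal sketch of 1-WL color refinement, used here as an expressibility proxy.
# The reduction of the final colors to a single score is an assumption of ours.
def wl_colors(adj, n_iter=3):
    """adj maps each node id to its neighbor list."""
    colors = {v: 0 for v in adj}                           # uniform initial color
    for _ in range(n_iter):
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        relabel = {}
        for sig in sigs.values():
            relabel.setdefault(sig, len(relabel))          # hash each signature to a fresh color id
        colors = {v: relabel[sigs[v]] for v in adj}
    return colors

# Path graph 0-1-2-3: endpoints and interior nodes receive different colors.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
colors = wl_colors(adj)
print(colors)                                              # e.g. {0: 0, 1: 1, 2: 1, 3: 0}
print(len(set(colors.values())) / len(adj))                # fraction of distinct colors as a crude proxy
```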
We calculate the expressibility proxy for the re-wired graphs after applying different re-wiring methods. Due to space limits, we only display the node expressibility of five heterophilous graphs in Table 2. Surprisingly, (nearly) none of the current re-wiring methods enhances node expressibility. Based on the observed relationship between node expressibility and GNN performance, the reduced expressibility seems to indicate worse graph learning performance after applying current re-wiring methods. Regrettably, this supposition is confirmed by experiments: the node prediction accuracy on the re-wired graph topology is consistently lower than (or at best equal to) that of training the same GNN model on the original graph, implying that the graph topology altered by current re-wiring methods actually undermines node classification.

3. Methods

To unleash the expressibility of a given topology, we present a novel non-local information exchange mechanism which is agnostic to various GNN models. As explained in Figure 2, our core idea is to capitalize on a collection of express connections for capturing multi-scale graph feature representations, thus being free of over-smoothed features due to the conventional diffusion-based message-passing mechanism.

3.1. Graph Re-Wiring by Hierarchical Non-Local Information Exchange

Supposing the non-local search is performed throughout the entire graph, the exhaustive NLE mechanism yields a complementary graph topology 1 − A in addition to the original graph A, where 1 denotes the all-one matrix. Since most intelligent systems process information in a hierarchical manner [32], we present a progressive NLE that gradually increases the hopping step k and includes distant (but topologically connected) nodes v_j for the underlying node v_i (i ≠ j) via a node selection function (1: selected for express connection; 0: otherwise):
$$ \mathrm{hop}(\mathbf{A}, k)_{ij} = \begin{cases} 1 & \text{for } j \in \mathcal{N}_i^{k} \setminus \mathcal{N}_i^{k-1} \\ 0 & \text{for } j \notin \mathcal{N}_i^{k} \setminus \mathcal{N}_i^{k-1} \end{cases} \tag{1} $$
where the hopping step k acts as the search radius from v_i to v_j on the graph. We then obtain the new topology as this binary adjacency matrix.
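To make Equation (1) concrete, the sketch below computes hop(A, k) for a small graph by marking exactly those node pairs whose shortest-path distance is k; this is our own reading of the definition in NumPy, not the authors' released code.

```python
import numpy as np

# Sketch of the hop(A, k) re-wiring in Equation (1): connect node i to the nodes that are
# reachable within k hops but not within k-1 hops (i.e., exactly k hops away).
def hop(A: np.ndarray, k: int) -> np.ndarray:
    A = (A > 0).astype(int)
    n = A.shape[0]
    reach_k = reach_km1 = np.eye(n, dtype=int)               # start from the node itself
    for _ in range(k):
        reach_km1 = reach_k
        reach_k = ((reach_k + reach_k @ A) > 0).astype(int)  # expand the reachable set by one hop
    H = ((reach_k - reach_km1) > 0).astype(float)            # nodes at distance exactly k
    np.fill_diagonal(H, 0.0)
    return H

# Path graph 0-1-2-3: hop(A, 2) adds express links 0-2 and 1-3.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
print(hop(A, 2))
```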

3.2. Express Messenger (ExM) Wrapper

It is evident that the original adjacency matrix A (node-to-node topology used in conventional GNN models) and the re-wired topology hop(A, k) (non-local topology re-wired by NLE) represent completely different aspects of graph topology. In this context, we devise two topology wrappers that allow us to tailor our new NLE topology around the original A.

3.2.1. Exhaustive vs. Progressive NLE

In Figure 3, we use a simulated graph to demonstrate the re-wiring results of NLE. Specifically, the new graph topology hop(A, ·) derived from A is shown on the left part of Figure 3. To clearly display how the ExM wrapper sorts the new topology together with the original one, two possible ways are shown on the right part of Figure 3. The benefit of these two progressive schemes is that only a small amount of information is introduced at a time, since a single hop(A, ·) is used for each layer. This has been validated on real-world data; the increased node expressibility values are shown in the last column of Table 2. As anticipated, we observe significantly enhanced node prediction accuracy when applying our re-wired graph in the experiments. To use NLE exhaustively, we simply apply the reverse of the original topology, 1 − A, during wrapping.

Cascaded Express Messenger (C-ExM) Wrapper

By fixing the hopping step k, we first input the original graph topology A, i.e., H^(l+1) ← F_W(A, H^(l)). Next, the re-wired topology of the next hopping step becomes the input to learn global feature representations delivered by the express connections, i.e., H^(l+2) ← F_W(hop(A, k), H^(l+1)). The step-by-step implementation is summarized along with the graphic diagram shown in the purple box of Figure 3. Note that, in case the predefined layer number L is greater than the largest hopping step, the C-ExM wrapper sorts only the original topology in the GNN model. The pseudocode can be found in Algorithm 1.
Algorithm 1 Cascaded ExM wrapper
Require: Range of hopping steps [K_start, K_end]
    hop(A, k) ← Equation (1), ∀ k ∈ [K_start, K_end]
    for l = 1 : L do
        if l is odd or l/2 > (K_end − K_start) then
            H^(l+1) ← F_W(A, H^(l))
        else if l is even then
            H^(l+1) ← F_W(hop(A, l/2), H^(l))
        end if
    end for
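The following Python sketch mirrors Algorithm 1 as printed, assuming a generic per-layer function F_W(adj, H) (e.g., a GCN layer); the function name and the dictionary of pre-computed hop topologies are our own conventions, not the authors' code.

```python
# Sketch of the cascaded ExM (C-ExM) forward pass from Algorithm 1.
# layers is a list of callables F_W(adj, H); hop_adjs maps hop step k to hop(A, k).
def c_exm_forward(A, hop_adjs, H, layers, k_start, k_end):
    for l, F_W in enumerate(layers, start=1):
        if l % 2 == 1 or l // 2 > (k_end - k_start):
            H = F_W(A, H)                      # odd layers (or exhausted hops): original topology
        else:
            H = F_W(hop_adjs[l // 2], H)       # even layers: express connections at hop l/2
    return H
```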

Aggregated Express Messenger (A-ExM) Wrapper

Instead of alternately applying the new topology as in the C-ExM wrapper, we use the original topology A along with the re-wired topology hop(A, k) in parallel. As shown in the graphic diagram in the brown box of Figure 3, we combine the outputs from the two topologies by elementwise sum. The pseudocode can be found in Algorithm 2.
Algorithm 2 Aggregated ExM wrapper
Require: Range of hopping steps [K_start, K_end]
    hop(A, k) ← Equation (1), ∀ k ∈ [K_start, K_end]
    for l = 1 : L do
        H^(l+1) ← F_W(A, H^(l))
        for k = K_start : K_end do
            H^(l+1) ← H^(l+1) + F_W(hop(A, k), H^(l))
        end for
    end for
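Analogously, a sketch of Algorithm 2, again assuming the same generic F_W(adj, H) interface and our own naming:

```python
# Sketch of the aggregated ExM (A-ExM) forward pass from Algorithm 2: at every layer,
# the original topology and all re-wired topologies are processed in parallel and the
# resulting features are combined by elementwise sum.
def a_exm_forward(A, hop_adjs, H, layers):
    for F_W in layers:
        H_next = F_W(A, H)                     # message-passing on the original graph
        for hop_A in hop_adjs.values():        # plus each express-connection topology
            H_next = H_next + F_W(hop_A, H)
        H = H_next
    return H
```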

3.3. Implementation

Our ExM framework is a model-agnostic GNN plug-in, whose implementation is shown in Algorithms 1 and 2. In the experiments, we analyze which variant should be used according to the degree and homophily ratio of the graph. The memory cost of Algorithm 1 is O(HN) in addition to that of the GNN, required to store the re-wired graph topologies, while the time cost remains the same as the original GNN. Moreover, Algorithm 2 not only requires the same O(HN) space but also doubles the GNN computation, since the model must be evaluated twice per layer.

4. Experiments

We evaluate our proposed ExM wrapper through three types of experiments: (1) over-smoothness evaluation by stacking the GNN to 1000 layers on synthetic data to show the effectiveness of the ExM wrapper in mitigating over-smoothing issues; (2) benchmark graph re-wiring performance by comparing to prior re-wiring methods; and (3) benchmark graph representation learning performance on node classification using our ExM wrapper.

4.1. Experimental Setting

4.1.1. Dataset

The experiments were carried out on a set of nine publicly accessible graph datasets, also used in our previous work [33], encompassing three homophilous graphs (Cora, PubMed, and Citeseer) and six heterophilous graphs (Texas, Wisconsin, Chameleon, Squirrel, Roman-empire (Roman), and Amazon-ratings (Amazon)). The first seven graphs are widely recognized for evaluating graph representation learning techniques, while the last two were introduced by [34] and are distinguished by having no duplicate nodes. For the evaluation of graph representation learning, we employed 10 random splits (train:val:test = 6:2:2) for each of the first seven graphs, as suggested by [35]. Among the five new graph datasets introduced by [34], we utilized two of them; the remaining three datasets, which feature only two classes, were employed specifically in our ablation studies. Our split settings were consistent with those outlined by [34]. For further details regarding the profiles of the graph datasets employed, please refer to Appendix A.

4.1.2. Experiment Setup

We conducted experiments using six baseline graph neural networks to evaluate our ExM framework: graph attention networks (GATs) [13], graph transformer networks (GTs) [36], GraphSAGE (referred to as SAGE) [37], NAGphormer (referred to as NAG) [9], JacobiConv [38], and a feature selection graph neural network (FSGNN) [39]. Among these, GAT, GT, and SAGE implementations were provided by [34], while the rest were implemented by their respective studies. We set the hyperparameters according to the best model the corresponding studies provided.
Our ExM wrapper was employed without altering the model architecture to ensure a fair comparison. This means that we employed the same number of GNN layers in each comparison. The performance is evaluated using the optimal combination of hyperparameters: K s t a r t , K e n d , and an exhaustive or progressive NLE mechanism. The impact of varying these hyperparameters is detailed in our ablation studies. Additionally, we present results along with the graph homophily ratio h, in line with the approach outlined in [15], providing further insights on how the characteristics of graph topology influence graph learning.

4.2. Results on Synthetic Data

Over-smoothness is delayed after using NLE. As shown in Figure 4, we adopted the experimental configuration outlined in [24,40] to examine the behavior of the Dirichlet energy under different graph re-wiring approaches, including NLE. In addition to those baselines, NLE was applied to baselines with an initial residual connection (skip connection), which is one solution for preventing over-smoothness [41]. This investigation was conducted on a two-dimensional 10 × 10 grid graph with four-neighbor connectivity. The node features were sampled randomly from the uniform distribution U([0, 1]) and subsequently passed through a GNN with 1000 layers, employing randomly assigned weights. Specifically, GCN, GAT, GraphSAGE, NAGphormer, and G2GNN are used to show the effectiveness of NLE, where G2GNN [24] is a gradient-gated GNN in which gating between node features keeps them distinct regardless of the depth of the GNN.
According to Figure 4, NLE not only mitigates the risk of over-smoothing issues for normal baselines to learn non-local information but also enlarges the Dirichlet energy for baselines that have a low risk of over-smoothing.
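For reference, the Dirichlet energy tracked in Figure 4 can be computed as below; the 10 × 10 grid construction and the normalization (a plain sum over edges, E(H) = 1/2 Σ_(i,j) a_ij ||h_i − h_j||²) are our assumptions about the exact setup.

```python
import torch

# Sketch of the Dirichlet-energy diagnostic used in Figure 4: it measures how different
# neighboring node features are, and decays toward zero when features over-smooth.
def dirichlet_energy(A: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    diff = H[:, None, :] - H[None, :, :]                   # pairwise feature differences
    return 0.5 * (A * diff.pow(2).sum(-1)).sum()

# Toy check on a 10 x 10 grid graph with random features, as in the synthetic setup.
n = 100
idx = torch.arange(n).reshape(10, 10)
A = torch.zeros(n, n)
A[idx[:, :-1].flatten(), idx[:, 1:].flatten()] = 1         # horizontal edges
A[idx[:-1, :].flatten(), idx[1:, :].flatten()] = 1         # vertical edges
A = A + A.T                                                # four-neighbor connectivity
H = torch.rand(n, 2)
print(dirichlet_energy(A, H))                              # shrinks as layers smooth H
```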

4.3. Results on Real Data

4.3.1. Benchmark Graph Re-Wiring Techniques

Since transformer models have started to prevail in the GNN field, we selected the latest graph transformer model NAG as the reference to benchmark the effect of various graph re-wiring techniques, which included DropEdge, GDC, SDRF, C-ExME (cascaded with exhaustive NLE), A-ExME (aggregated with exhaustive NLE), C-ExMP (cascaded with progressive NLE), and A-ExMP (aggregated with progressive NLE).
Table 3 presents the results of graph re-wiring performance across nine node classification tasks. Leveraging non-local information led to improvements in node classification across various graph types. For instance, on homophilous graphs such as Cora (h = 0.81) and Pubmed (h = 0.80), as well as on heterophilous graphs like Roman-empire (h = 0.05), the baseline performance improved. Among five heterophilous graphs ( h 0.24), four improved significantly (denoted by “*”). With the exception of Squirrel and Citeseer, where our method achieved the second-best performance, it secured a top-three position in all other cases. In contrast, previous graph re-wiring approaches faced challenges with homophilous graphs and exhibited lower accuracy on graphs with h > 0.24.

4.3.2. Evaluation on Graph Feature Representation Learning

Table 4 contrasts the baseline GNN models with baseline + ExM wrapper, where our ExM wrapper helps most baseline methods retain SOTA performance. For clarity, we only report the highest performance among the four variations of the ExM wrapper; it is worth noting, however, that in the majority of cases the C-ExMP wrapper outperforms the other three counterparts. Specifically, the ExM wrapper allows us to secure a top-three ranking across seven different graph datasets characterized by diverse homophily ratios h. The ExM wrapper is seamlessly integrated into various baselines, ranging from GAT to the recent FSGNN.
In addition, we include results from four very recent SOTA methods, GPRGNN, UGT, GloGNN, and GBKGNN, alongside the conventional GCN with five layers for comprehensive comparison. Consistent with the previous experiment, GNN models with the ExM wrapper often demonstrate more pronounced improvements in performance for graphs, especially with smaller h values. This aligns with the established understanding that heterophilous graphs necessitate specific model architectures for achieving SOTA performance. Conversely, we showcase that leveraging non-local information exchange leads to even higher accuracy, underscoring the efficiency of this approach in graph representation learning. Meanwhile, it is worth noting that SAGE + ExM achieves a new performance record on the Roman-empire dataset, as reported in the current leaderboard by [34].

4.3.3. Performance of Deep GNN

We vary the layer number of three baselines to show how over-smoothness is addressed on six real-world datasets. As shown in Figure 5, the layer number has less influence on performance than the model architecture. For example, GAT (blue bars) on the Roman-empire and Texas datasets shows decreasing accuracy with the original graph topology as layers are added, while accuracy increases for GAT combined with our ExM (orange bars).

4.3.4. Ablation Studies

The learning performance of the cascaded and the aggregated ExM does not manifest a significant difference when the graph average degree is low. It is worth noting that A-ExM is more suitable for high-degree heterophilous graphs, while C- and A-ExM are similar on homophilous graphs according to Table 3. For detailed ablation studies of C- vs. A-ExM, refer to Table A3, where the C-ExM performed better on graphs with lower degree.
Additionally, we conducted experiments with various combinations of K_start and K_end, where K_start ∈ [1, 4] and K_end ∈ [2, 7], as detailed in Table A2. In summary, we found that the combination of K_start = 1 and K_end = 2 consistently performed best across most graph datasets with five-layer baseline models. (1) Increasing K_end while keeping K_start = 1 led to improved performance on heterophilous graphs whose nodes at k-step distance consistently contributed positive non-local information; e.g., Wisconsin (h = 0.20) and Cornell (h = 0.13) show rising curves as K_end goes from one to three. Given that some datasets perform better at k = 4 but not 3, such as Roman-empire and Minesweeper, K_start = 1 and K_end = 2 is the safest setting for obtaining better results. Conversely, the impact on homophilous graphs (e.g., Questions (h = 0.84), Cora (h = 0.81), Tolokers (h = 0.60), and Amazon-ratings (h = 0.38)) was relatively minor, as nodes of the same class tend to be closely connected. (2) Increasing K_start from one to four while maintaining a fixed range had no discernible benefit on Roman-empire (h = 0.05); as K_start increased, accuracies dropped from 83.74% to 75.97% and from 90.34% to 84.81%. NLE with k ≥ 5 only showed notable improvement on the synthetic data of Minesweeper, from 86.54% to 90.82% and from 87.09% to 92.93%. In light of this, on real-world data, it is not worth applying NLE to very distant nodes, since no promising improvement is observed across real-world datasets.

5. Discussion

First, the motivation of our graph re-wiring method, which provides a complementary approach to enhancing learning performance on graph data by adjusting the graph topology for the underlying GNN backbone, is illustrated in Figure 1. The idea of re-wiring to hop neighbors has been used in the k-hop GNN [42] and the k-hop tokenized graph transformer (a.k.a. NAGphormer [9]), which connect all k-hop nodes at each GNN layer. In contrast, our non-local strategy produces a new topology restricted to the boundary nodes of hop neighborhoods and acts as a pre-processing step, which is more efficient and agnostic to various GNN models. The ease of using our method to create embeddings of communities in social networks advances the aim of incorporating as much information as possible; in particular, topic-oriented analysis of social networks [43] can benefit from our method.
Second, a limitation of NLE is the minor improvement observed on homophilous graphs, as seen in Table 4. This implies that nodes containing valuable non-local information are more distant than in heterophilous graphs. Future work addressing this limitation could incorporate clustering of homophilous nodes to better determine the hyperparameters K_start and K_end.
Lastly, skip connections have been explored in [1,41], which share the same concept of establishing express connections as our non-local information exchange idea. Skip connections, however, are built into the GNN backbone rather than the topology. Although skip connections have demonstrated effectiveness in alleviating over-smoothing issues, we found that the Dirichlet energy could be further improved on top of the re-wired graph topology using our method (shown on the right of Figure 4). This result implies that our graph re-wiring method and the skip connection component are complementary to each other.

6. Conclusions

In this work, we address the challenge of graph feature over-smoothing from the novel perspective of topological re-wiring. We put the spotlight on efficiently capturing global information without a very deep GNN model, by handling a graph that has a set of direct connections between distant but topologically connected nodes. By doing so, the new re-wired graph holds non-local connections and allows the GNN to learn global features while reducing the chance of over-smoothing features in the conventional node-after-node graph diffusion process. Following this notion, we present the express messenger wrapper, serving as an agnostic pre-processing method for re-wiring the graph topology via non-local information exchange, which is reminiscent of the non-local means technique that prevailed in the field of image processing more than a decade ago. In practice, the new graph topology produced by our ExM can enhance expressibility for representation learning and instantly boost the node classification performance of current GNN models. Our analysis of Dirichlet energy shows that our method can effectively mitigate over-smoothing issues. Experiments show that our ExM wrapper outperforms previous graph re-wiring methods and helps various GNN backbones achieve SOTA performance on nine homophilous/heterophilous graph datasets, indicating the great potential of our method in other graph learning applications such as brain connectomes and drug and medicine data. Our future work will include (1) extensive benchmarks with existing GNN models and public datasets, (2) studying learnable graph re-wiring approaches to non-local information exchange, and (3) developing deeper theoretical support for topological re-wiring, i.e., the topology transformation.

Author Contributions

Conceptualization, Z.W., T.D., J.D., and G.W.; methodology, Z.W. and G.W.; software, Z.W.; validation, Z.W.; formal analysis, Z.W.; investigation, Z.W. and G.W.; resources, G.W.; data curation, Z.W. and T.D.; writing—original draft, Z.W.; writing—review and editing, Z.W., T.D., J.D., and G.W.; supervision, G.W.; project administration, G.W.; funding acquisition, G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Institutes of Health AG070701, AG073927, AG068399, and Foundation of Hope.

Data Availability Statement

All graphs used in this research are available at https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html. The dataset splitting is available at https://github.com/yandex-research/heterophilous-graphs, accessed on 3 February 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Data Statistics

In Table A1, we show the profile of all graph datasets used in this paper.
Table A1. Graph data profiles.
Dataset          h      Node Number   Edge Number   Class Number
Questions        0.84   48,921        153,540       2
Cora             0.81   2708          10,556        7
Pubmed           0.80   19,717        88,648        3
Citeseer         0.74   3327          9104          6
Minesweeper      0.68   10,000        39,402        2
Tolokers         0.60   11,758        519,000       2
Amazon-ratings   0.38   24,492        93,050        5
Chameleon        0.24   2277          31,421        5
Squirrel         0.22   5201          198,493       5
Wisconsin        0.20   251           515           5
Cornell          0.13   183           298           5
Texas            0.11   183           325           5
Roman-empire     0.05   22,662        32,927        18

Appendix B. Detailed Analysis in Ablation Studies

We show the results of the three ablation studies, in addition to the experiments in the main text, to support our claims in Section 4.3.4. Please refer to Section 4.3.4 in the main text to see our description of the results presented below.
Table A2. Performance with cascaded ExM using different [ K s t a r t , K e n d ] , where baselines have 10 layers.
Model          [K_start, K_end]   Roman          Amazon         Mine.          Squirrel       Citeseer
GAT + C-ExM    [1, 4]             83.74 ± 0.55   46.44 ± 0.42   86.54 ± 0.78   61.55 ± 1.33   74.25 ± 1.27
               [2, 5]             76.44 ± 0.87   46.83 ± 0.72   90.56 ± 0.59   61.48 ± 0.99   73.58 ± 1.97
               [3, 6]             76.19 ± 0.82   46.76 ± 0.71   90.96 ± 0.74   62.51 ± 0.81   73.84 ± 1.27
               [4, 7]             75.97 ± 0.70   47.00 ± 0.61   90.82 ± 0.82   61.91 ± 1.32   74.25 ± 1.52
SAGE + C-ExM   [1, 4]             90.34 ± 0.42   50.25 ± 0.49   87.09 ± 0.79   43.93 ± 1.21   75.30 ± 1.74
               [2, 5]             85.00 ± 0.52   50.44 ± 0.79   92.51 ± 0.40   43.30 ± 1.86   74.57 ± 1.50
               [3, 6]             85.11 ± 0.50   51.30 ± 0.52   92.81 ± 0.39   44.85 ± 1.60   74.64 ± 1.72
               [4, 7]             84.81 ± 0.50   51.59 ± 0.39   92.93 ± 0.40   44.76 ± 1.79   74.35 ± 1.37
Table A3. Performance of cascaded vs. aggregated ExM on heterophilous graph dataset, where baselines have 3 layers.
                 Roman          Amazon.        Chameleon      Squirrel
Average Degree   1.45           3.80           13.80          38.16
GAT + C-ExM      80.77 ± 0.45   48.43 ± 0.44   70.88 ± 1.60   76.00 ± 0.95
GAT + A-ExM      79.99 ± 0.49   48.17 ± 0.53   73.18 ± 2.08   77.52 ± 0.86
FSGNN + C-ExM    72.61 ± 0.57   44.56 ± 0.59   75.99 ± 1.31   66.45 ± 4.17
FSGNN + A-ExM    72.17 ± 0.35   45.42 ± 0.45   77.76 ± 0.94   73.85 ± 2.08

References

  1. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  2. Zhang, M.; Chen, Y. Link prediction based on graph neural networks. In Proceedings of the NIPS’18: 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
  3. Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How Powerful are Graph Neural Networks? In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  4. Yun, S.; Jeong, M.; Kim, R.; Kang, J.; Kim, H.J. Graph transformer networks. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  5. Kreuzer, D.; Beaini, D.; Hamilton, W.; Létourneau, V.; Tossou, P. Rethinking graph transformers with spectral attention. Adv. Neural Inf. Process. Syst. 2021, 34, 21618–21629. [Google Scholar]
  6. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  7. Chen, D.; Lin, Y.; Li, W.; Li, P.; Zhou, J.; Sun, X. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), New York, NY, USA, 7–12 February 2020; Volume 34, pp. 3438–3445. [Google Scholar]
  8. Kim, J.; Nguyen, D.; Min, S.; Cho, S.; Lee, M.; Lee, H.; Hong, S. Pure transformers are powerful graph learners. Adv. Neural Inf. Process. Syst. 2022, 35, 14582–14595. [Google Scholar]
  9. Chen, J.; Gao, K.; Li, G.; He, K. NAGphormer: A tokenized graph transformer for node classification in large graphs. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  10. He, X.; Hooi, B.; Laurent, T.; Perold, A.; Lecun, Y.; Bresson, X. A Generalization of ViT/MLP-Mixer to Graphs. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J., Eds.; Proceedings of Machine Learning Research, PMLR: Birmingham, UK, 2023; Volume 202, pp. 12724–12745. [Google Scholar]
  11. Cai, S.; Li, L.; Deng, J.; Zhang, B.; Zha, Z.J.; Su, L.; Huang, Q. Rethinking Graph Neural Architecture Search From Message-Passing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6657–6666. [Google Scholar]
  12. Feng, J.; Chen, Y.; Li, F.; Sarkar, A.; Zhang, M. How powerful are k-hop message-passing graph neural networks. Adv. Neural Inf. Process. Syst. 2022, 35, 4776–4790. [Google Scholar]
  13. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  14. Mo, Y.; Peng, L.; Xu, J.; Shi, X.; Zhu, X. Simple unsupervised graph representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 28 February–1 March 2022; Volume 36, pp. 7797–7805. [Google Scholar]
  15. Zhu, J.; Yan, Y.; Zhao, L.; Heimann, M.; Akoglu, L.; Koutra, D. Beyond homophily in graph neural networks: Current limitations and effective designs. Adv. Neural Inf. Process. Syst. 2020, 33, 7793–7804. [Google Scholar]
  16. Zheng, X.; Wang, Y.; Liu, Y.; Li, M.; Zhang, M.; Jin, D.; Yu, P.S.; Pan, S. Graph neural networks for graphs with heterophily: A survey. arXiv 2022, arXiv:2202.07082. [Google Scholar]
  17. Rong, Y.; Huang, W.; Xu, T.; Huang, J. Dropedge: Towards deep graph convolutional networks on node classification. arXiv 2019, arXiv:1907.10903. [Google Scholar]
  18. Gasteiger, J.; Weißenberger, S.; Günnemann, S. Diffusion improves graph learning. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  19. Battiston, F.; Cencetti, G.; Iacopini, I.; Latora, V.; Lucas, M.; Patania, A.; Young, J.G.; Petri, G. Networks beyond pairwise interactions: Structure and dynamics. Phys. Rep. 2020, 874, 1–92. [Google Scholar]
  20. Alon, U.; Yahav, E. On the bottleneck of graph neural networks and its practical implications. arXiv 2020, arXiv:2006.05205. [Google Scholar]
  21. Topping, J.; Di Giovanni, F.; Chamberlain, B.P.; Dong, X.; Bronstein, M.M. Understanding over-squashing and bottlenecks on graphs via curvature. arXiv 2021, arXiv:2111.14522. [Google Scholar]
  22. Buades, A.; Coll, B.; Morel, J.M. A non-local algorithm for image denoising. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; IEEE: New York, NY, USA, 2005; Volume 2, pp. 60–65. [Google Scholar]
  23. Coifman, R.R.; Lafon, S. Diffusion maps. Appl. Comput. Harmon. Anal. 2006, 21, 5–30. [Google Scholar] [CrossRef]
  24. Rusch, T.K.; Chamberlain, B.P.; Mahoney, M.W.; Bronstein, M.M.; Mishra, S. Gradient gating for deep multi-rate learning on graphs. arXiv 2022, arXiv:2210.00513. [Google Scholar]
  25. Battaglia, P.W.; Hamrick, J.B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. Relational inductive biases, deep learning, and graph networks. arXiv 2018, arXiv:1806.01261. [Google Scholar]
  26. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  27. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  28. Zhang, Z.; Liu, Q.; Hu, Q.; Lee, C.K. Hierarchical graph transformer with adaptive node sampling. Adv. Neural Inf. Process. Syst. 2022, 35, 21171–21183. [Google Scholar]
  29. Zhang, J.; Zhang, H.; Xia, C.; Sun, L. Graph-bert: Only attention is needed for learning graph representations. arXiv 2020, arXiv:2001.05140. [Google Scholar]
  30. Morris, C.; Ritzert, M.; Fey, M.; Hamilton, W.L.; Lenssen, J.E.; Rattan, G.; Grohe, M. Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 4602–4609. [Google Scholar]
  31. Morris, C.; Fey, M.; Kriege, N.M. The power of the Weisfeiler-Leman algorithm for machine learning with graphs. arXiv 2021, arXiv:2105.05911. [Google Scholar]
  32. Goodfellow, I.; Bengio, Y.; Courville, A.; Bengio, Y. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Volume 1. [Google Scholar]
  33. Wei, Z.; Wu, G. Non-local Exchange: Introduce Non-locality via Graph re-wiring to Graph Neural Networks. In Proceedings of the NeurIPS 2024 Workshop on Behavioral Machine Learning, Vancouver, BC, Canada, 14 December 2024. [Google Scholar]
  34. Platonov, O.; Kuznedelev, D.; Diskin, M.; Babenko, A.; Prokhorenkova, L. A critical look at the evaluation of GNNs under heterophily: Are we really making progress? arXiv 2023, arXiv:2302.11640. [Google Scholar]
  35. Pei, H.; Wei, B.; Chang, K.C.C.; Lei, Y.; Yang, B. Geom-gcn: Geometric graph convolutional networks. arXiv 2020, arXiv:2002.05287. [Google Scholar]
  36. Shi, Y.; Huang, Z.; Feng, S.; Zhong, H.; Wang, W.; Sun, Y. Masked label prediction: Unified message-passing model for semi-supervised classification. arXiv 2020, arXiv:2009.03509. [Google Scholar]
  37. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. In Proceedings of the NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  38. Wang, X.; Zhang, M. How powerful are spectral graph neural networks. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; PMLR: Birmingham, UK, 2022; pp. 23341–23362. [Google Scholar]
  39. Maurya, S.K.; Liu, X.; Murata, T. Simplifying approach to node classification in graph neural networks. J. Comput. Sci. 2022, 62, 101695. [Google Scholar] [CrossRef]
  40. Rusch, T.K.; Chamberlain, B.; Rowbottom, J.; Mishra, S.; Bronstein, M. Graph-coupled oscillator networks. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; PMLR: Birmingham, UK, 2022; pp. 18888–18909. [Google Scholar]
  41. Chen, M.; Wei, Z.; Huang, Z.; Ding, B.; Li, Y. Simple and deep graph convolutional networks. In Proceedings of the International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2020; PMLR: Birmingham, UK, 2020; pp. 1725–1735. [Google Scholar]
  42. Nikolentzos, G.; Dasoulas, G.; Vazirgiannis, M. k-hop graph neural networks. Neural Netw. 2020, 130, 195–205. [Google Scholar] [CrossRef] [PubMed]
  43. Cauteruccio, F.; Kou, Y. Investigating the emotional experiences in eSports spectatorship: The case of League of Legends. Inf. Process. Manag. 2023, 60, 103516. [Google Scholar] [CrossRef]
Figure 1. The relationship between the expressibility of the original and three re-wired graph topologies and modern GNN performance (in node classification accuracy). Different landmarks represent different datasets. Colors denote graph re-wiring methods. Red arrow lines highlight the improvement by our re-wiring method. The red box explains that ours produces an easier graph to classify by changing the topology, as nodes of the same class (denoted by colored ovals) become more separated. Note that all re-wiring methods are applied with the same baseline hyperparameters.
Figure 2. Non-local information exchange mechanism (right), where node colors denote the distance (marked by numbers) between a node and the red one, nodes with mixed colors denote node features aggregated by message-passing, solid lines are graph edges, and dashed lines denote express connections. The mechanism is reminiscent of the non-local means technique for image processing (left), which captures global information through express connections (red dashed lines), reducing the over-smoothing risk in GNNs. Both ideas integrate information beyond the spatial or topological neighborhood in order to preserve distinctive feature representations.
Figure 3. (left): Illustration of progressive NLE for a simulated graph, from the original adjacency matrix A to the re-wired topology hop(A, k) (k = 1, 2, ...). (right): ExM sorts the original graph and the new graphs, cascaded (C-ExM) or aggregated (A-ExM), as input to any GNN. The green arrow indicates the pipeline of an arbitrary GNN.
Figure 4. Comparison of re-wiring methods: DropEdge, GDC, and our NLE. NLE can mitigate over-smoothing issues; compared with previous graph re-wiring methods, over-smoothness is delayed after using NLE. Even though G2GNN or a skip connection almost eliminates smoothed node features, using NLE leads to a larger Dirichlet energy than the original graph topology.
Figure 5. Bar plots of performance by using different layer numbers on real data.
Table 1. A summary of the novelty of our Non-Local Exchange (NLE) method, where ✗ and ✓ denote whether each method lacks or has the corresponding capability, respectively.
                         GCN   GDC   DropEdge   NAGphormer   NLE (Ours)
Long-range interaction
Graph re-wiring
Anti-over-smoothing
Table 2. Comparing expressibility changes by related re-wiring methods and our NLE.
           h      Original   DropEdge        GDC             NLE (Ours)
Cham.      0.24   26.54      24.56 (−1.98)   33.55 (+7.01)   35.31 (+8.77)
Squirrel   0.22   22.96      23.44 (+0.48)   22.96 (−0.00)   22.96 (−0.00)
Wiscon.    0.20   47.06      41.18 (−5.88)   41.18 (−5.88)   50.98 (+3.92)
Texas      0.11   64.87      64.87 (−0.00)   64.87 (−0.00)   70.27 (+5.40)
Roman.     0.05   24.32      22.71 (−1.61)   19.91 (−4.41)   28.08 (+3.76)
Note that red text highlights the performance dropping, and green the performance enhancing.
Table 3. Benchmark results for graph re-wiring performance. NAG is the baseline here. Red denotes the first rank, followed by blue (second), and violet (third). “E” means exhaustive NLE, and “P” is progressive NLE. “–” means the model reported no result.
Roman   Texas   Wisconsin   Squirrel   Chameleon   Amazon   Citeseer   Pubmed   Cora
h = 0.05 h = 0.11 h = 0.20 h = 0.22 h = 0.24 h = 0.38 h = 0.74 h = 0.80 h = 0.81
Baseline 73.57 ± 1.30 70.54 ± 3.07 67.45 ± 1.80 34.85 ± 0.85 46.05 ± 1.10 47.08 ± 0.60 71.32 ± 0.65 87.65 ± 0.24 86.08 ± 0.69
DropEdge 75.97 ± 0.83 66.48 ± 3.24 61.18 ± 2.12 34.49 ± 1.13 47.23 ± 1.09 45.34 ± 0.50 72.40 ± 0.46 87.33 ± 0.46 69.08 ± 0.67
GDC 75.62 ± 0.26 72.43 ± 1.62 70.19 ± 3.37 33.84 ± 0.93 45.48 ± 1.09 46.20 ± 0.55 71.76 ± 0.58 87.80 ± 0.28 85.67 ± 0.75
SDRF 70.35 ± 0.60 61.55 ± 0.86 37.67 ± 0.23 44.46 ± 0.17 72.58 ± 0.20 79.10 ± 0.11 82.76 ± 0.23
C-ExME 73.99 ± 0.49 73.51 ± 2.02 * 72.75 ± 2.05 * 35.47 ± 1.25 48.86 ± 1.12 46.79 ± 0.70 71.59 ± 0.64 88.64 ± 0.38 86.40 ± 0.37
A-ExME 76.51 ± 0.72 * 71.62 ± 3.02 70.79 ± 0.50 34.98 ± 0.69 47.39 ± 1.60 47.34 ± 0.67 72.43 ± 0.76 87.96 ± 0.24 86.68 ± 0.55
C-ExMP 69.78 ± 0.99 72.70 ± 3.07 62.55 ± 0.43 37.42 ± 1.19 * 48.93 ± 1.53 49.67 ± 0.48 70.92 ± 0.39 87.87 ± 0.32 86.58 ± 0.25
A-ExMP 77.00 ± 0.59 * 68.92 ± 3.25 54.90 ± 0.88 34.49 ± 1.01 58.60 ± 1.12 * 48.33 ± 0.46 71.70 ± 0.76 87.18 ± 0.40 86.42 ± 0.47
Note that * indicates a significant improvement of ours compared to the baseline.
Table 4. Performance by using NLE compared to other SOTA methods. Red denotes the first rank, blue the second, and violet the third; “–” means not applicable, and “Imp.” stands for “improvement”.
Roman   Texas   Wisconsin   Squirrel   Chameleon   Citeseer   Pubmed   Average
h = 0.05 h = 0.11 h = 0.20 h = 0.22 h = 0.24 h = 0.74 h = 0.80 Imp.
GCN 73.69 ± 0.74 59.46 ± 5.25 59.80 ± 6.99 36.89 ± 1.34 59.82 ± 2.58 76.68 ± 1.64 87.38 ± 0.66
GPRGNN 64.85 ± 0.27 92.92 ± 0.61 49.93 ± 0.53 67.48 ± 0.40 67.63 ± 0.38 85.07 ± 0.09
UGT 86.67 ± 8.31 81.60 ± 8.24 66.96 ± 2.49 69.78 ± 3.21 76.08 ± 2.50
GloGNN 59.63 ± 0.69 84.32 ± 4.15 87.06 ± 3.53 57.54 ± 1.39 69.78 ± 2.42 77.41 ± 1.65 89.62 ± 0.35
GBKGNN 74.57 ± 0.47 81.08 ± 4.88 84.21 ± 4.33 79.18 ± 0.96 89.11 ± 0.23
GAT 80.92 ± 0.68 58.92 ± 5.81 60.39 ± 3.67 62.00 ± 1.29 69.85 ± 1.72 74.03 ± 1.23 87.86 ± 0.42 7.80 ± 6.82
w/ExM 86.06 ± 0.35 75.14 ± 7.73 72.35 ± 7.36 69.54 ± 1.30 73.18 ± 1.60 74.25 ± 1.27 88.04 ± 0.42
GT 85.70 ± 0.99 62.43 ± 7.80 57.65 ± 4.91 58.09 ± 1.50 68.99 ± 2.55 74.68 ± 1.46 87.51 ± 0.52 2.95 ± 3.29
w/ExM 89.20 ± 0.63 68.11 ± 10.1 67.06 ± 8.71 58.09 ± 1.50 70.92 ± 1.44 74.83 ± 1.22 87.51 ± 0.52
SAGE 86.96 ± 0.56 80.27 ± 5.70 81.37 ± 5.49 44.53 ± 1.08 62.92 ± 1.70 75.19 ± 1.51 88.59 ± 0.38 1.55 ± 1.42
w/ExM 90.34 ± 0.42 82.97 ± 6.38 84.71 ± 2.23 45.49 ± 1.17 62.92 ± 1.70 75.51 ± 1.46 88.75 ± 0.38
NAG 73.57 ± 1.30 70.54 ± 3.07 67.45 ± 1.80 34.85 ± 0.85 46.05 ± 1.10 71.32 ± 0.65 87.65 ± 0.24 4.13 ± 3.69
w/ExM 77.00 ± 0.59 73.51 ± 2.02 72.75 ± 2.05 37.42 ± 1.19 58.60 ± 1.12 72.43 ± 0.68 88.64 ± 0.38
JacobiConv 71.25 ± 0.45 76.49 ± 6.74 76.67 ± 5.00 50.21 ± 2.39 67.94 ± 1.13 76.00 ± 1.44 88.95 ± 0.46 1.73 ± 1.11
w/ExM 72.75 ± 0.64 79.46 ± 3.86 79.22 ± 4.49 53.07 ± 2.54 69.89 ± 1.50 76.08 ± 1.54 89.14 ± 0.42
FSGNN 67.93 ± 0.53 85.95 ± 5.77 85.88 ± 4.94 73.55 ± 2.16 78.16 ± 1.11 76.52 ± 1.76 89.63 ± 0.40 1.04 ± 1.53
w/ExM 72.61 ± 0.57 86.49 ± 4.52 87.06 ± 3.84 73.85 ± 2.08 78.22 ± 0.81 76.95 ± 1.25 89.72 ± 0.49
Average Imp. 3.61 ± 1.16 5.18 ± 5.16 5.62 ± 3.85 4.04 ± 6.13 3.30 ± 4.29 0.39 ± 0.34 0.27 ± 0.33 3.20 ± 4.25