[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Next Article in Journal
RST Resilient Watermarking Scheme Based on DWT-SVD and Scale-Invariant Feature Transform
Previous Article in Journal
From Intrusion Detection to an Intrusion Response System: Fundamentals, Requirements, and Future Directions
Previous Article in Special Issue
A Geo-Clustering Approach for the Detection of Areas-of-Interest and Their Underlying Semantics
You seem to have javascript disabled. Please note that many of the page functionalities won't work as expected without javascript enabled.
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Fuzzy Random Walkers with Second Order Bounds: An Asymmetric Analysis

by
Georgios Drakopoulos
1,
Andreas Kanavos
1,* and
Konstantinos Tsakalidis
2
1
Computer Engineering and Informatics Department, University of Patras, 26504 Patras, Greece
2
Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L3G1, Canada
*
Author to whom correspondence should be addressed.
Algorithms 2017, 10(2), 40; https://doi.org/10.3390/a10020040
Submission received: 22 December 2016 / Revised: 22 March 2017 / Accepted: 27 March 2017 / Published: 30 March 2017
(This article belongs to the Special Issue Humanistic Data Processing)

Abstract

:
Edge-fuzzy graphs constitute an essential modeling paradigm across a broad spectrum of domains ranging from artificial intelligence to computational neuroscience and social network analysis. Under this model, fundamental graph properties such as edge length and graph diameter become stochastic and as such they are consequently expressed in probabilistic terms. Thus, algorithms for fuzzy graph analysis must rely on non-deterministic design principles. One such principle is Random Walker, which is based on a virtual entity and selects either edges or, like in this case, vertices of a fuzzy graph to visit. This allows the estimation of global graph properties through a long sequence of local decisions, making it a viable strategy candidate for graph processing software relying on native graph databases such as Neo4j. As a concrete example, Chebyshev Walktrap, a heuristic fuzzy community discovery algorithm relying on second order statistics and on the teleportation of the Random Walker, is proposed and its performance, expressed in terms of community coherence and number of vertex visits, is compared to the previously proposed algorithms of Markov Walktrap, Fuzzy Walktrap, and Fuzzy Newman–Girvan. In order to facilitate this comparison, a metric based on the asymmetric metrics of Tversky index and Kullback–Leibler divergence is used.

1. Introduction

The Random Walker principle is the algorithmic cornerstone for building a number of heuristics for large graphs, namely for those with the fundamental property that neither their vertex nor their edge set fits in main memory. Such heuristics are efficient in terms either of computation time or memory requirements or often both. Under this principle, a virtual entity usually named the Random Walker visits the vertices. Within the scope of this article, the probabilistic strategy followed by the Random Walker to decide which vertex will visit next is of paramount importance, although, depending on the problem under study, other properties of the Random Walker may be of interest.
Virtual or ideal entities play an important role in science and engineering, mainly as a means to prove a theorem, to establish ideal performance limits, and to provide grounds for rejecting a conjecture based on a reductio ad absurdum methodology. Consider, for instance, the particle sorting demon of Maxwell [1,2] with its connections to algorithmic information theory and the steam engine of Heron of Alexandria [3]. In addition, the Random Walker principle itself has been applied to a number of graph analytics such as vertex similarity [4] and graph cuts [5] as well as to image processing [6].
A central finding of the theory of scale-free graphs, corroborated by significant evidence from a broad range of fields such as genomics, computational neuroscience, combinatorics, and social network analysis, states that they strongly tend to exhibit modularity. Namely, scale free graphs are recursively composed of vertex communities. The latter is a crucial factor for simultaneously achieving scalability, low-delay local information propagation, and structural robustness [7,8]. For instance, a large online social media group for Roman history afficionados can be composed of overlapping but to a great extent distinct specialized communities about Roman politics, law, and military, whereas the latter may be further subdivided into smaller communities regarding weaponry, tactics, legion standards, key battles, distinguished commanders, and the Praetorian guard.
Even though knowledge about communities offers a deep insight to graph structure, locating them is intractable. Consequently, numerous community discovery algorithms have been developed. Two of the most prominent ones are the Newman–Girvan [8] and the Walktrap [9] algorithms. The former is deterministic and is based on local edge density, while the latter is heuristic and relies on an edge crossing Random Walker. In [10], variants of both algorithms for edge-fuzzy graphs were proposed. Furthermore, in [11], a random walk approach for community detection in complex networks is introduced.
This article extends the work of [12]. Initially, the implementation of the Markov Walktrap algorithm, which is an extended version of Fuzzy Walktrap, was firstly proposed in [10]. Fuzzy Walktrap in turn has been based on deterministic Walktrap algorithm [9]. Furthermore, a method for algorithmically assessing the performance of Markov Walktrap relative to Fuzzy Walktrap and Fuzzy Newman–Girvan, being another fuzzy community detection technique [10], are thoroughly presented. Finally, one very important aspect to be mentioned is the expression of Markov Walktrap in Cypher, which composes the main query language for Neo4j [13].
The primary contribution of this article is the implementation over Neo4j of Chebyshev Walktrap, a community discovery algorithm designed for edge-fuzzy graphs, a class of fuzzy graphs [14] used among others in [10], which is based on the Random Walker algorithmic principle. Additionally, Chebyshev Walktrap relies on the competitive factors of second order statistics though the Chebyshev inequality and on an optional relocation capability in order to bound unnecessarily costly walks and, thus, remaining inside a community and being trapped for too long within the boundaries of a community, respectively. The relocation aspect was also backported to the Markov Walktrap algorithm first proposed in [15]. The effect of relocation on the community coherence was evaluated based on the asymmetric Tversky index using the Fuzzy Newman–Girvan algorithm from [15] as baseline, while its effect on the output distribution was assessed with the asymmetric Kullback–Leibler divergence.
The remainder of this article is structured as follows. Scientific literature regarding community discovery is reviewed in Section 2. The fuzzy graph model is outlined in Section 3 while Markov Walktrap and Fuzzy Walktrap algorithms are presented in Section 4. Experimental results are discussed in Section 5. Finally, the main findings of this paper as well as future research directions are discussed in Section 6. Table 1 summarizes the symbols used in this paper.

2. Related Work

From a system perspective, graph analytics play a crucial role in data mining systems such as Google Pregel or in massive machine learning frameworks such as GraphLab (https://turi.com/). Moreover, recently, there is strong interest for scalable, production grade graph databases [13,16] such as BrightStar (https://brightstardb.com), Neo4j (www.neo4j.com), Titan (https://github.com/thinkaurelius/titan), Sparksee (www.sparsity-technologies.com) and GraphDB (www.ontotext.com).
Traditionally, from an algorithmic viewpoint, analytics include structural [17,18] and spectral [19,20] partitioning, where a graph is split according to some functional constraints such as flow or edge density. Efficient information diffusion in large graphs is also of interest [21,22], especially for online political campaigns and digital marketing. The Random Walker principle has been also applied to two other important metrics, namely vertex similarity [4] and heuristic minimum cuts [5]. Both metrics can also be treated deterministically, especially in the context of social network analysis [23,24]. Community structure discovery provides insight to the inner workings of a particular graph [7,8,17], while metrics such as those in [25] control the discovery process quality. Persistent graphs can be instrumental in designing rollback capabilities in graph databases [26].
Among the numerous applications of graphs or linked data, one can find Web searching and ranking [27] with established algorithms such as PageRank [28] and HITS [29]. Bibliometric and scientometric data analysis [30] can boost collaboration between researchers, while image segmentation [6] is central to computer vision and robotics. Social network analysis has greatly benefited from structural [31] or functional [32,33] community detection algorithms. Additionally, influence and perceived social status in online social media have been tied to the participation in communities [34]. Message diffusion within a social graph is studied [34,35], while [36,37,38] deal with emotional modeling with respect to user influence [36]. Finally, random walkers have served as models for the propagation of computer viruses both in single systems and in networks, including LANs and the Internet [39,40]. Within this context, the strategy or the mix of strategies followed by the random walker is of paramount importance as it affects the entity and number of resources susceptible to infection.
First and second order statistics are used across a number of fields. In [41], a channel estimation methodology based on first order statistics is proposed. Methods for blind source separation using second order statistics include [42,43]. A comprehensive approach about the applications of higher order methods is given in [44] signal processing and in [45] for biomedical engineering. In [46], a third order method was presented for adaptively scheduling biosignal processing applications at the operating system level. Independent Component Analysis (ICA), a powerful signal processing technique, is based on higher order spectra [47]. Among the multitude of ICA applications is source separation in EEG waveforms [48].

3. Edge-Fuzzy Graphs

3.1. Definitions

Within the scope of this paper, the edge-fuzzy graphs are probabilistic and combinatorial hybrids comprised of a fixed set of vertices V and a fuzzy set of edges E ˜ . Formally,
Definition 1.
A homogeneous edge-fuzzy graph is the ordered triplet
G = V , E ˜ , h ,
where V = v k is the set of vertices, E ˜ = e k V × V is the fuzzy set of edges, and h is the edge membership function h : E 0 , 1 , which quantifies the degree of participation of each e k to G [10]. Moreover:
  • the vertex set V is fixed, namely they belong to G with probability one,
  • the distribution h is the same for each edge,
  • the existence probability of e k is drawn independently for each edge.
A subtle point is that h does not affect the structural properties of the graph in the sense that no edges are added or deleted, except when for a particular e k holds that h e k = 0 . If h is continuous, the probability of this ocurring is zero. However, if h is discrete, then depending on h a potentially non-negligible portion of the edges may be deleted. In this article, h was chosen so that only at most an exponentially small proportion of the edges would be assigned to a zero weight. Consequently, the underlying graph preserved its original structure along with any associated connectivity patterns. Otherwise, if a considerable fraction of edges were to be deleted, then the resulting graph would behave more like an Erdös-Rényi graph. The latter are known to be easily constructed by randomly sampling a graph space or, equivalently, the edges of K n but their properties deviate in a significant way from those of real world, large graphs.
Observe that there is no fuzziness whatsoever regarding vertices as by definition they always exist with probability one. In scientific literature, the existence of fuzzy graph classes is prominent. Vertices are fuzzy as well as their fuzziness interacts with that of the edges, mostly by long product chains. Such graphs are beyond the scope of this article.
Definition 2.
Under the fuzzy graph model of Defintion 1, the cost δ e k of traversing e k is
δ e k = 1 h e k 1 , + , h e k 0 , 1 ,
which expresses the intuitive requirement that edges which are less likely to belong to the graph are also harder to cross.
Depending on the application, δ may well be connected through another non-linear transform to h as long as edges with high h e k are easy to cross and edges with low h are difficult to cross such as
δ e k = 1 1 + h 2 e k 1 2 , 1 , δ e k = ln 1 h e k = ln h e k 0 , + .
Definition 3.
The cost Δ p j of a fuzzy path p j = e 1 , , e m is the sum of the cost of its individual edges
Δ p j = e k p j δ e k = k = 1 m 1 h e k = m H h e 1 , , h e m , h e k 0 , 1 k m ,
where H h e 1 , , h e m is the harmonic mean of h e k defined as
H s 1 s n = 1 1 n k = 1 n 1 s k = n 1 s 1 + + 1 s n , s k 0 , 1 k n .
By construction Δ p j is bounded as follows
1 max 1 k m h e k Δ p j p j 1 min 1 k m h e k .
Whether the above bounds are loose depends on the variability of the actual values h e k . That is, if the latter are drawn from a distribution which favors extreme values, Δ p j will tend to be close to these bounds. It should also be noted that the variance Δ p j is strongly dependent on that of h e k . Moreover, Δ p j is prone to outliers, which might lead to an unrealistically high average fuzzy path length. This can be remedied by taking into account the variance of the fuzzy path length. Finally, a low Δ p j tends to contain almost exclusively low edge costs δ e k or, equivalently, edges with high probability of belonging to the graph, an argument which agrees with the weak law of large numbers. In turn, this suggests the intuitive corollary that low cost paths are comprised almost exclusively of edges that are less likely to be fuzzy. This corollary can be used in order to design efficient hybrid probabilistic and combinatorial algorithms based on dynamic programming for finding and enumerating low cost paths akin to the way a similar observation has led to the development of shortest paths relying on dynamic programming in deterministic graphs.
From a probabilistic viewpoint, the sum k = 1 m δ e k is interesting by itself as a finite but possibly large sum of inverse random variables. Notice that the central limit theorem may not be applied in such a setting since the variance of the h e k might be infinite. If this is not the case, different bounds can be computed depending on the distribution of h e k such as the abovementioned central limit theorem, a Poisson bound, a power law bound, or finally approximations based on Markov or Chebyshev inequalities, the Chernoff bound, or on the Gnedenko extreme value theorem. Notice that the effect of a single edge which is exceedingly difficult to cross can be instrumental in shaping graph communities.
The numerical properties of the above sum are also of interest. As values of possibly uneven orders of magnitude may be added, catastrophic cancellation may occur resulting in the loss of meaningful information. This might happen if the summation is executed in an order left to the implementation. On the other hand, the summation order dictated by the Priest algorithm [49] results in the least possible loss of significant decimal digits by adding only numbers of comparable magnitude. Another option would be to substitute the harmonic mean with its thresholded counterpart
H s 1 , , s m ; τ 0 = m 1 m 1 max h e k , τ 0 = m 1 max h e 1 , τ 0 + + 1 max h e m , τ 0 .
For other uses of the thresholded harmonic and geometric means, see [50], while, for the effect of finite precision arithmetic to long biosignals, see [51].
An alternative for long paths would be to substitute, under certain conditions, the finite sum with an appropriate integral. Assuming with no loss of generality that h e 1 = min 1 k m h e k 0 and h e m = max 1 k m h e k 0 , then
Δ p j = k = 1 m 1 h e k ρ 0 + h e 1 h e m 1 x d x = ln h e m ln h e 1 + ρ 0 = ln h e m h e 1 + ρ 0 ,
where ρ 0 is an optional correction factor. A finer approach requiring more probabilistic information about the longer paths of a given graph would be to partition such a path p j so that
Δ p j = k = 1 m 1 h e k i = 1 n ln h e u i ln h e i + ρ i = i = 1 n ln h e u i ln h e i + i = 1 n ρ i = i = 1 n ln h e u i ln h e i + ρ 0 .
Selecting n and forming the sets h e u i i = 1 n , h e i i = 1 n , and ρ i i = 1 n is not a trivial task. Instead, choosing such an approach might be a viable solution only for certain combinations of h and p j . Techniques for estimating the variability as well as the cardinality of large sets such as [52] can be useful while pursuing this approach.
It should be emphasized that the class of fuzzy graphs of Definition 1 can be well considered as a typical example of higher order data. This is attributed to the inherently distributed way information is stored in a graph, in this particular case as edge existence probabilities. In order for meaningful information regarding path costs to be mined, a non-negligible fraction of edges must be crossed and, thus, the interplay of a number of edges must be considered.

3.2. Reciprocal Random Variables

Because of the Definitions 2 and 3 for δ e k and Δ p j , respectively, the properties of an inverse random variable gain more interest. The following definition is straightforward.
Definition 4.
The inverse distribution of a mass distribution function of a random variable X is defined as the mass distribution function of 1 X [53].
Property 1.
In the continuous case, the distributions of X and 1 X are linked as
f 1 X y = 1 y 2 f X 1 y , y 0 .
Proof. 
The cumulative distribution of 1 X is defined as
F 1 X y = prob 1 X y = prob X 1 y = 1 prob X < 1 y = 1 F X 1 y .
By differentiating the last relationship, the stated result follows. ☐
For instance, if X is the continuous uniform random variable in α 1 , α 2 , where α 1 , α 2 0 , then the distribution of 1 X is
f 1 X y = 1 y 2 α 2 α 1 , y 1 α 2 , 1 α 1 .
Despite its simplicity, in certain scenaria, relationship (10) cannot be used. For instance, only the first moments of X may be known or y = 0 might be a legitimate value, in which case there is a singularity in the inversion of f X x . Instead, bounds are sought for the first moments of 1 X , which makes more sense from a programming viewpoint in the case of large graphs.
Jensen inequality provides a straightforward way to bound the expected value E 1 X by using the expected value E X of the non-zero random variable X.
Theorem 1.
(Jensen inequality) For any random variable and any convex function g x provided that both the domains of X and E X are subsets of the domain of g ·
g E X E g X .
Corollary 1.
The mean value of the strictly positive random variable 1 X has a lower bound of
1 E X E 1 X , X 0 .
Property 2.
Function g x = 1 x is convex when x > 0 .
Proof. 
Notice that the second derivative g 2 x = 2 x 3 is positive when x is positive. An alternative way to prove this claim is to apply the standard convexity definition. For every α 0 0 , 1 , x 1 > 0 , and x 2 > 0
g α 0 x 1 + 1 α 0 x 2 α 0 g x 1 + 1 α 0 g x 2 1 α 0 x 1 + 1 α 0 x 2 α 0 x 1 + 1 α 0 x 2 α 0 x 2 α 0 x 1 + 1 α 0 x 2 + 1 α 0 x 1 α 0 x 1 + 1 α 0 x 2 x 1 x 2 0 α 0 1 α 0 x 1 x 2 2 0 .
In order to derive realistic upper bounds for the path lengths of a given fuzzy graph, certain probabilistic inequalities can be employed. The first is Markov inequality, which establishes a first order bound for the probability of X taking very large values by stating that
Theorem 2.
(Markov inequality) The probability of a strictly positive random variable X exceeding γ 0 is bounded by
prob X γ 0 E X γ 0 , γ 0 > 0 , X > 0 .
Second order bounds can be derived by the Chebyshev inequality. The latter provides tighter bounds while lifting the positivity assumption.
Theorem 3.
(Chebyshev inequality) The probability of an arbitrary random variable X exceeding its expected value by a certain fraction γ 0 of its standard deviation as
prob X E X γ 0 Var X 1 γ 0 2 , γ 0 > 0 .
The Chebyshev inequality is generic enough to be applied in a number of scenaria, including those in the present article. Still, it should be noted that other techniques may provide sharper bounds in certain cases. For instance, when X is normally distributed, then specialized methods exist for evaluating the integral under its tail.
Estimating the variance of a transformed random variable can be done through the delta method.
Theorem 4.
(Delta method) Let X be a random variable whose expected value E X and variance Var X are known. For an analytic g x , the variance of g X can be estimated as
Var g X g 1 E X 2 Var X .
Proof. 
The first order approximation of the Taylor expansion of g · around E X is
g X = k = 0 + g k E X X k k ! g E X + g 1 E X X E X .
Taking the variance of both sides along with the identities,
Var α 0 + α 1 X = α 1 2 Var X , Var X E X = Var X ,
yields the stated result. ☐
Corollary 2.
For g x = 1 x , the delta method yields
Var 1 X Var X E X 4 = E X 2 E X 2 E X 4 = E X 2 E X 4 1 E X 2 , E X 0 .
The Markov and Chebyshev are but two of the probabilistic inequalities collectively known as concentration inequalities, the latter bound the deviation of a random variable or a sequence of random variables from a known value. Other such inequalities include the Talagrand, the Efron–Stein, and the Dvoretzky–Kiefer–Wolfowitz inequalities.

4. Family of Walktrap Heuristics

4.1. Deterministic Walktrap

The original Walktrap algorithm [9] simulates an edge crossing random walker in order to estimate the stationary distribution of a homogeneous Markov chain. The walker can commence from any vertex and cross edges by randomly selecting destination vertices, systematically moving to vertices with high edge density as vertices are selected with probability proportional to their degree. Since vertices can be visited an arbitrary number of times, unlike algorithms like BFS and DFS, eventually some patterns in the vertex visiting sequence will emerge. As a community is from a structural perspective, essentially a locally dense graph segment, the walker is more likely to move along vertices belonging to the same community for a large time interval before moving to another community. Thus, analysis of the vertex sequence generated by the random walker can reveal the underlying graph community structure. The Walktrap algorithm is outlined in Algorithm 1.
Algorithm 1: Deterministic Walktrap
Require: graph G ( V , E ) , termination criterion τ 0
Ensure: vertex pair sequence s k , s k * is generated
1:
pick a random vertex v
2:
repeat
3:
 pick a neighboring vertex v * with probability proportional to its degree as in (23)
4:
 store current vertex in v and move the walker to the new vertex v *
and s k + 1 , s k + 1 * ( v , v * )
5:
until τ 0 is true
6:
return vertex sequence s k , s k *
The degree of any neighboring vertex can be determined by the graph adjacency matrix A defined as
A i , j = 1 , i = j i , j E 0 , i j i , j E 0 , 1 V × V .
Specifically, the degree of v k is the sum of the k-th column of A . The probability that from vertex v p a neighbor v q is selected at the next step is directly proportional to
deg v q v p , v j E deg v j = 1 ̲ V T A e ̲ V q v p , v j E 1 ̲ V T A e ̲ V j .
In contrast to many graph algorithms, each vertex may be visited more than once. In fact, vertices must be visited many times in order for meaningful patterns to emerge regarding community structure. Typically, for deterministic graphs, a constant number of visits per vertex may suffice resulting in a total of O V visits, though techniques exploiting the self-similarity nature of large, scale free graphs may yield somewhat lower bounds of O log 1 + ϵ V , ϵ > 0 . For fuzzy graphs, the linear bound is as of yet unknown as to whether it can be improved.
In a distributed setting such as Hadoop, the Deterministic Walktrap can be scaled up since graph segments can be distributed to the nodes. The map part will be the parallel random walkers crossing edges. If such a walker must cross a segment, it can either bounce back or be transferred to the appropriate node. The reduce part will be the frequency count of a large number of vertex pairs.
Once the random walker has finished crossing the graph, the communities are discovered by means of hierarchical clustering using the frequency of pairs s , s * as weights. It should be noted that other methods such as Hidden Markov Models and text mining techniques dealing with missing values [54] may be used for discerning community patterns in the sequence s k , s k * .

4.2. Fuzzy Walktrap

The Fuzzy Walktrap algorithm has been proposed and analyzed in [10]. Similarly to Algorithm 2, given a vertex v p , each neighbor v q is a candidate for being visited by the Random Walker with probability proportional to its probability of belonging to the fuzzy graph, namely proportional to
h v p , v q v p , v j E ˜ h v p , v j = 1 ̲ V T A F e ̲ V q v p , v j E 1 ̲ V T A F e ̲ V j ,
where the fuzzy adjacency matrix A F defined as
A F i , j = 1 , i = j , h e i j , e i j = v i , v j E ˜ 0 , v i , v j E ˜ . 0 , 1 V × V ,
Fuzzy Walktrap is outlined in Algorithm 2.
Algorithm 2: Fuzzy Walktrap
Require: fuzzy graph G ( V , E ˜ , h ) , termination criterion τ 0
Ensure: vertex pair sequence s k , s k * is generated
1:
pick a random vertex v
2:
repeat
3:
 pick a neighboring vertex v * with probability proportional to its cost as in (24)
4:
 store current vertex in v and move the walker to the new vertex v *
and s k + 1 , s k + 1 * ( v , v * )
5:
until τ 0 is true
6:
return s k , s k *

4.3. Markov Walktrap and Chebyshev Walktrap

Both the Markov Walktrap, proposed in [10], and Chebyshev Walktrap, introduced in this article, algorithms improve Fuzzy Walktrap in two ways. The first is that during the walking phase the Random Walker has two optional safeguards against being confined inside a community for too long. Both of these safeguards are common for Markov Walktrap and Chebyshev Walktrap. The second improvement is that, during the clustering phase, two communities may not merge if the path lengths within the resulting community exceed a certain threshold. The latter is based on first order statistics for the Markov Walktrap and on second order statistics for the Chebyshev Walktrap.
Regarding the control of community merge, for community V k , the mean path cost π k is the sum of the individual edge costs
π k = 1 V k e j V k δ e j ,
and, therefore, is also a random variable. By linearity of expectation,
E π k = E 1 V k e j V k δ e j = E k V k E δ e j .
Moreover, as δ e j are independent,
Var π k = Var 1 V k e j V k δ e j = E k V k 2 Var δ e j .
In the general case, the distribution of δ e j is unknown, for specific choices of h, it can be computed or estimated. Alternatively, since π k is a sum of random variables, its distribution may be known for certain special cases. For instance, the sum of independent Poisson random variable is another Poisson random variable. In addition, the sum of independent binomial random variables is also a binomial random variable. Finally, the sum of a large number of independent random variables with finite variance is a normal random variable according to the Central Limit Theorem. Nonetheless, in this article, no such assumptions were made and the expected value and the variance of π k were approximated by the Jensen inequality and the delta method, respectively, in Equations (13) and (18).
Since π k is by construction positive, the Markov inequality can be applied. Therefore,
prob π k γ 0 E π k γ 0
or, equivalently,
prob π k E π k γ 0 γ 0 .
If, for a threshold α 0 0 , 1 π k exceeds α 0 γ 0 , then that community is excluded from merging for an iteration provided it has at least ξ 0 vertices. This is a first order probabilistic safeguard preventing almost formed communities from losing their coherence.
On similar grounds, a second order such safeguard can be built on the Chebyshev inequality
prob π k E π k γ 1 Var π k 1 γ 1 2 .

4.4. Escape Strategies

Although the purpose of the Random Walker is to discover communities by repeatedly visiting neighboring vertices and crossing low cost edges, it is possible to be trapped inside a community if the latter is connected only through very high cost edges from the remaining graph. To this end, the Random Walker has the option as in [15] to reverse its strategy and select neighboring vertices with probability inversely proportional to their probability of existence if a random flag is triggered. The latter was implemented as a Bernoulli random variable with success probability q 0 , which can be set to zero if so desired in order to disable the weight inversion strategy. Recommended values are typically O V ϵ , ϵ 2 . Therefore, as in [15], when weight inversion is enabled, the distance d between communities implicitly depends on terms of the form
d e k E ˜ δ e k e j E ˜ δ e j ,
instead of terms of the form
d e k E ˜ δ e k .
Alternatively, a probabilistically triggered restart of the Random Walker akin to the PageRank teleportation [28], the random mutation operator in a genetic algorithm [55,56], or the restart strategy in GMRES iterative solver for linear systems [57,58] was also considered. The relocation probability q 1 has a Bernoulli distribution and is evaluated independently at each step. The number of steps before such a relocation takes place is finite however small q 1 may be as long as it remains strictly positive. The number of steps N to the first relocation has a geometric distribution with success probability equal to q 1 with mass distribution function
prob N = k = q 1 1 q 1 k , k 0 .
Therefore, the expected value and variance of N, respectively, are
E N = 1 q 1 and Var N = 1 q 1 q 1 2 .
Even though the relocation modification clearly violates the inherent locality of the Walktrap family of heuristics, if properly calibrated, it happens infrequently enough so as not to severely degrade time performance. Moreover, in a distributed system, a simple move of the random walker to the appropriate graph segment suffices and its cost is certainly affordable. Moreover, since relocation is a rare event, the total number of relocations can be modeled by a Poisson distribution.

5. Analysis

5.1. Data

In order to experimentally evaluate the performance of Markov Walktrap, a Kronecker synthetic graph [59,60,61] has been created. Kronecker graphs are recursively constructed from an original generator graph with the following model
A 0 = A , A k + 1 = A k A , n 1 ,
where A 0 is the generator graph and ⊗ denotes the Kronecker tensor product.
The generator matrix was
A 0 = 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 ,
which has p = 7 vertices numbered from 0 to 6 and each v k is connected to v k 1 , v k 2 , and v k 3 , where
k 1 = k 1 mod p , k 2 = k + 1 mod p , k 3 = k + 2 mod p .
The generator graph is shown in Figure 1.
The Kronecker model of Equation (36) has been executed six times aiming to obtain a large graph Y whose properties are summarized in Table 2.
Definitions 5 and 6 outline four important structural graph metrics.
Definition 5.
The (log)density of a fuzzy graph is the ratio of the (logarithm of the) number of its edges to the (logarithm of the) number of its vertices.
ρ 0 = E V = V 1 + ϵ V 1 + ϵ = V ϵ , 0 ϵ 1 , ρ 0 = log E log V = log V 1 + ϵ log V = 1 + ϵ .
Definition 6.
The (log)completeness of a fuzzy graph is the ratio of the (logarithm of the) number of its edges to the (logarithm of the) number of the edges of the complete graph with the same number of vertices.
σ 0 = E V 2 = 2 E V V 1 2 E 2 V 2 = 2 V ϵ 1 , σ 0 = log E log V 2 = log E log V V 1 2 = log E log V + log V 1 1 log E 2 log V = ρ 0 2 = 1 + ϵ 2 .
Observe that Y is highly connected as it has a low diameter of a relatively low cost, high average degree, and a high number of triangles and squares.

5.2. Time and Memory Requirements

In Table 3, the total execution time for Chebyshev Walktrap (CW), Markov Walktrap (MW), Fuzzy Walktrap (FW), and Fuzzy Newman–Girvan (FN-G) is shown. The last two algorithms are outlined in [10], whereas Fuzzy Markov was proposed in [15]. The effect of the escape mechanisms of weight inversion (I) and relocation (R) are also shown for Markov Walktrap and Chebyshev Walktrap. The Fuzzy Newman–Girvan is an exhaustive algorithm that will serve as baseline both for the requirements and the clustering quality. For the Walktrap algorithms, the time for the two phases, namely random walking (RW) and community building (CB), are recorded separately, while for the Fuzzy Newman–Girvan case only, the total time is recorded as there is only a single phase. In addition, the last column of Table 3 lists the number of the vertices visited by the random walker.
Fuzzy Newman–Girvan is considerably slower than any member of the Walktrap family of algorithms. This can be attributed to the exhaustive nature of Fuzzy Newman–Girvan as well as to the extensive use of locality by the Walktrap family. Moreover, the probabilistic constraints of Markov Walktrap and Fuzzy Walktrap resulted in the acceleration of both phases of the respective algorithms, with the second order constraints yielding the lowest times in each case. Concerning the escape strategy of the random walker, the relocation option resulted in a slower walking phase but in an accelerated community building phase, with that combination being more efficient than both weight inversion and the combination of the two escape strategies. Omitting an escape strategy is not advisable. Therefore, it is not recommended to activate both escape strategies at the same time. At any rate, the Chebyshev Walktrap with relocation (CW+R) had the best overall performance tagged along in a close manner by the Markov Walktrap with relocation (MW+R). The original Fuzzy Markov being the tardiest member of the family.
An explanation for the time achieved under the relocation strategy is that the teleportation of the random walker results in cache misses, which translates to expensive fetch cycles in the memory hierarchy system. This can be seen in the last two columns of Table 3, as there is not a clear correspondence between the number of total visits and the total walking phase time. When relocation is enabled, the mean visit time is clearly higher. At any rate, the number of visits is linear in the vertex set cardinality.
In addition, the selection of h did not appear to have a significant performance impact, although in most cases the random walker was slower when h was a Poisson random variable both in terms of time and in terms of total visits. This can be attributed to the large number of high cost edges which forced the walker to bounce more times inside a community before eventually moving to another. On the other hand, the symmetric form of the binomial distribution mass function resulted in a larger number of low cost edges, facilitating the movement of the random walker and making the communities easily separable compared to the Poisson case.
The memory requirements were monitored with the Ubuntu Watch administrative tool as presented in Table 4. In contrast to other similar tools such as htop, Watch generates a text output which can be parsed and analyzed. It was periodically ran every 10 s through a bash script resulting in records of several thousand of entries each.
Fuzzy Newman–Girvan consumes more memory than any other algorithm presented in this article by far. However, it utlilizes the memory constantly and consistently, as denoted by the relatively low standard deviation. This is an important performance feature for operating systems process schedulers [46]. On the other hand, the Walktrap family exploits graph caching. This in turn translates to lower traffic between the disk and the memory, as Neo4j is not a memory-only database, as well as to fewer synchronization and serialization operations. When the relocation strategy is selected, then memory utilization has certain spikes, as it can be inferred from the increased maximum memory occupied and the increased standard deviation. This is a direct result of the random walker teleportation which temporarily annuls any scheduling optimization as well as any caching done at the software or hardware level.

5.3. Community Coherence

The following definition will facilitate further analysis of the experimental results.
Definition 7.
The (log)scree plot of a set S is the plot of the (logarithm of the) values of S versus their sorted frequency.
Since Y does not contain ground truth communities, the communities obtained by the Fuzzy Newman–Girvan will be used as a baseline reference since their sizes are closer to a power law distribution, which is an essential feature of large, scale-free graphs. The deviation ξ of a set of numbers x k = 1 n from a power law
f k = α 0 k γ 0 , 1 k n , α 0 > 0 , γ 0 > 0
is quantified by the formula [62,63]
ξ = 1 n k = 1 n log f k log α 0 γ 0 log k 2 ,
where parameters α 0 and γ 0 can be estimated by, for instance, a least squares method [25]. Additionally, the estimated value of α 0 serves as a quality indicator, as it should be as close to [ 2 , 3 ] as possible.
The number of communities for each algorithm are shown in Table 5. Notice that this is not an absolute clustering quality metric, as typically a large number of coherent communities is preferable to a smaller number of sparse ones. Nonetheless, the introduction of the relocation strategy systematically pushes the number of communities towards the reference number, although more evidence is required for determining community coherence. This will be addressed by the two asymmetric indices of this section.
In order to evaluate the clustering quality, the Kullback–Leibler divergence between the sorted sizes of the communities generated by the Fuzzy Newman–Girvan and the sorted community sizes of the remaining algorithms was computed. Recall that for two discrete distributions p k and q k the Kullback–Leibler divergence is defined as
p | | q = k = 1 n p k log p k q k = k = 1 n p k log p k k = 1 n p k log q k ,
where k ranges over the union of discrete events. If p k and q k have no events in common, then the result is undefined. If for a single event p k = 0 or q k = 0 , then the corresponding summand is zero. Table 6 summarizes the divergence for the Poisson and the binomial cases.
Chebyshev Walktrap with relocation outperforms the remaining algorithms, as it has less divergence from the reference distribution.
A question at this point is whether a correspondence between the communities returned by each algorithm can be found. The asymmetric Tversky index between two sets T and V is defined as
ν T , V = T V T V + w 1 T V + w 2 V T , w 1 , w 2 > 0 ,
and it quantifies the distance between the template set T and the variant set V. By the very definition of the index, the template set T and the variant set V are not interchangeable, namely ν T , V ν V , T . This agrees with intuition, as it makes sense to ask how much the heuristic results differ from the ground truth community, whereas there is no point in asking the inverse question. On the contrary, with a symmetric distance metric such as, for instance, the Tanimoto similarity coefficient
τ T , V = T V T V = T V T + V T V ,
no distinction can be made between the template and the variant, which can potentially lead to misleading results.
At this point, it should be highlighted that Fuzzy Newman Girvan was executed only once since it is a deterministic algorithm.
Returning to Label (44), the case w 1 + w 2 = 1 is of particular interest in data mining, as it confines the coefficients on the plane which maximizes the minimum distance of T from V. Notice that algebraically this asymmetry stems from both the terms T V and V T , which denote the number of elements of T not found in V and vice versa. Both terms signify in their own way how V is different from T. The former corresponds to the part of V which is missing from T, whereas the latter corresponds to any additions to V. As a rule, T V is more important and, consequently, w 1 > w 2 . As there is no standard rule for selecting w 1 and w 2 , the following two schemes have been used, a linear
w 1 = s 1 + s = 1 1 + 1 s , w 2 = 1 1 + s , 1 s 5
and an exponential
w 1 = e s 1 + e s = 1 1 + e s , w 2 = 1 1 + e s , 0 s 2 .
Observe that in the first case w 1 w 2 = s , while in the second w 1 w 2 = e s , which clearly represents a non-linear scaling of the first case. Furthermore, the second case is considerably biased in favor of T V .
Once for each possible pair of the m ground truth communities T i 1 i m and the n estimated ones V j 1 j n the m n Tversky indices have been computed, the similarity score J s for a given s is computed
J s = 1 m n i = 1 m j = 1 n ν T i , V j .
Summing over the range of s and taking the average, the mean similarity score J ¯ is obtained
J ¯ = 1 s s J s .
The overall similarity scores are shown in Table 7.
Again, Chebyshev Walktrap with relocation outperforms the remaining algorithms as it has the highest similarity with the reference communities. Note that the exponential weighting scheme sharpens the difference between the algorithms by raising the maximum scores and lowering the minimum ones.
For the experiments of the section, the termination criterion τ 0 was chosen to be a user supplied number of iterations, namely V log V . This number of iterations is sufficiently large for generating communities in a reliable way. Moreover, each iteration is very quick, so the overall execution time was kept at an acceptable level despite the large number of iterations.

5.4. Relocations

Analysis is concluded with a summary regarding the relocations made by the Chebyshev Walktrap and the Markov Walktrap.
In Table 8, certain statistics regarding the random walker relocations are shown. Specifically, the first line presents the total number of relocations, whereas the second line shows the number of steps that the random walker makes before being relocated for the first time. Similarly, the last three lines contain the minimum, maximum and average number of steps between two successive relocations, respectively.

6. Conclusions

The primary contribution of this article is the implementation over Neo4j of Chebyshev Walktrap, a community discovery algorithm designed for edge-fuzzy graphs, a class of fuzzy graphs used among others in [10], which is based on the Random Walker algorithmic principle. Additionally, Chebyshev Walktrap relies on the competitive factors of second order statistics though the Chebyshev inequality and on an optional relocation capability in order to bound unnecessarily costly walks and, thus, remaining inside a community and being trapped for too long within the boundaries of a community, respectively. The relocation aspect was also backported to the Markov Walktrap algorithm first proposed in [15]. The effect of relocation on the community coherence was evaluated based on the asymmetric Tversky index using the Fuzzy Newman–Girvan algorithm from [15] as baseline, while its effect on the output distribution was assessed with the asymmetric Kullback–Leibler divergence. The latter was also the basis for evaluating the distance between the community size distribution generated by Fuzzy Newman–Girvan and the one computed by the Makrov Walktrap and the Chebyshev Walktrap. In these cases, the introduction of asymmetry resulted in the clear distinction between the baseline data and their variants.
The test dataset was a large synthetic Kronecker graph whose edge fuzziness was controlled either by a binomial or by a Poisson distribution. In this dataset, our performance metrics showed that Chebyshev Walktrap yields more compact communities whose sizes are more clustered. Additionally, Markov Walktrap is, in many instances, slightly faster at the expense of a somewhat bigger memory footprint.
The experimental results of Section 5 hint at some future research directions. More sophisticated from a probabilistic viewpoint, community discovery algorithms should be able to exploit the asymmetry of the edge fuzziness distribution through higher order concentration inequalities such as the Talagrand inequality, provided their computation is efficient. Moreover, new metrics for community matching, perhaps utilizing functional or semantic information should be developed. Additionally, methodologies for reliably assessing community coherence based on higher order structural or functional interactions should be sought. Finally, more experiments in larger graphs should be conducted in order to determine any inherent scalability limitations.

Author Contributions

Georgios Drakopoulos, Andreas Kanavos, and Konstantinos Tsakalidis conceived of the idea, designed and performed the experiments, analyzed the results, drafted the initial manuscript and revised the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zurek, W.H. Algorithmic Information Content, Church-Turing Thesis, Physical Entropy, and Maxwell’s Demon; Technical Report; Los Alamos National Lab.: Los Alamos, NM, USA, 1990. [Google Scholar]
  2. Brillouin, L. Maxwell’s demon cannot operate: Information and entropy. I. J. Appl. Phys. 1951, 22, 334–337. [Google Scholar] [CrossRef]
  3. Herrmann, D. Heron von Alexandria. In Die antike Mathematik; Springer: Berlin/Heidelberg, Germany, 2014; pp. 257–288. [Google Scholar]
  4. Fouss, F.; Pirotte, A.; Renders, J.M.; Saerens, M. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Trans. Knowl. Data Eng. 2007, 19, 355–369. [Google Scholar]
  5. Sinop, A.K.; Grady, L. A seeded image segmentation framework unifying graph cuts and random walker which yields a new algorithm. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 2007; pp. 1–8. [Google Scholar]
  6. Couprie, C.; Grady, L.; Najman, L.; Talbot, H. Power watersheds: A new image segmentation framework extending graph cuts, random walker and optimal spanning forest. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 27 September–4 October 2009; pp. 731–738. [Google Scholar]
  7. Blondel, V.D.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of community hierarchies in large networks. J. Stat. Mech. Theory Exp. 2008. [Google Scholar] [CrossRef]
  8. Girvan, M.; Newman, M. Community Structure in Social and Biological Networks. Proc. Natl. Acad. Sci. USA 2002, 99, 7821–7826. [Google Scholar] [PubMed]
  9. Pons, P.; Latapy, M. Computing Communities in Large Networks Using Random Walks. Available online: https://arxiv.org/abs/physics/0512106 (accessed on 28 March 2017).
  10. Drakopoulos, G.; Kanavos, A.; Makris, C.; Megalooikonomou, V. On converting community detection algorithms for fuzzy graphs in Neo4j. In Proceedings of the 5th International Workshop on Combinations of Intelligent Methods and Applications (CIMA), Vietri sul Mare, Italy, 9–11 November 2015. [Google Scholar]
  11. Rosvall, M.; Bergstrom, C. Maps of Information Flow Reveal Community Structure in Complex Networks. Technical Report. Available online: https://arxiv.org/abs/0707.0609 (accessed on 28 March 2017).
  12. Drakopoulos, G.; Kanavos, A. Tensor-based Document Retrieval over Neo4j with an Application to PubMed Mining. In Proceedings of the 7th International Conference of Information, Intelligence, Systems, and Applications (IISA 2016), Chalkidiki, Greece, 13–15 July 2016. [Google Scholar]
  13. Panzarino, O.P. Learning Cypher; PACKT Publishing: Birmingham, UK, 2014. [Google Scholar]
  14. Rosenfeld, A. Fuzzy Graphs. Fuzzy Sets Appl. 1975, 513, 77–95. [Google Scholar]
  15. Drakopoulos, G.; Kanavos, A.; Tsakalidis, A. A Neo4j implementation of fuzzy random walkers. In Proceedings of the 9th Hellenic Conference on Artificial Intelligence (SETN 2016), Thessaloniki, Greece, 18–20 May 2016. [Google Scholar]
  16. Robinson, I.; Webber, J.; Eifrem, E. Graph Databases; O’Reilly: Sebastopol, CA, USA, 2013. [Google Scholar]
  17. Fortunato, S. Community Detection in Graphs. Phys. Rep. 2010, 486, 75–174. [Google Scholar] [CrossRef]
  18. Ng, A.Y.; Jordan, M.I.; Weiss, Y. On Spectral Clustering: Analysis and an algorithm. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2001), Vancouver, BC, Canada, 3–8 December 2001. [Google Scholar]
  19. Kernighan, B.; Lin, S. An Efficient Heuristic Procedure for Partitioning Graphs. Bell Syst. Tech. J. 1970, 49, 291–307. [Google Scholar] [CrossRef]
  20. Shi, J.; Malik, J. Normalized Cuts and Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 888–905. [Google Scholar]
  21. Lancichinetti, A.; Fortunato, S. Community Detection Algorithms: A Comparative Analysis. Phys. Rev. E 2009, 80, 056117. [Google Scholar] [CrossRef] [PubMed]
  22. Leskovec, J.; Lang, K.J.; Mahoney, M.W. Empirical Comparison of Algorithms for Network Community Detection. In Proceedings of the 19th International Conference on World Wide Web (WWW 2010), Raleigh, NC, USA, 26–30 April 2010; pp. 631–640. [Google Scholar]
  23. Carrington, P.J.; Scott, J.; Wasserman, S. Models and Methods in Social Network Analysis; Cambridge University Press: Cambridge, UK, 2005. [Google Scholar]
  24. Scott, J. Social Network Analysis: A Handbook; SAGE Publications: Thousand Oaks, CA, USA, 2000. [Google Scholar]
  25. Drakopoulos, G.; Kanavos, A.; Tsakalidis, A. Evaluating Twitter Influence Ranking with System Theory. In Proceedings of the 12th International Conference on Web Information Systems and Technologies (WEBIST), Rome, Italy, 23–25 April 2016. [Google Scholar]
  26. Kontopoulos, S.; Drakopoulos, G. A space efficient scheme for graph representation. In Proceedings of the 26th International Conference on Tools with Artificial Intelligence (ICTAI 2014), Limassol, Cyprus, 10–12 November 2014; pp. 299–303. [Google Scholar]
  27. Langville, A.; Meyer, C. Google’s PageRank and Beyond: The Science of Search Engine Rankings; Princeton University Press: Princeton, NJ, USA, 2006. [Google Scholar]
  28. Page, L.; Brin, S.; Motwani, R.; Winograd, T. The PageRank Citation Ranking: Bringing Order to the Web; Stanford InfoLab: Stanford, CA, USA, 1999. [Google Scholar]
  29. Kleinberg, J.M. Authoritative Sources in a Hyperlinked Environment. In Proceedings of the Symposium of Discrete Algorithms (SODA), San Francisco, CA, USA, 25–27 January 1998; pp. 668–677. [Google Scholar]
  30. Newman, M. Networks: An Introduction; Oxford University Press: Oxford, UK, 2010. [Google Scholar]
  31. Newman, M.E. Fast Algorithm for Detecting Community Structure in Networks. Phys. Rev. E 2004, 69, 066133. [Google Scholar] [CrossRef] [PubMed]
  32. Kafeza, E.; Kanavos, A.; Makris, C.; Chiu, D. Identifying Personality-based Communities in Social Networks. In Proceedings of the Legal and Social Aspects in Web Modeling (Keynote Speech) in Conjunction with the International Conference on Conceptual Modeling (ER) (LSAWM), Hong Kong, China, 11–13 November 2013. [Google Scholar]
  33. Kafeza, E.; Kanavos, A.; Makris, C.; Vikatos, P. Predicting Information Diffusion Patterns in Twitter. In Proceedings of the Artificial Intelligence Applications and Innovations (AIAI), Rhodes, Greece, 19–21 September 2014; pp. 79–89. [Google Scholar]
  34. Kanavos, A.; Perikos, I. Towards Detecting Emotional Communities in Twitter. In Proceedings of the 9th IEEE International Conference on Research Challenges in Information Science (RCIS), Athens, Greece, 13–15 May 2015; pp. 524–525. [Google Scholar]
  35. Kafeza, E.; Kanavos, A.; Makris, C.; Vikatos, P. T-PICE: Twitter Personality based Influential Communities Extraction System. In Proceedings of the IEEE International Congress on Big Data, Anchorage, AK, USA, 27 June–2 July 2014; pp. 212–219. [Google Scholar]
  36. Zamparas, V.; Kanavos, A.; Makris, C. Real Time Analytics for Measuring User Influence on Twitter. In Proceedings of the 27th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), Vietri sul Mare, Italy, 9–11 November 2015. [Google Scholar]
  37. Kanavos, A.; Perikos, I.; Vikatos, P.; Hatzilygeroudis, I.; Makris, C.; Tsakalidis, A. Conversation Emotional Modeling in Social Networks. In Proceedings of the 26th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), Limassol, Cyprus, 10–12 November 2014; pp. 478–484. [Google Scholar]
  38. Kanavos, A.; Perikos, I.; Vikatos, P.; Hatzilygeroudis, I.; Makris, C.; Tsakalidis, A. Modeling Retweet Diffusion using Emotional Content. In Proceedings of the Artificial Intelligence Applications and Innovations (AIAI), Rhodes, Greece, 19–21 September 2014; pp. 101–110. [Google Scholar]
  39. Kephart, J.O.; White, S.R. Directed-graph epidemiological models of computer viruses. In Proceedings of the 1991 IEEE Computer Society Symposium on Research in Security and Privacy, Oakland, CA, USA, 20–22 May 1991; pp. 343–359. [Google Scholar]
  40. Ren, J.; Yang, X.; Yang, L.X.; Xu, Y.; Yang, F. A delayed computer virus propagation model and its dynamics. Chaos Solitons Fractals 2012, 45, 74–79. [Google Scholar] [CrossRef]
  41. Tugnait, J.K.; Luo, W. On channel estimation using superimposed training and first-order statistics. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’03), Hong Kong, China, 6–10 April 2003; pp. 4–624. [Google Scholar]
  42. Tong, L.; Xu, G.; Kailath, T. Blind identification and equalization based on second-order statistics: A time domain approach. IEEE Trans. Inf. Theory 1994, 40, 340–349. [Google Scholar] [CrossRef]
  43. Belouchrani, A.; Abed-Meraim, K.; Cardoso, J.F.; Moulines, E. A blind source separation technique using second-order statistics. IEEE Trans. Signal Process. 1997, 45, 434–444. [Google Scholar] [CrossRef]
  44. Mendel, J.M. Tutorial on higher-order statistics (spectra) in signal processing and system theory: Theoretical results and some applications. Proc. IEEE 1991, 79, 278–305. [Google Scholar] [CrossRef]
  45. Chua, K.C.; Chandran, V.; Acharya, U.R.; Lim, C.M. Application of higher order statistics (spectra) in biomedical signals—A review. Med. Eng. Phys. 2010, 32, 679–689. [Google Scholar] [PubMed]
  46. Drakopoulos, G.; Megalooikonomou, V. An adaptive higher order scheduling policy with an application to biosignal processing. In Proceedings of the 2016 Symposium Series on Computational Intelligence (SSCI 2016), Athens, Greece, 6–9 December 2016; pp. 921–928. [Google Scholar]
  47. Comon, P. Independent component analysis, a new concept? Signal Process. 1994, 36, 287–314. [Google Scholar] [CrossRef]
  48. Delorme, A.; Sejnowski, T.; Makeig, S. Enhanced detection of artifacts in EEG data using higher-order statistics and independent component analysis. Neuroimage 2007, 34, 1443–1449. [Google Scholar] [CrossRef] [PubMed]
  49. Priest, D.M. Algorithms for arbitrary precision floating point arithmetic. In Proceedings of the 10th IEEE Symposium on Computer Arithmetic, Grenoble, France, 26–28 June 1991; pp. 132–143. [Google Scholar]
  50. Drakopoulos, G. Tensor fusion of affective Twitter metrics in Neo4j. In Proceedings of the 7th International Conference of Information, Intelligence, Systems, and Applications (IISA 2016), Chalkidiki, Greece, 13–15 July 2016. [Google Scholar]
  51. Drakopoulos, G.; Megalooikonomou, V. Regularizing Large Biosignals with Finite Differences. In Proceedings of the 7th International Conference of Information, Intelligence, Systems, and Applications (IISA 2016), Chalkidiki, Greece, 13–15 July 2016. [Google Scholar]
  52. Drakopoulos, G.; Kontopoulos, S.; Makris, C. Eventually consistent cardinality estimation with applications in biodata mining. In Proceedings of the 31st Annual ACM Symposium on Applied Computing, Pisa, Italy, 4–8 April 2016; pp. 941–944. [Google Scholar]
  53. Hamming, R.W. On the distribution of numbers. Bell Syst. Tech. J. 1970, 49, 1609–1625. [Google Scholar]
  54. Aggarwal, C.C.; Zhai, C. Mining Text Data; Springer Science and Business Media: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
  55. De Jong, K. Learning with genetic algorithms: An overview. Mach. Learn. 1988, 3, 121–138. [Google Scholar] [CrossRef]
  56. De Jong, K.A.; Spears, W.M. Using Genetic Algorithms to Solve NP-Complete Problems. In Proceedings of the ICGA, Fairfax, VA, USA, 4–7 June 1989; pp. 124–132. [Google Scholar]
  57. Saad, Y.; Schultz, M.H. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput. 1986, 7, 856–869. [Google Scholar] [CrossRef]
  58. Morgan, R.B. Implicitly restarted GMRES and Arnoldi methods for nonsymmetric systems of equations. SIAM J. Matrix Anal. Appl. 2000, 21, 1112–1135. [Google Scholar] [CrossRef]
  59. Leskovec, J.; Chakrabarti, D.; Kleinberg, J.; Faloutsos, C.; Ghahramani, Z. Kronecker graphs: An approach to modeling networks. J. Mach. Learn. Res. 2010, 11, 985–1042. [Google Scholar]
  60. Leskovec, J.; Kleinberg, J.; Faloutsos, C. Graphs over time: Densification laws, shrinking diameters and possible explanations. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD05), Chicago, IL, USA, 21–24 August 2005; pp. 177–187. [Google Scholar]
  61. Tsourakakis, C.E. Fast counting of triangles in large real networks without counting: Algorithms and laws. In Proceedings of the ICDM, Pisa, Italy, 15–19 December 2008; pp. 608–617. [Google Scholar]
  62. Drakopoulos, G.; Kanavos, A.; Makris, C.; Megalooikonomou, V. Finding fuzzy communities in Neo4j. In Smart Innovation, Systems, and Technologies; Howlett, R.J., Jain, L.C., Eds.; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  63. Drakopoulos, G.; Baroutiadi, A.; Megalooikonomou, V. Higher order graph centrality measures for Neo4j. In Proceedings of the 6th International Conference of Information, Intelligence, Systems, and Applications (IISA 2015), Corfu, Greece, 6–8 July 2015. [Google Scholar]
Figure 1. The generator graph.
Figure 1. The generator graph.
Algorithms 10 00040 g001
Table 1. Paper notation.
Table 1. Paper notation.
SymbolMeaning
= Definition or equality by definition
Kronecker tensor product
s 1 , , s n Set with elements s 1 , s 2 , , s n
· Set cardinality or path length (depending on the context)
e 1 , , e m Path comprised of edges e 1 , , e m
K n Complete graph with n vertices and n 2 edges
E X Mean value of random variable X
Var X Variance of random variable X
τ T , V Tanimoto similarity coefficient for sets T and V
ν T , V Asymmetric Tversky index for sets T and V
S 1 S 2 Asymmetric set difference S 1 minus S 2
S ˜ Fuzzy set S
s k Sequence of elements s k
H s 1 , , s n Harmonic mean of elements s 1 , , s n
H s 1 , , s n ; τ 0 Thresholded or effective harmonic mean of s 1 , , s n
1 ̲ n n × 1 vector with ones
e ̲ n k n × 1 zero vector with a single one at the k-th entry
f n x n-th order derivative of f x
p | | q Kullback–Leibler divergence between distributions p and q
Table 2. Structural properties of graph Y.
Table 2. Structural properties of graph Y.
PropertyValuePropertyValuePropertyValuePropertyValue
Vertices117,649 σ 0 7.68 × 10 4 Triangles37127Squares1981
Edges5,315,625 σ 0 0.6631 min cost 5.1891 max cost 10.7232
ρ 0 45.1821 Diameter length19avg cost 72.1149 avg cost 106.0012
ρ 0 1.3263 Diameter cost 124.4021 max cost 143.2716 max cost 198.2221
Table 3. Performance in terms of time (sec) and vertex visits.
Table 3. Performance in terms of time (sec) and vertex visits.
AlgorithmDistributionRW (s)CB (s)Total (s)Visits
CWPoisson 128.1381 2000.9831 2129.1212 471,858
CWBinomial 131.2182 2000.0304 2131.2486 470,342
CW + IPoisson 144.4752 1911.9118 2056.3870 477,423
CW + IBinomial 139.9916 1904.0216 2044.0132 477,216
CW + RPoisson 157.0104 1840.0013 1997.0117 476,997
CW + RBinomial 157.6633 1850.1003 2007.7636 476,957
CW + IRPoisson 164.3779 2002.7745 2167.1524 477,762
CW + IRBinomial 162.0222 1911.8664 2073.8886 477,002
MWPoisson 157.5248 2104.9918 2262.5166 475,308
MWBinomial 156.9924 2099.5256 2256.5180 475,121
MW + IPoisson 165.3374 1908.6612 2073.9986 478,313
MW + IBinomial 163.0015 1902.4319 2065.4334 478,102
MW + RPoisson 173.9016 1850.0971 2023.9987 477,916
MW + RBinomial 171.0017 1840.0054 2011.0071 477,831
MW + IRPoisson 181.0013 2000.9917 2181.9930 478,514
MW + IRBinomial 178.0017 2000.8585 2178.8602 478,333
FWPoisson 191.9989 2425.1121 2617.1110 515,444
FWBinomial 184.0451 2417.3376 2601.3827 514,312
FN-GPoisson 9322.9514
FN-GBinomial 9344.7778
Table 4. Performance in terms of memory (rounded in MBs).
Table 4. Performance in terms of memory (rounded in MBs).
AlgorithmDistributionminmaxmeanstd
CWPoisson412867424892322
CWBinomial412867444880324
CW + IPoisson412867394886335
CW + IBinomial412867444881338
CW + RPoisson412880025113427
CW + RBinomial412880005121422
CW + IRPoisson412880015208345
CW + IRBinomial412880025214339
MWPoisson412867444800331
MWBinomial412867404881325
MW + IPoisson412867394879336
MW + IBinomial412867514883335
MW + RPoisson412880045112428
MW + RBinomial412880025108428
MW + IRPoisson412880025200343
MW + IRBinomial412880005204341
FWPoisson412867504912281
FWBinomial412867424910280
FN-GPoisson819212,40211,012278
FN-GBinomial819212,46411,121280
Table 5. Number of communities.
Table 5. Number of communities.
CWCW + ICW + RCW + IRMWMW + IMW + RMW + IRFWFN-G
Poisson13,751137,5813,33213,44313,76113,78913,45613,80414,12712,816
Binomial18,84118,91216,09017,62118,87718,80117,00218,81118,89115,117
Table 6. Kullback–Leibler divergence.
Table 6. Kullback–Leibler divergence.
CWCW + ICW + RCW + IRMWMW + IMW + RMW + IRFW
Poisson 0.6409 0.6311 0.2766 0.4504 0.3826 0.3911 0.3281 0.4519 0.8012
Binomial 0.3885 0.3977 0.3519 0.3700 0.5804 0.5687 0.6140 0.5626 0.6748
Table 7. Tversky index.
Table 7. Tversky index.
LinearCWCW + ICW + RCW + IRMWMW + IMW + RMW + IRFW
Poisson 0.6199 0.6126 0.6616 0.6012 0.5881 0.5902 0.6211 0.5577 0.2742
Binomial 0.4323 0.4153 0.6059 0.5853 0.5648 0.5702 0.5814 0.5332 0.2131
ExponentialCWCW + ICW + RCW + IRMWMW + IMW + RMW + IRFW
Poisson 0.5505 0.5345 0.7503 0.5462 0.5271 0.5383 0.7354 0.4811 0.1703
Binomial 0.5230 0.4841 0.6992 0.4737 0.5102 0.4890 0.6772 0.4283 0.1523
Table 8. Relocation summary.
Table 8. Relocation summary.
CW + RMW + R
Number of relocations1814
First relocation step10144
min between relocations17,81232,991
max between relocations27,09964,818
mean between relocations21,00238,002

Share and Cite

MDPI and ACS Style

Drakopoulos, G.; Kanavos, A.; Tsakalidis, K. Fuzzy Random Walkers with Second Order Bounds: An Asymmetric Analysis. Algorithms 2017, 10, 40. https://doi.org/10.3390/a10020040

AMA Style

Drakopoulos G, Kanavos A, Tsakalidis K. Fuzzy Random Walkers with Second Order Bounds: An Asymmetric Analysis. Algorithms. 2017; 10(2):40. https://doi.org/10.3390/a10020040

Chicago/Turabian Style

Drakopoulos, Georgios, Andreas Kanavos, and Konstantinos Tsakalidis. 2017. "Fuzzy Random Walkers with Second Order Bounds: An Asymmetric Analysis" Algorithms 10, no. 2: 40. https://doi.org/10.3390/a10020040

APA Style

Drakopoulos, G., Kanavos, A., & Tsakalidis, K. (2017). Fuzzy Random Walkers with Second Order Bounds: An Asymmetric Analysis. Algorithms, 10(2), 40. https://doi.org/10.3390/a10020040

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop