1 Introduction

We live in a networked world where almost everything can be represented as a network, including online social networks (OSNs) [12], a virus outbreak [54], food recipes [61], and a city’s water supply system [20]. One attribute of complex networks is the formation of communities [29], which are clusters of relatively densely connected vertices with respect to the rest of the network [29]. For instance, a group of OSN users who share a common subject of interest [44], a team of coworkers, exposing each other to virus transmission [31], a family of ingredients from a certain cuisine [2], or a city neighborhood corresponding to its water supply system [14]. Analyzing such a community-structured network can help us gain meaningful insights into the objects represented by communities. For example, the detection of OSN communities promoting violent extremism and radicalization [9], finding a group with a high potential to cause a pandemic outbreak [65], recommending recipe-enhancing ingredients substitution [61], and locating a neighborhood likely to suffer from a water supply breakdown [52].

The ability to detect anomalous communities is crucial to the deduction of insights from community-structured networks [36]. Such insights could help humanity cease a pandemic by an early reveal of a virus’ hot spots [28], identify groups of fake profiles spreading fake news [15], or prevent targeted violence toward minorities by uncovering hatred-inciting communities in an online social network [43].

Over the last two decades, both industry and academic researchers proposed various solutions to address the problem of anomaly detection in networks [3, 25, 39, 40, 50, 51], aiming to utilize them to gain insights into the analyzed networks. Altshuler et al. [5] demonstrated that by applying an anomaly detection algorithm on call recording logs of a country’s mobile network, they could classify events appearing in a particular time period as emergencies or not. While the majority of the conducted studies are mainly focused on uncovering anomalous vertices, only a handful focus on detecting anomalous communities [7, 13, 33, 45, 48, 53, 59, 70].

Most of these methods fail in scenarios where the anomalous communities are concealed properly in the background network, having similar properties as the rest of the communities. For example, methods based on density and fraction of cross-boundary edges, such as Conductance [30], tend to achieve poor results as the anomalous communities become either more sparse or harder to separate from other communities.

In this study, we introduce a novel generic network analysis and machine-learning-based algorithm to detect anomalous communities in complex networks. Motivated by Kagan et al. [40], demonstrating anomalous vertices could be determined by the number of improbable edges they have, we have hypothesized a community composed of many unexpected vertices, has a higher chance of being anomalous.

The approach we employ to test our hypothesis undergoes a classification problem. We predict the affiliation probability of vertices to their communities, and aggregate the resulting probabilities to determine the “normality” of each community. We adopt a novel concept of utilizing the information of vertices’ co-membership between communities, by formulating the aforementioned classification problem as a link-prediction problem in a new utility network, in which vertices are connected to community-representing vertices if they belong to the corresponding communities in the original network. More simply stated, we create a new network encapsulating the information of co-membership and then applies an anomaly detection algorithm to it.

The two significant advantages of our method are: (a) It utilizes only information of vertices’ co-membership to communities, which makes it agnostic to the density of communities. It can only be affected by the fraction of cross-boundary edges they contain, and specifically, to improve as it increases; and (b) the co-membership information is converted to a network, of which only its structural features are extracted and utilized, which makes it independent of a specific domain. Our method succeeds in cases when communities are hard to separate from the background. Additionally, it is generic, unlike most of the studies in this field (see Sect. 2.2).

To evaluate our algorithm, we utilized two types of labeled datasets (see Sect. 4.1): (a) Fully simulated random generated networks, infused with anomalous communities; and (b) real-world networks infused with anomalous communities, that is, forcibly connecting randomly generated anomalous communities to other real communities in the real-world data. Additionally, we applied our method to two unlabeled real-world networks collected from RedditFootnote 1 and Hebrew WikipediaFootnote 2 revisions information. The results demonstrate that our method successfully identifies anomalous communities in all cases. In the first two cases, where simulated data or real-data perturbation are involved, causing them to be labeled datasets, our algorithm outperformed the baselines when the anomalous communities were sparse, small, and contained many cross-boundary connections. In the first unlabeled case, Reddit, our algorithm was able to identify two communities (subreddits) that presented peculiar activity, such as an utter failure of collaboration. In Wikipedia, where we depict an article as a community and the Wikipedians (editors) who edit it as vertices, our algorithm uncovered articles tainted with political agenda or violent content due to trolling.

The key contributions of our study are threefold:

  • We have developed a novel generic algorithm for uncovering anomalous communities in complex networks and demonstrated that our algorithm could uncover real-world anomalous communities.

  • We present a novel random network generation algorithm, which generates a random community-structured network, infused with anomalous communities. The algorithm includes the ability to control the number, size, and type (network generating algorithm) of the normal communities and the infused anomalous communities. The algorithm is well-suited to conduct research in the field of anomalous community detection in networks.

  • We have developed and published online an open-source code of this study’s framework as well as the labeled datasets we created and utilized throughout this study to facilitate future research in the field of anomalous community detection in complex networks.

The remainder of this paper is organized as follows: Sect. 2 provides a brief overview of previous studies on anomalous vertices detection in networks, and on anomalous subgraphs and communities detection in networks. In Sect. 3, we describe in detail our anomalous community detection method and our anomaly-infused community-structured random network generator. In Sect. 4 we describe the collection and creation of the networks’ datasets used to evaluate our algorithm, and the experiments performed to evaluate it. In Sect. 5, we report our study’s results. In Sect. 6, we analyze the evaluation results and insights and discuss our algorithm’s limitations. Lastly, in Sect. 7, we present our conclusion and offer future research directions.

2 Related Work

Detecting anomalies in network-based data is an essential task for countless applications and areas [4]. In recent times, the ability to detect anomalous communities in networks, rather than stand-alone anomalies, became a high-impact task as well [48]. The following section surveys studies on anomalous vertices detection and anomalous communities detection in networks.

2.1 Anomalous Vertices Detection

Research and technology in the field of anomaly detection in networks have significantly evolved in the last two decades [4]. Noble and Cook [50] were one of the first to study anomaly detection in network-based data. Their method assumed infrequently occurring substructures indicate anomalous behavior, while normal substructures reoccur much more often. Jimeng et al. [39] proposed a method for detecting anomalous vertices in bipartite networks, where they calculated relevance scores between vertices and aggregated the results to one score per vertex, where a low score indicated an anomaly. Papadimitriou et al. [51] presented a method to identify anomalies in a web network by comparing consecutive pairs of snapshots of the network by calculating similarity measures between them, where scores that were too low or high indicated an anomaly. In the same year, Akoglu et al. [3] proposed OddBall, a feature-based method to detect outliers in weighted networks. They chose pairs of features extracted from a vertex’s egonet, whose patterns obey power-laws. Vertices with significant deviation from the patterns are considered outliers. Fire et al. [25] presented a method for detecting fake profiles on online social networks based on anomalies in a fake user’s social structure, namely the topology of the network. The method followed the intuition a fake profile randomly connects to other users in the network. Kagan et al. [40] proposed a generic unsupervised algorithm able to detect anomalous vertices based on network topological features alone, with the idea a vertex with many improbable edges has a higher likelihood of being anomalous. Ding et al. [21] introduced an interactive deep-learning-based approach, which allows the system to proactively communicate with the end-user to mitigate the lack of labeled data and enhance the anomaly detection performance. Recently, Gutiérrez-Gómez et al. [34] proposed MADAN, a parallelized method to rank and localize outlier vertices within their contexts, at different scales of a network, where they utilize heat kernel to smoothen signals around vertices at each scale, and the remaining highly concentrated signals after smoothing point to anomalous vertices.

2.2 Anomalous Subgraphs and Communities Detection

In recent years, due to the increase in volume and sophistication of cyber-threats [37], the ability to detect a group of entities whose linkage is abnormal regarding the other network’s edges, namely, the detection of anomalous communities, has become a necessity and a valuable field of research [48].

Singh et al. [59] were the first to address the problem of anomalous subgraph detection rather than single anomalous vertices detection, by utilizing an approach from the field of signal processing, by detecting signal in noise—to detect un-hinted anomalous subgraphs in a background network. More specifically, they applied sparse principal component analysis to the network’s modularity matrix and checked for substantial deviation in the results compared to the expected results of a random subgraph. The method was tested on a simulated network.

Miller et al. [48] studied a similar direction by adopting another method from the field of signal processing and proposed several algorithms based on the spectral properties of the principal eigenspace of a network’s residuals matrix. The algorithms analyze the residuals, comparing the residuals of an observed random subgraph to its expected value to find outliers and were able to demonstrate the detection of small, highly anomalous subgraphs, in real-world networks. In the same year, Gupta et al. [33] proposed SODA. Given a set of queried subgraphs, which are a subset of a network, Gupta et al. classified them as anomalous or not, and constructed an attribute-based classifier utilizing linear programming methods, namely, SIMPLEX. They used the classifier to predict the edge existence probability between each pair of vertices in each queried subgraph. They gave each subgraph an “outlireness” score based on the number of unexpected existing edges and the number of missing expected edges. Subgraphs with “outlireness” scores above some threshold were classified as anomalous subgraphs. Gupta et al. have achieved a precision score of 0.881 on a real-world dataset infused with generated anomalous subgraphs. As they lacked a labeled real-world dataset, they manually checked the highest-ranked subgraphs and reported to find interesting outliers.

Bridges et al. [13] proposed GBTER, a generalized version of the BTER model [57], which is a generative model that simulates a real-world community-structured network, assuming each vertex belongs to a single community. Additionally, they used a method of computing the probability distribution of a network given a generative model, which could derive the probabilities of the network’s subgraphs. Bridges et al. [13] used the GBTER model to simulate a normal network and a network infused with an anomalous community, and compared the anomaly-infused network probability distribution to the expected distribution of the normal network. The comparison was used to detect the existence of an anomaly in the network, and by examining the probabilities of the subgraphs, they detected the anomalous subgraph. Bridges et al. [13] have achieved an AUC score of 0.936 in an experiment conducted on a small simulated network.

Yu et al. [69] proposed GLAD, a generative approach incorporating ideas from both MMSB and LDA models, that enables the utilization of two forms of data acquired from a network: (1) point-wise, i.e., a single vertex’s attributes, and (2) pair-wise, i.e., the relationships between vertices. GLAD learns the parameters of the distributions that describe the latter two forms of data. Then, they rank the communities by their similarity to the learned distributions. They achieve accuracy ranging from 0.8 to 1 on a synthetic dataset and 0.25–0.45 on a real-world DBLP dataset, where publications of a specific conference are treated as normal communities and publications of other conferences are considered anomalous.

Zhao et al. [70] referred to the task of detecting an anomalous subgraph as an optimization problem that tries to find a subset of vertices that maximizes overall abnormalities. The main contribution of the research was the introduction of parallel computation of such a problem. Kumar et al. [42] studied sockpuppetry in discussion communities, where they discovered sockpuppets behave differently from benign users. One example they discovered was the more clustered ego-networks are, the more likely they are to interact with each other. In the same year, Zheng et al. [71] presented ELSIEDET, a three-stage sybil detection scheme identifying “elite” sybil users participating in the campaigns. The elite sybil users are highly rated accounts utilized to generate trustworthy and realistic-looking reviews.

Perozzi and Akoglu [53] proposed AMEN, an algorithm for ranking communities by a proposed “normality” measure, based on communities’ coherency, that is, internally consistent, and externally separated from their boundaries, based on attributes and topological structure means. Perozzi et al. tested their algorithm on several real-world datasets while introducing labeled anomalous communities by disordering the topological structure of chosen communities and changing their vertices’ attributes assignment. They have achieved AUC scores ranging from 0.18 to 0.60 on the different datasets.

Bansal and Sharma [7] presented ADENMN, which treated attributed networks as multiplex networks by splitting a network into different network layers, where each layer represented the network created by a certain attribute of the original network. By assigning the same latter “normality” score to each community at each layer, and accumulating the “layer activity”-weighted scores to uncover anomalous communities. They achieved MAP scores ranging from 0.22 to 0.51 on real-world datasets injected with anomalous communities created similarly to the latter.

Luan et al. [45] proposed RM-CNN, a convolutional neural network classifying whether a network contains an anomalous community given an expected degrees model. They supplied the model with the residuals produced by subtracting the expected adjacency matrix of a random network generated by a given random network generating algorithm with a certain set of parameters from the actual adjacency matrix. They evaluated their model by utilizing a dataset composed of simulated networks, where the anomaly-containing networks contain a dense community generated by Erdős–Rényii [22] random network generating algorithm embedded in the background network and achieved AUC scores ranging from 0.89 to 1.00.

3 Methods

In this study, we apply concepts from the domains of Graph Theory, Complex-Networks Analysis, and Supervised-Learning as building blocks of CMMAC algorithm, whose purpose is to uncover anomalous communities in complex networks. In the following subsections, we describe, in detail, the phases of our anomalous communities detection algorithm and our anomaly-infused community-structured random network generator.

3.1 Anomalous Communities Detection Algorithm

CMMAC requires two preliminary steps, (1) Community-detection in the examined network, whose results are stored as a partition map,Footnote 3 and (2) Splitting the network into train and test sets that share common structural properties, to allow training on part of the network and detect anomalous communities on the rest of network. We separated the network by splitting the partition map into two partition maps, such that the two resulting sets do not share common communities but may share common vertices.

According to our hypothesis, the task of detecting anomalous communities in networks requires the following main steps: (a) We begin by utilizing the two input partition maps to construct two bipartite networks, where each is composed of a group of vertices that “represent” communities in the original network, “regular” vertices, and edges denoting a “regular” vertex belongs to a community in the original network (see Sect. 3.1.1), (b) we then extract topological features of the newly created network and utilize the train set topological features to train a link-prediction classifier (see Sect. 3.1.2), and (c) lastly, based on the aggregation of the link-prediction classifier probability results of the test set, we extract meta-features and rank them. As anomalous communities tend to contain an improbable set of vertices, the corresponding anomalous community-representing vertices are more likely found at the bottom margin of the ranked meta-features (see Sect. 3.1.3). In the following subsections, we will elaborate on each one of these steps.

3.1.1 Constructing a Bipartite Network

To utilize information on vertices’ co-membership between communities, we begin by creating two new bipartite undirected networks based on the given partition maps. Let \(G := \langle V,E\rangle \) be a network, where V is a set of the network’s vertices, and E is a set of the network’s edges, and let \(\{c_{i}\}_{i=1}^n\in C\) be the set of n communities in G. We define the new bipartite network in the following manner: Using G, we constructed a new bipartite undirected network , where \(C^B:=\big \{{c_{i=1}^n}^B\in C^B\mid \forall {c^B_{i}}\in {C^B}, \exists {c_{i}}\in {C}\big \}\), and \(E^B:=\bigcup _{i=1}^nE_i^B\) where \(E_i^B:=\big \{{(c_i^B,v_j)}\mid v_j\in c_i \, and\, c_i\in C\big \}\). Namely, using G. we constructed a new bipartite network in which the one part is composed of all the vertices \(v\in V\) of G, and the other part consists of new vertices, where each vertex \(c^B_i\in C^B\) in BPG represents a community \(c_i\in C\) in G. The undirected edges between a community-representing vertex and a regular vertex in BPG stand for the belonging of the regular vertices to the corresponding communities in network G (see Fig. 1).

Fig. 1
figure 1

A an example of an overlapping-community-structured network G, where the set of vertices is \(\{1,2,3,4,5,6,7,8,9,10,11,12\}\in V\), the set of communities is \(\{c_{1},c_{2},c_{3}\}\in C\), and \(c_1=\{1,2,3,4\}\), \(c_2=\{3,5,6,7,8,11\}\), and \(c_3=\{6,9,10,11,12\}\). B The corresponding derived bipartite network BPG with the set of regular vertices V and community-representing vertices \(\{c^B_1, c^B_2, c^B_3\}\in C^B\), and the set of edges \(E^B\) which state the belonging of a vertex to a community

3.1.2 Constructing a Link-Prediction Classifier

After generating the communities’ bipartite network, the next step of our algorithm is to construct a link-prediction classifier. The classifier’s task is to produce a probability of the existence of an edge \((v,c^B)\) in BPG, given two vertices \(v\in V\) and \(c^B\in C^B\).

3.1.2.1 Feature Extraction

The task of link-prediction is addressed by numerous studies. Particularly, techniques based on deep neural networks [10, 18, 41] and stochastic gradient descent [32, 47] were proposed over the last decade and achieved state-of-the-art results. Since CMMAC is meant to be used in large-scale networks and to suit a variety of networks, we chose to utilize a method that efficiently extracts easy-to-compute features for link-prediction and feeds them into an efficient classification model. Fire et al. [24] presented that just by using computationally efficient features, it is possible to achieve highly accurate link prediction classifier. Similar to Fire et al. [24], we calculate a set of topological features for the edges to construct the link-prediction classifier. We used only features which are meaningful for bipartite networks and modified them to adapt for analyzing bipartite undirected networks. Namely, we define the following features:

  • Let be , a neighborhood \(\Gamma (u_i)\) is defined as the set of vertex \(u_i\)’s adjacent vertices:

    $$\begin{aligned} \Gamma (u_i):=\{u_j\mid (u_i,u_j)\in E^B\} \end{aligned}$$

    The bipartite network reasons the following property—a neighborhood of a vertex \(v\in V\) only contains community-representing vertices \(c^B\in C^B\) and vice versa.

  • The degree of is defined as:

    $$\begin{aligned} d(u_i):=\mid \Gamma (u_i)\mid \end{aligned}$$
  • For two vertices \(v\in V\) and \(c^B\in C^B\), the Total Friends of u and v is defined as the number of distinct friends that v and \(c^B\) have together:

  • As described in Sect. 3.2, the Preferential Attachment Score feature is based on the phenomenon that “rich” vertices increase their connectivity at the expense of the “poor” vertices [8]. We estimate how “rich” the two vertices \(v\in V\) and \(c^B\in C^B\) are, by multiplying their degrees:

    $$\begin{aligned} {\textit{PA}}(v,c^B) := \mid \Gamma (v)\mid \cdot \mid \Gamma (c^B)\mid \end{aligned}$$
  • The Friends Measure hints two vertices connection “strength” by the number of connections between two vertices \(v\in V\) and \(c^B\in C^B\) neighborhoods and is defined as:

    $$\begin{aligned} {\textit{FM}}(v,c^B):=\sum _{x\in \Gamma (v)}\sum _{y\in \Gamma (c^B)} \delta (x,y) \end{aligned}$$

    Where \(\delta (x,y)\) is defined as:

    $$\begin{aligned}\delta (x,y):= {\left\{ \begin{array}{ll} 1 &{} {\textit{if}} \quad x=y \, {\textit{or}} \, (x,y)\in E^B \\ 0 &{} {\textit{otherwise}} \end{array}\right. } \end{aligned}$$
  • The Shortest Path was demonstrated as a significant feature in the link-prediction task [35]. For two vertices \(v\in V\) and \(c^B\in C^B\) we define Shortest Path as:

    $$\begin{aligned}SP(v,c^B):= {\left\{ \begin{array}{ll} {\textit{shortest path length between}}\,c\,{\textit{and}}\,v^B\,{\textit{in BPG}} &{} {\textit{if}}\quad a\,{\textit{path exists}} \\ -1 &{} {\textit{otherwise}} \end{array}\right. } \end{aligned}$$

3.1.2.2 Classifier Construction Similar to Kagan et al. [40], we train a link-prediction classifier on an equivalent number of positive and negative examples, where the edges taken into account are the train set bipartite network \({\textit{BPG}}\) edges. We define a positive example as an existing edge \((v,c^B)\in E^B\), which stands for \(v\in c\), or the belonging of vertex v to community c in the original network G. We define a negative example, as a non-existing edge \((v,c^B)\notin E^B\), which implies vertex \(v\notin c\), namely, vertex v does not belong to community c in the original network G.

We uniformly sample positive and negative examples from the train network, and then calculate the features for each of the positive and negative edges in the train network and each edge in the test network. For each edge we calculate the edge features and the vertex features of both vertices (see Sect. 3.1.2.1). Finally, we utilize the XGBoost algorithm [17] to construct the bipartite link-prediction classifier. We chose XGBoost since previously conducted studies concluded XGBoost performs well in terms of accuracy and efficiency in several cases of link-prediction tasks [49, 56].Footnote 4

3.1.3 Detecting Anomalous Communities

After constructing the link-prediction classifier, we utilized it to create an unsupervised anomaly detection algorithm, which reduces the complexity of searching for anomalies in a large space (see Fig. 2). We utilized the link-prediction classifier to emit the existence probabilities of all edges in the test network. Next, we aggregated the probabilities of the edges of community-representing vertices in several forms to create different meta-features. Then, we ranked the community-representing vertices by each one of the meta-features. Lastly, we manually examined the communities indicated by the community-representing vertices ranked at the bottom margins to find anomalous communities.

3.1.3.1 Meta-Feature Extraction Inspired by Kagan et al. [40], we utilized the classifier to emit existence probabilities edges and aggregated them into meta-features. Based on the link-prediction classifier, we first provide formal definitions for the terms we use to describe the meta-features:

  • Let \(p(v,c^B)\) be the probability of the existence of an edge \((v,c^B)\) in \({\textit{BPG}}\) as emitted by the link-prediction classifier, where \(v\in V\) and \(c^B\in C^B\).

  • Let \({\textit{EdgeProbabilities}}(c^B):=\{p(v,c^B)\mid v\in \Gamma (c^B)\, and\, c^B\in C^B\}\) be the set of vertex \(c^B\) edges’ existence probabilities.

  • Let \({\textit{EdgeLabels}}(c^B):=\{{\textit{EdgeLabel}}(v,c^B)\mid v\in \Gamma (c^B)\, and\, c^B\in C^B\}\) be the set of vertex \(c^B\) edges’ labels, that is, the label classifications of the edges with respect to a predefined threshold, where \({\textit{EdgeLabel}}(v,c^B)\) is defined as:

    $$\begin{aligned} {\textit{EdgeLabel}}(v,c^B):= {\left\{ \begin{array}{ll} 1 &{} {\textit{if}}\quad p(v,c^B)\ge {\textit{threshold}} \\ 0 &{} {\textit{otherwise}} \end{array}\right. } \end{aligned}$$

Based on the above definitions, we define the following four meta-features as:

  • Edges Normality Probability Mean is defined as the probability of a community-representing vertex \(c^B\) to be normal, in other words, is the mean (\(\mu \)) taken over the existence probabilities of its edges:

    $$\begin{aligned} {\textit{EdgesNormalityMean}}(c^B):=\mu ({\textit{EdgeProbabilities}}(c^B)) \end{aligned}$$
  • Edges Normality Probability STDV is defined as one minus the standard deviation (\(1 - \sigma \)) of a set of vertex \(c^B\) edges’ existence probabilities:Footnote 5

    $$\begin{aligned} {\textit{EdgesNormalitySTDV}}(c^B):= 1 - \sigma ({\textit{EdgeProbabilities}}(c^B)) \end{aligned}$$
  • Predicted Edge Labels Mean is defined as the mean of the set of predicted labels of vertex \(c^B\)’s edges:

    $$\begin{aligned} {\textit{PredictedEdgeLabelsMean}}(c^B):=\mu ({\textit{EdgeLabels}}(c^B)) \end{aligned}$$
  • Predicted Edge Labels STDV is defined as the standard deviation of the set of predicted labels of vertex \(c^B\)’s edges:

    $$\begin{aligned} {\textit{PredictedEdgeLabelsSTDV}}(c^B):= 1 - \sigma ({\textit{EdgeLabels}}(c^B)) \end{aligned}$$

3.1.3.2 Meta-Feature Ranking After obtaining the meta-features of all community-representing vertices \(c^B\in C^B\) in the test network, we ranked the vertices by each one of the meta-features. We then manually examined the communities indicated by the corresponding k bottom vertices at each ranked meta-feature, where k is a defined threshold.

Fig. 2
figure 2

Algorithm overview. A After constructing the bipartite network, we created a link-predictor based on its topological features and predict edges’ existence. B For each community-representing vertex, we aggregate the predicted probabilities into meta-features, for example, averaging them. C We rank them by the meta-features. D We fetch the corresponding original communities and manually examine them

3.2 Anomaly-Infused Community-Structured Random Network Generator

To evaluate the proposed method, we striven to generate community-structured networks similar to real-world scenarios. A mutual property of many complex networks is that the vertex connectivity follows a power-law distribution [23]; reflecting the fact new vertices attach preferentially to existing high-degree vertices [8]. Furthermore, in real-world networks, the majority of the communities have a certain extent of overlap [1, 46], there exist vertices’ co-memberships between them.

Based on the above two statements, we reasoned generating a network where each community follows preferential attachment property and generating connections between these communities, would be a well-suited notion to mimic real-world overlapping community-structured networks. A simple implementation of the described concept would be to generate subnetworks using the Barabási–Albert algorithm [8], i.e., creating subnetworks by adding new vertices, each with m edges attached preferentially to existing vertices with high degree, and then connect them by connecting pairs of vertices from different subnetworks with a certain probability p.

We developed an algorithm encapsulating the essence of this concept and generalizing it further. Our algorithm creates subnetworks of two types, normal and anomalous, using two different random network generating algorithms. It then connects them in a “dual-preferential” manner, by connecting vertices from “new” subnetworks to existing subnetworks by a probability corresponding to the subnetworks’ sizes, and within the chosen subnetworks to vertices with a probability corresponding to their degrees. The connected subnetworks are considered as overlapping communities in the created network. The two types of random network generating algorithms allow different types of “normal” and “anomalous” communities in the generated network.

The algorithm creates an overlapping community-structured network composed of normal and anomalous communities, given the following four parameters for each of the groups, normal communities and anomalous communities (see Algorithm 1): (1) Random network generating algorithm (denoted \({\textit{alg}}\)), (2) a list of communities’ sizes to create (denoted \({\textit{comm}}\_{\textit{sizes}}\)), (3) arguments needed for random network generating algorithm (denoted \({\textit{args}}\)), and (4) a fraction of inter-connection to create between communities (denoted \({\textit{inter}}\_p\)). It returns the network, as well as its partition map describing the communities’ belonging vertices. A detailed description of the algorithm is presented in “Appendix A”, and we have published the implementation of the algorithm as an open-source code. An evident example and a comprehensive explanation of how to choose parameters for the algorithm are provided in “Appendix B”.

figure a

4 Experimental Setup

4.1 Data Description

We evaluated our algorithm on two labeled datasets and performed two case studies by applying our algorithm on two unlabeled networks. We generated over 10 GB of networks data to be utilized throughout the study. The following subsections describe the creation of the datasets.

4.1.1 Labeled Datasets

To the best of our knowledge, there are no publicly available network datasets with labeled anomalous communities. To evaluate our proposed algorithm we utilized two labeled datasets: (1) Networks created from Reddit subnetworks infused with anomalous communities, and (2) fully simulated networks with anomalous communities created by our network generator.Footnote 6 To avoid cherry-picking and to learn both strengths and weaknesses of CMMAC, we created the datasets such that they represent various anomalous communities’ situations, and specifically contain the regions where CMMAC changes from underperforming to outperforming the other methods.

In the following subsections we describe the processes of acquiring real-world data, its perturbation to introduce labeled anomalous communities, and the generation of the fully simulated labeled networks.

4.1.1.1 Real-World Networks Infused with Artificial Anomalies

Reddit is a popular collection of forums where people share news, content, or comment on others’ posts. Reddit is composed of hundreds of thousands of communities, also called “subreddits.” Each subreddit is devoted to a different topic such as sports, sciences, and events [66]. Using Reddit data, Jason Michael Baumgartner constructed a massive dump of Reddit comments, which he published and maintained [38]. This dataset contains the ID and the time each comment was posted, the subreddit it was posted in, the user who posted the comment, and the ID of the parent comment.Footnote 7 In this study we utilized data obtained from the Reddit comments dataset, cleaned, and preprocessed by Fire and Guestrin [23].Footnote 8 The data contains over 2.37 billion posts posted from December 2005 through October 2016, by 19.72 million unique users, in 20,136 subreddits, each with more than 1000 comments.

To evaluate CMMAC on anomaly-infused real networks, we utilized the Reddit comments dataset to create 1000 networks and infused them with generated anomalous communities for creating a dataset with ground truth labels. To create each of the anomaly-infused real networks, we sampled random subreddits from the Reddit comments dataset and constructed their networks. To follow the overlap property of real networks [46] and to preserve a certain degree of co-membership information, which is required for CMMAC, we constrained each sampled subreddit to have at least three users in common with at least two other subreddits in the network.

Formally, for each subreddit \(s_i, i=1\ldots k\), we define the subreddit’s network to be: \(G^i:=\langle V^i, E^i\rangle \), where \(V^i\) is the set of vertices representing unique users who posted or commented within the subreddit \(s_i\), and \(E^i\) is the set of edges representing connections between users in subreddit \(s_i\). Each edge \((u,v)\in {E^i}\) exists if a user u replied to a comment, or as been replied to, by a user v, within subreddit \(s_i\). For each subreddit \(s_i, i=1\ldots k\), exists at least two subreddits \(s_m\) and \(s_j\), where and . Next, we merged the k networks into a single network, that is, \(G:=\langle V, E\rangle \), where and .

Table 1 Networks created by merging Reddit comments dataset subreddits’ networks, each composed of 110 subreddits

Lastly, we utilized some functionality of our network generator, specifically, only the anomalous parameters tuple, (see Sect. 3.2) to generate a anomalous communities and attach them to the network. We attached them in a “dual-preferential” manner, by connecting vertices from the generated communities to subreddits chosen by a probability corresponding to their sizes, and within them to vertices chosen by a probability that corresponds to their degrees. Since the “new” vertex connected with a high chance to a “central” vertex in the subreddit it attached to, we considered it as part of the subreddit (A detailed description of our network generator and the “dual-preferential” attachment property is presented in “Appendix A”).

In this study, we constructed 1000 networks as described above, by creating five networks composed of \(k=110\) subreddits (see Table 1), and attaching each of them 200 distinct sets of ten anomalous communities (\(a=10\)), where each generated with a different combination of parameters fed to our network generator. The motivation for choosing the parameters is to get experiment results that avoid cherry-picking, and properly present the regions where CMMAC changes from underperforming to outperforming the other methods to learn and report its strengths and weaknesses. (For further details on the selection of network generation parameters, see “Appendix B”). Specifically, we used the following parameters grid: (1) \({\textit{alg}}_{{\textit{anom}}}=\)ErdősRényii [22], (2) \({\textit{args}}_{{\textit{anom}}}\in {\{0.05, 0.1, 0.2, 0.4, 0.8\}}\)Footnote 9(3) \({\textit{inter}}\_p_{{\textit{anom}}}\in {\{0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4\}}\), and (4) \({\textit{comm}}\_{\textit{sizes}}_{{\textit{anom}}}\in {\{q_{0}, q_{0.1}, q_{0.25}, q_{0.5}, {\textit{random}}\}}\).Footnote 10

4.1.1.2 Fully Simulated Networks Boshmaf et al. [11] studied the vulnerability of OSNs’ to large-scale infiltration by socialbots. They created a Socialbot Network (SbN), that is, a community of fake users that form many connections among each other to generate attraction from regular users. Next, the fake users randomly connect to real users in the targeted OSN. Then, to avoid detection due to anomalous structure, or due to the detection of one fake user who presented anomalous behavior, the SbN decomposes by deleting connections between the fake users. Finally, the SbN performs an attack of choice, usually information harvesting for spreading fake news Boshmaf et al. [11].

Inspired by Boshmaf et al. [11], we utilized our network generator (see Sect. 3.2) to evaluate CMMAC on synthetic networks that simulate different points in the progress of the SbN decomposition and different networks’ properties, by generating 1000 anomaly-infused community-structured random networks.

We chose parameters for the network generator so (1) the “normal” part of each network imitates the properties of a real network, in particular, the Reddit’s network described in Section 4.1.1.1, and (2) the “anomalous” part will provide experimental results that properly present the regions where CMMAC starts outperforming the other methods. For further details on the selection of network generation parameters see “Appendix B”.

We constructed the 1000 fully simulated networks using the following parameters: (1) \({\textit{alg}}_{{\textit{norm}}}=\)BarabásiAlbert [8], (2) \({\textit{args}}_{{\textit{norm}}}=1\)Footnote 11 (3) \({\textit{inter}}\_p_{{\textit{norm}}}=0.075\), (4) \({\textit{comm}}\_{\textit{sizes}}_{{\textit{norm}}}\in {\{{\textit{random}}\_{\textit{sample}}_i, i=1..5\}}\), where \({\textit{random}}\_{\textit{sample}}\) is a set of 5 distinct lists, each composed of 110 community sizes, sampled from the Reddit comments dataset subreddit’s sizes distribution, (5) \({\textit{alg}}_{{\textit{anom}}}=\)ErdősRényii [22], (6) \({\textit{args}}_{{\textit{anom}}}\in {\{0.01, 0.02, 0.04, 0.08, 0.16\}}\) to compensate for the relatively low average degree of the normal communities, (7) \({\textit{inter}}\_p_{{\textit{anom}}}\in {\{0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4\}}\), and (8) \({\textit{comm}}\_{\textit{sizes}}_{{\textit{anom}}}\in {\{q_{0}, q_{0.1}, q_{0.25}, q_{0.5}, {\textit{random}}\}}\) is a list of ten community sizes, sampled from the 110 normal community sizes distribution, where \(q_{x}\) denotes quantile x of the distribution and \({\textit{random}}\) denotes uniform random sampling from the distribution.

4.1.2 Unlabeled Real-World Networks

This section describes the process of acquiring, cleaning, and preprocessing real-world unlabeled network datasets, to which we applied our algorithm to discover meaningful insights. We utilized the Reddit comments dataset, specifically data from the “r/Place” project, and the Hebrew Wikipedia revisions data.

4.1.2.1 Reddit’s r/Place Network

On April 1st, 2017, a collaborative project and social experiment called “r/Place” was initiated by Reddit [19]. The creators created a white \(1000\times {1000}\)—pixel canvas and posted it online with a call for users to edit it, hinting them for collaboration. Users could only change one pixel color every 5 mins. After 72 h, and more than a million unique users, the canvas colors had been changed more than 16.5 million times. The canvas turned into a beautiful skirmish of nations’ flags and symbols, ideologies, famous paintings and characters, and much more.

We utilized the Reddit comments dataset [38] by filtering 5.8 million comments from over one million unique users in 12,870 subreddits posted between April 1st through April 4th at midnight, particularly, at the time of the r/Place project. We cleaned the data by removing comments that did not include information about the author. We then grouped the comments by subreddit and cleaned the data further by removing subreddits that contained less than 50 comments.

We processed the data further by filtering in only the 610 subreddits that actively participated in the r/Place project,Footnote 12 and from those we chose 468 subreddits that ranged in size from 50 to 2500. We created a network from the resulting 468 subreddits with the same process described by Fire and Guestrin [23]. The formal definition is similar to the definition described in Section 4.1.1.1, without overlapping constraints. The resulting network consisted of 181,019 vertices and 339,306 edges.

4.1.2.2 Hebrew Wikipedia Revisions Network

Wikipedia is a free, open-content collaborative online encyclopedia maintained by volunteering editors, also called Wikipedians. Wikipedia is one of the most popular websites [58], containing more than 174 million articles, in more than 300 languages, and which are read monthly by 1.5 billion unique visitors as of November 2020 [26]. In the last 2 years, the articles have been maintained by an average of 45 million edits per month, also called Revisions, which are performed by an average of 70 thousand active Wikipedians [26].

We utilized Quarry,Footnote 13 an online public interface for running SQL queries against the Wikipedia database, to acquire all revisions made to articles in the Hebrew Wikipedia between January 1st, 2016, and July 14th, 2020 (To view the utilized query See “Appendix D”). The data contains almost 7.5 million revisions, performed by 295,263 Wikipedians, in 269,355 articles. The revisions dataset contains the revision ID, as well as information about the Wikipedian who performed the revision, its timestamp, the ID of the parent revision,Footnote 14 and the article title that was revised. We inner-joined the data with itself on the \({\textit{parent}}\_{\textit{revision}}\_{\textit{id}}\) attribute to get the network of revising Wikipedians answering to each other, where the articles they revise are considered as communities.

We further preprocessed the data by filtering in only articles with between 20 and 80 distinct revising users and users that revised between 5 and 300 unique articles, to enforce overlap between articles regarding revising users. The resulting data contained 72,633 revisions done in 2123 articles.

Formally, for each article \(a_i, i=1\ldots k\), we define the article network to be: \(G^i:=\langle V^i,E^i\rangle \), where \(V^i\) is the set of vertices representing unique Wikipedians who revised article \(a_i\), and \(E^i\) is the set of edges, representing the connections between Wikipedians within article \(a_i\). Each edge \((u,v)\in {E^i}\) exists if Wikipedian u revised Wikipedian v’s revision or the opposite, within article \(a^i\). We then constructed the whole Hebrew Wikipedia revisions network by merging the article networks; that is, \(G:=\langle V,E\rangle \), where and . The resulting network consisted of 12,736 vertices and 13,765 edges.

4.2 Experiments

To extensively evaluate our algorithm, we utilized the labeled datasets described in Sect. 4.1. We first split the datasets to train and test sets as follows: For each of the labeled networks described in Sect. 4.1.1, we selected 20 communities for the train set to train a suitable link-prediction classifier and 100 communities for the test set. Furthermore, we split the communities, so the train set was composed only of normal communities, while the test set was composed of 90 normal communities and ten anomalous communities (10%), which represents an estimation of anomalies percentage in an average social network [40]. For the \(r/{\textit{Place}}\) dataset described in Sect. 4.1.2.1, we randomly chose 100 communities for the train set and 350 communities for the test set. Finally, for the Hebrew Wikipedia revisions dataset described in Section 4.1.2.2, we randomly selected 100 articles for the train set and 1000 articles for the test set. For further details on our train-test split methodology, see “Appendix C”.

For each of the labeled datasets, we began by analyzing the ranking predictive ability of each of the meta-features, namely, by ranking the communities by each of our algorithm’s meta-features, comparing the resulting rankings, and choosing one meta-feature to use for the comparison to other methods. Then, we compared CMMAC’s performance to other methods with respect to the three parameters (\({\textit{comm}}\_{\textit{sizes}}_{{\textit{anom}}}\), \({\textit{args}}_{{\textit{anom}}}\), and \({\textit{inter}}\_p_{{\textit{anom}}}\)) we used to create and attach the anomalous communities in our experiments.

The other methods we utilize for the comparison are the following known topology-based measures and methods: (a) Average degree [16]—the average degree of all vertices in a community; (b) Cut ratio [68]—the fraction of existing cut edges out of all possible edges; (c) Conductance [6]—the fraction of total edge volume that points outside the community); (d) Flake-ODF [27]—the fraction of community vertices that have fewer edges pointing inside the community than to the outside; (e) Average-ODF [27]—the average fraction of the community cut; (f) AMEN [53]; and (g) ADENMN [7].

To utilize the latter two algorithms in our study,Footnote 15 we implemented in Python the Unattributed-AMEN/ADENMN algorithm, which is based on AMEN [53] and ADENMN [7] (both share the same topological-based part), but only considers the topological structure of the network and ignores the vertex attribute-based logic. Namely, it omits the vectors of attributes and corresponding weights, and the weights learning process. To be precise, Unattributed-AMEN/ADENMN implements the “normality score” expressed by:

$$\begin{aligned} N=\sum _{\begin{array}{c} i\in {C}, j\in {C}, \\ i\ne {j} \end{array}}\left( A_{ij}-\dfrac{k_i\cdot {k_j}}{2\cdot {\mid E\mid }}\right) - \sum _{\begin{array}{c} i\in {C}, b\in {B}, \\ (i,b)\in {E} \end{array}}\left( 1-\min \left( 1, \dfrac{k_i\cdot {k_b}}{2\cdot {\mid E\mid }}\right) \right) , \end{aligned}$$

where \(k_i\) denotes the degree of vertex i, C denotes a community, A denotes the adjacency matrix of C, B denotes the set of boundary-vertices of C, and E denotes the set of edges in the whole examined network.

Since our algorithm is a ranking algorithm, we utilize evaluation measures from the field of information retrieval. Most of the baselines we compare to are simple, while we consider AMEN and ADENMN more advanced algorithms. To properly compare CMMAC to them, we use the same evaluation measure they used [7, 53], average precision obtained from the AUC of the precision-recall curve [60]. To compare CMMAC meta-features we use the measured MAP, obtained by taking the mean of several average precision scores.

To uncover anomalous communities in the unlabeled datasets, we first utilized CMMAC to rank the communities in each of the networks’ test sets. Then, to reduce the problem searching space, we selected only the three communities ranked at the bottom by each meta-feature and intersected them into one set. We also utilized the described baselines to rank the communities and intersected their three bottom-ranked communities as well.

To report reliable results, since the data was unlabeled, we manually examined each of the resulting communities, both by CMMAC and by the other methods, to seek anomalies: (1) in the Reddit r/Place project dataset, we comprehensively reviewed posts during and related to the r/Place project, within the examined subreddit, as well as posts that generally review the r/Place project and mentions the examined subreddits, and looked up anomalous behavior, and (2) in Wikipedia, we developed a code that produces a list of differences between each pair of consecutive revisions for a given article throughout a specific period. The output is composed of deletions and additions of content, as well as special actions, such as page protection activation. Finally, we sought anomalous behavior through extensive reviewing of the differences.

5 Results

The following section presents the results obtained from the experiments we conducted. First, we describe the evaluation results of the labeled datasets (see Sect. 5.1). Then, we present the communities that were revealed when applying our method to two unlabeled real-world network use cases (see Sect. 5.2).

5.1 Labeled Datasets

To evaluate our method on labeled datasets, we utilized the 2,000 networks we created in two datasets, as described in Sect. 4.1.1. We first analyze the predictive ranking ability of the meta-features in each of the datasets. Within the Reddit-based networks dataset, among all meta-features, the \({\textit{EdgesNormalitySTDV}}\) achieved the highest MAP score of 0.526 (see Fig. 3), while within the fully simulated networks dataset, the \({\textit{PredictedEdgeLabelsMean}}\) and the \({\textit{PredictedEdgeLabelsSTDV}}\) meta-features achieved the highest MAP score of 0.554 (see Fig. 4). Consequently, we utilized the latter meta-features to compare CMMAC to the other methods. Specifically, we utilized the \({\textit{EdgesNormalitySTDV}}\) meta-feature to report the comparison results in the Reddit-based networks dataset (see Fig. 5), and the \({\textit{PredictedEdgeLabelsSTDV}}\) meta-feature to report the comparison results in the fully simulated networks dataset (see Fig. 6).

Fig. 3
figure 3

a Average precision achieved by each meta-feature when applied to Reddit-based anomaly-infused networks. Each bar color indicates a different meta-feature, and the whiskers indicate the confidence interval obtained by 125 networks’ results. b MAP score of each meta-feature, calculated by taking the mean average precision scores at all \({\textit{inter}}\_p_{{\textit{anom}}}\) values range

Fig. 4
figure 4

a Average precision achieved by each meta-feature when applied to fully simulated networks. Each bar color indicates a different meta-feature, and the whiskers indicate the confidence interval obtained by 125 networks’ results. b MAP score of each meta-feature, calculated by taking the mean average precision scores at all \({\textit{inter}}\_p_{{\textit{anom}}}\) values range

Fig. 5
figure 5

Evaluation of CMMAC’s \({\textit{EdgesNormalitySTDV}}\) meta-feature and comparison to other methods on Reddit-based anomaly-infused networks. Each row of sub-plots describes a different \({\textit{args}}_{{\textit{anom}}}\) value, namely, the probability of edge existence between each pair of vertices in anomalous communities (density). Each column of subplots describes a different \({\textit{comm}}\_{\textit{sizes}}_{{\textit{anom}}}\) value, that is, a quantifier describing the size of the anomalous communities as compared to the normal communities. At each of the subplots the X-axis describes \({\textit{inter}}\_p_{{\textit{anom}}}\) values, namely, the percentage of vertices in the anomalous communities that form “dual-preferential” inter-connections to normal communities. For each of the parameters obtained by this grid, we measured the average precision score, obtained from the AUC of the precision-recall curve, [60] which the Y-axis describes

Fig. 6
figure 6

Evaluation of CMMAC’s \({\textit{PredictedEdgeLabelsSTDV}}\) meta-feature and comparison to other methods on fully simulated networks. Each row of subplots describes a different \({\textit{args}}_{{\textit{anom}}}\) value, particularly the probability of edge existence between each pair of vertices in anomalous communities (density). Each column of subplots describes a different \({\textit{comm}}\_{\textit{sizes}}_{{\textit{anom}}}\) value, that is, a quantifier describing the size of the anomalous communities as compared to the normal communities. At each of the subplots, the X-axis describes \({\textit{inter}}\_p_{{\textit{anom}}}\) values, namely, the percentage of vertices in the anomalous communities that form “dual-preferential” inter-connections to normal communities. For each of the parameters obtained by this grid, we measured the average precision score, obtained from the AUC of the precision-recall curve, [60] which the Y-axis describes

5.2 Unlabeled Real-World Networks

This section presents the findings we discovered in two real-world unlabeled datasets. We utilized our algorithm to rank the communities by each of the meta-features. We then chose only the distinct communities ranked at the three lowest rankings by each of the meta-features (see subreddits list in “Appendix E”), thereby we reduced the searching space drastically. We then manually examined each of the resulting communities described in Sect. 4.2 and encountered interesting case studies. To verify the fidelity of the results obtained by CMMAC, we also utilized all the other methods described in Sect. 4.2 and manually examined their results.

5.2.1 Reddit’s r/Place Network

To evaluate CMMAC on Reddit’s r/Place project test set we utilized the network construction method described in Section 4.1.2.1. We utilized CMMAC to rank the subreddits by each of the meta-features, as well as utilized the other methods to rank the subreddits. We selected the three bottom-ranked subreddits (see Tables 3 and 4). Intersecting the three bottom-ranked subreddits resulted in ten distinct subreddits returned by CMMAC, and nine distinct subreddits returned by the other methods, where there are no common subreddits between our algorithm and the other methods.

We manually examined all the subreddits as described in Sect. 4.2, and exposed the following subreddits which presented abnormal behaviors, which were returned by CMMAC:

  • r/BlueCornerFootnote 16—According to the Redditor Andrewcshore315 [62] a member of the r/TheBlueCorner subreddit, the r/BlueCorner began as a violent subreddit that tried to paint the whole canvas with blue pixels, ruining other artifacts on its way. Due to its behavior, it quickly gained enemies, which made its users cease to cooperate and eventually abandon it. Many of the deserting users joined a new subreddit called r/TheBlueCorner, which was led by new leadership, this time aiming to maintain the structure of the blue corner while respecting and protecting other arts. The subreddit r/BlueCorner was ranked 349 out of 350, that is, 2nd from the bottom by CMMAC’s \({\textit{PredictedEdgeLabelsSTDV}}\) meta-feature. For the rankings achieved by the other meta-features and by the other methods see Table 2.

  • r/COMPLETEANARCHYFootnote 17—A comprehensive inspection involving a systematic investigation through its posts during and related to the r/Place project, as well as questioning some of the participants brought up that subreddit r/COMPLETEANARCHY presented an anomalous behavior—a complete failure of collaboration. Namely, there were many attempts to propose ideas, tactics, and courses of action, which were hardly commented upon and were never executed. Today there are no traces of their participation in the r/Place project. The subreddit r/COMPLETEANARCHY was ranked 348 out of 350, that is, 3rd from the bottom by CMMAC’s \({\textit{PredictedEdgeLabelsSTDV}}\) meta-feature. For the rankings achieved by the other meta-features and by the other methods see Table 2.

We could not detect any anomalous subreddits among the subreddits returned by the other methods.

5.2.2 Hebrew Wikipedia Revisions Network

To evaluate CMMAC on the Wikipedia revisions network test set, we utilized the network construction method described in Sect. 4.1.2.2. We utilized CMMAC to rank the article-representing communities by each of the meta-features, as well as utilized the other methods to rank the communities. We selected the three bottom-ranked articles (see Tables 5 and 6). Intersecting the three bottom-ranked articles resulted in six distinct articles returned by CMMAC and 12 distinct articles returned by the other methods. CMMAC’s results and the other methods’ results shared one common article.

We manually examined all the articles as described in Sect. 4.2, and detected the following article, which presented anomalous behavior within its revisions history according to CMMAC results:

  • COVID-19 effects on the (Israeli) education systemFootnote 18 [67] - The article COVID-19 effects on the education system [67] was created in April 2020 and was dedicated to the effect of COVID-19 on the Israeli education system. In the two and a half months it existed within our dataset, many of its revisions were in regard to the gaps between the different Israeli society sectors, and government criticism. This article was used as a fertile ground for a political scuffle, due to its fast popularity gaining on account of COVID-19 related news article. The article COVID-19 effects on the education system was ranked 998 out of 1000 articles, that is, 3rd from the bottom by CMMAC’s \({\textit{EdgesNormalitySTDV}}\) meta-feature. For the rankings achieved by the other meta-features and by the other methods see Table 2.

We could not detect any anomalous articles among the articles returned by the other methods.

Table 2 The ranking of the anomalous communities we ranked by each of CMMAC’s meta-features and other methods

6 Discussion

By analyzing the results presented in Sect. 5, the following can be noted:

First, by analyzing the behavior of CMMAC along the X-axis, namely the \({\textit{inter}}\_p_{{\textit{anom}}}\) values, at each of the subplots in Figs. 5 and 6, we can conclude that CMMAC performance is correlated to the fraction of inter-connections between anomalous communities and other communities. Specifically, it performs better as the fraction of inter-connections arises. Namely, when more cross-boundary edges exist between anomalous and other communities. As the percentage of inter-connections increases, more vertices communities’ co-membership information is available to CMMAC, thereby enhancing its performance. However, the inferior results achieved by setting lower inter-connections fraction values indicate a limitation of CMMAC. Namely, CMMAC depends on a somewhat degree of overlap between communities in the examined network. Networks without overlapping communities lack essential information for CMMAC to work properly. Nonetheless, most of the communities in real-world networks tend to overlap [1, 46].

Second, CMMAC requires inputs in form of partition maps that indicate each community’s contained vertices. The creation of the partitions maps relies on a preliminary step of detecting overlapping communities in the observed network. The latter is a hard task, especially when utilizing only structural properties [63]. The combination of the dependency on detecting the overlapping communities, and the fact overlapping between communities is essential for CMMAC, presents a limitation of our approach. When we utilize non-network data and model it as a network in which we create communities according to the definition of the problem, we skip the need of detecting overlapping communities. For example, the creation of the Wikipedia revisions network is described in Sect. 4.1.2.2.

Third, by analyzing the subplots along the rows in Figs. 5 and 6, namely, the densities of the anomalous communities, we can conclude CMMAC is not affected by the density of a community, while all the other internal-consistency-basedFootnote 19 methods are, videlicet, Average degree, Conductance, Flake-ODF, and Unattributed-AMEN/ADENMN. Specifically, all the internal-consistency-based methods’ performances degraded as the anomalous communities get sparser.

Fourth, by examining the evaluation results concerning the size of anomalous communities’, i.e., along the columns in Figs. 5 and 6, we infer CMMAC is not affected by the size of a community, whereas all the other methods are affected by the size. In particular, the other methods achieve poorer scores as the anomalous communities become smaller.

Fifth, according to the overall evaluation results in Figs. 5 and 6, we can conclude that CMMAC outperforms other methods in the cases where the properties of the anomalous communities become similar to the rest of the communities, and when there are many cross-boundary edges between the anomalous communities and the other communities. Simply put, in the scenarios where the anomalous communities are small, sparse, and hard to separate from the other communities. It is important to keep in mind the latter finding was achieved and holds for a network whose structure follows a power-law distribution. To the best of our knowledge, no other method utilizes co-membership information. Particularly, the methods we utilized as baselines are founded upon either internal-consistency, external-separability,Footnote 20 or both. The mutual property of all these methods (apart from Average degree) is that they all degrade when the boundaries fade, that is, when the fraction of inter-connections arises.

Sixth, we showed that CMMAC is a suitable solution for identifying malicious communities in an OSN, such as Socialbot Networks. Their fake users connect randomly to other normal users and then detach their internal edges [11]. However, we presume CMMAC would be less effective in cases where the malicious communities present a “more specific” strategy of connecting to other communities, other than the “dual-preferential” attachment, such as connecting to vertices with similar attributes. We believe the described case will result in fewer “unexpected” edges, which in their turn, will contribute more “false” data for CMMAC’s link-predictor. In the future, we intend to improve our Anomaly-Infused Community-Structured Random Network Generator by adding an “attribute-oriented” attachment functionality, to simulate such cases. To enable CMMAC to handle such cases, we plan to reinforce its link-predictor with features that are based on attributes of vertices.

Seventh, regardless of the latter specific case we described, we firmly believe that utilizing attributes of vertices will enhance the performance of CMMAC. The advantage of the generality of CMMAC will not decrease since attributes of vertices could be utilized generically without needing specific domain knowledge or understanding. For this reason, we also aim to equip CMMAC with the functionality of utilizing vertices’ attributes generically and feeding them into the link-predictor, which will result in more accurate results of CMMAC. Determining the attributes of vertices is straightforward. Notwithstanding, determination and exploitation of community-representing vertices’ attributes require certain manipulation of attributes’ information, which should be further researched and developed.

Eighth, an approach of classifying anomalous communities rather than reducing the manual searching space by ranking is undoubtedly preferable. However, uncovering anomalous communities is a challenging task, particularly since, to the best of our knowledge, there are no existing labeled datasets and because different networks present different anomalous behaviors. We intend to enhance CMMAC further by developing the ability to receive a semi-labeled dataset and train a classifier that utilizes the meta-features and possibly additional features and the labeled communities and to classify the rest of the communities.

Finally, according to the real-world non-labeled networks results (see Sect. 5.2), we demonstrate CMMAC can be applied to detect anomalous communities “in the wild” in different domains by ranking communities that presented abnormal behavior at the bottom (see Table 2). The two non-labeled datasets we tested are a relatively small sample to test, hence, we intend to test CMMAC on more real-world non-labeled datasets. While uncovering anomalous communities in “native-network” Footnote 21 data, such as Reddit’s r/Place project network, is a trivial task, the Wikipedia revisions network is an example of structuring non-trivial data into a network and utilizing CMMAC to detect anomalies within it. We depicted articles as communities and the Wikipedians who edited them as vertices, and by utilizing CMMAC, we uncovered articles containing anomalous revisions history. By generalizing this example, we believe CMMAC can be utilized to detect anomalies in a variety of domains, in which there exists data that can be modeled as a community-structured network and the resulting network contains a certain extent of overlap between communities.

7 Conclusion and Future Work

The detection of anomalous communities in complex networks is becoming progressively prominent in our networked world. We present a novel generic method for detecting abnormal communities based solely on the co-membership of vertices to communities. Our approach is composed of graph theory notions and straightforward yet accurate machine-learning-based link-prediction techniques. In addition, we developed an algorithm that generates an overlapping community-structured random network to empower further research in the field.

We evaluated our method on 1000 networks generated by us and on 1000 networks sampled from Reddit’s comments dataset, where each contained tens of thousands of vertices and edges. We demonstrated our method succeeds in the scenarios where other known methods fail, specifically, when the anomalous communities are well disguised in the background, namely, they are sparse and heavily connected to other communities. We further demonstrated our method could detect anomalous communities in real-world networks by uncovering a violent subreddit and a collaboration-failing subreddit in the Reddit comments network and a Wikipedia article filled with inciting revisions.

Our open framework can be instantly utilized to gain insights into any data modeled as a community-structured network while providing a cost-effective practice that reduces a massive space of potential anomalies to a relatively small, threshold-dependent number of options to explore.

Future directions could be to add more structural features, such as edge weights, and to add vertex attributes to be used as features to enhance the community membership prediction ability in specific domains. Additionally, we aim to examine the use of more advanced techniques based on deep neural networks to construct the link-prediction classifier, to overcome the possible sub-optimal results achieved by the hand-crafted features. Moreover, we intend to transform CMMAC into a classifying algorithm rather than a ranking algorithm, in cases where it is applicable, that is, when the dataset is partially labeled.