1 Introduction

Bipartite graphs have been widely used in real-world applications to model relationships between entities of different types, such as customer-product networks [50], author-paper networks [29] and user-event networks [13]. A bipartite graph is denoted by \(G=(V_L,V_R,E)\), where the vertex set is partitioned into two disjoint subsets \(V_L\) and \(V_R\) which are referred to as the L-side and R-side vertices of the bipartite graph, respectively; each edge \(e\in E\) can only connect vertices from different sides. Finding dense subgraphs in a bipartite graph is of great significance and encompasses many applications, such as community detection [1, 27], anomaly detection [16, 44], and group recommendation [37, 46].

One classic notion of dense bipartite subgraph is biclique [41], which requires every pair of vertices from different sides of the subgraph to be connected by an edge. For example, for the bipartite graph in Fig. 1 which represents researchers publishing in conference venues, the subgraph in the shadowed area is a biclique. In the literature, many algorithms have been proposed to enumerate all maximal bicliques [1, 3, 14, 31, 34, 49, 56] and to identify a biclique of the maximum size [37]. However, the biclique model has a fundamental limitation: vertices in a biclique are not necessarily similar to each other, despite that they share a set of common neighbors (i.e., vertices on the other side of the biclique). Consider the six researchers in the biclique in Fig. 1, all of them publish in database conferences. Besides, researchers \(r_1,r_2,r_3\) also publish in machine learning (ML) conferences, while \(r_4,r_5,r_6\) publish in high-performance computing (HPC) conferences. Thus, the two groups of researchers, \(\{r_1,r_2,r_3\}\) and \(\{r_4,r_5,r_6\}\), are likely from different backgrounds and communities, i.e., ML versus HPC.

Fig. 1
figure 1

Example of researcher-venue bipartite graph

Motivated by this, we formulate the notion of similar-biclique by requiring all vertices from one side of the biclique to be similar to each other. Our empirical studies show that similar-bicliques can be detected much more efficiently than bicliques. Identifying similar-bicliques is useful for the following applications.

  • Community detection. Similar-biclique satisfies all the key requirements of community structure for bipartite graphs [27, 51, 57], and thus can be used to detect communities in interaction-type bipartite graphs such as user-rate-movie, customer-buy-product, and author-write-paper. Firstly, being a biclique, interactions between vertices from the two sides are intensive. Secondly, by enforcing the similarity constraint, users in a similar-biclique are similar to each other, e.g., having similar behaviors or interests.

  • Anomaly detection. Similar-biclique can also be used for anomaly detection, which is a common task in e-commerce [4, 12, 37]. Here, the transactions of customers purchasing products form a customer-product bipartite graph. To improve the ranking of certain products, fraudsters may create fake accounts to purchase the products, i.e., click farming [10]. These fake accounts and the products they promote inevitably form a densely connected group, and meanwhile, these fake accounts will display a high level of synchronized behavior with each other [21]. Thus, suspicious groups (i.e., both the fraudulent accounts and the products they promote) can be captured by similar-bicliques.

Formally speaking, given a similarity threshold \(0 < \varepsilon \le 1\) and a size constraint \(\tau \ge 0\), a vertex subset \(C \subseteq V_L\cup V_R\) in a bipartite graph \(G=(V_L,V_R,E)\) is a similar-biclique if (1) C is a biclique (i.e., \(C_L\times C_R \subseteq E\)), (2) all vertices of \(C_L\) are similar to each other (i.e., \({\textsf{sim}} (u,v) \ge \varepsilon , \forall u,v \in C_L\)), and (3) C satisfies the size constraint (i.e., \(|C_L|\ge \tau \) and \(|C_R|\ge \tau \)). Here, \(C_L\) denotes \(C\cap V_L\) and \(C_R\) denotes \(C\cap V_R\); \({\textsf{sim}} (u,v)\) measures the structural similarity between u and v, which is computed based on their neighbors N(u) and N(v) and will be formally defined in Sect. 3; the size constraint \(\tau \) is introduced to avoid generating too small or too skewed results. Note that, we only apply the similarity constraint to one side of the vertices (either \(V_L\) or \(V_R\) as they are interchangeable), since in applications we are usually only interested in the similarity between “users.” Nevertheless, the general technical ideas presented in this work can also be applied to the variant of similar-biclique that imposes the similarity constraint on both sides of vertices.

We in this work aim to enumerate all maximal similar-bicliques in a bipartite graph. We prove that this problem is #P-complete. As each (maximal) similar-biclique is contained in a maximal biclique, we could first enumerate all maximal bicliques, then extract maximal similar-bicliques from maximal bicliques, and finally eliminate all similar-bicliques that are either duplicates or not maximal. However, this approach is inefficient, as enumerating all maximal bicliques by the state-of-the-art algorithm \(\textsf{ooMBEA}\) [8] is still time consuming for large graphs. Thus, we propose the \(\textsf{MSBE}\) algorithm to directly enumerate maximal similar-bicliques, without first enumerating maximal bicliques.

\(\textsf{MSBE}\) follows the general backtracking framework of the Bron–Kerbosch algorithm [5] that enumerates all maximal cliques in a unipartite graph. Our observation is that once the set of L-side vertices \(C_L\) of a similar-biclique C is determined, its R-side vertices can be simply obtained as \(C_R = \bigcap _{u\in C_L} N(u)\). Nevertheless, it is worth pointing out that we cannot ignore \(C_R\) during the enumeration process, since (1) the size of \(C_R\) will be used for pruning and (2) both \(C_L\) and \(C_R\) are needed for determining the maximality of the similar-biclique. \(\textsf{MSBE}\) iteratively builds up a partial solution \(\langle C_L,C_R\rangle \), maintains a candidate set \(P_L\) that is used to grow \(C_L\), and maintains an exclusive set \(Q_L\) that is used for checking the maximality of \(\langle C_L, C_R\rangle \). In each recursion, a vertex \(u \in P_L\) is added to \(C_L\) to grow the solution; after coming back from the recursion, u is moved from \(P_L\) to \(Q_L\) to avoid duplicates. To prune the search space, we propose the concept of dominating: \(u \in P_L\) dominates \(v \in P_L\) if \({\textsf{sim}} (u,v) \ge \varepsilon \) and \(N_{C_R}(u) \supseteq N_{C_R}(v)\), where \(N_{C_R}(u) = N(u)\cap C_R\). We prove that if u dominates v, then we can prune the recursion of adding v to \(C_L\) when u is moved from \(P_L\) to \(Q_L\). Furthermore, according to the definition, (1) each vertex u in a similar-biclique C must have at least \(\tau \) neighbors in C (i.e., \(|N_C(u)| \ge \tau \)), and (2) each L-side vertex \(u \in C_L\) must have at least \(\tau -1\) vertices that are similar to it; we call the vertices that are similar to u the similar neighbors of u, denoted \(\varGamma (u)\). Thus, we propose to first prune all vertices that violate either of these two conditions, in a preprocess referred to as vertex reduction; our empirical studies show that a large portion of the input graph is pruned by vertex reduction.

We observe that a time-critical operation, in both vertex reduction and the recursion of backtracking, is computing \(\varGamma (u)\) which would take \(O(\sum _{v \in N(u)} |N(v)|)\) time, for a vertex \(u \in V_L\). Note that, \(\varGamma (u)\) is not stored in the graph representation, and it is also not affordable to store \(\varGamma (u)\) (either in main memory or on disk) after it is computed as this would require a prohibitively large space. For example, it would take over 400GB on Bibsonomy, one of the graphs tested in our experiments. In view of this, we propose an offline-constructed index to speed up the computation of \(\varGamma (u)\); note that, our index can be used to process maximal similar-biclique enumeration queries with different \(\varepsilon \) and \(\tau \) values. This is based on the fact that for any similarity threshold \(\varepsilon \), \(\varGamma (u)\) is always a subset of \(\varPhi _u = \bigcup _{v\in N(u)} N(v)\), the 2-hop structural neighbors of u. Thus, we propose to first compute the similarity between u and every vertex of \(\varPhi _u\) in an offline process, and then compress them into a few segments which are stored in the index. Specifically, each segment is represented by \({\textsf{seg}} = \langle {\mathsf {v_{\min }}}, {\mathsf {v_{\max }}}, {\mathsf {s_{\max }}}, c\rangle \) where \({\mathsf {v_{\min }}} \le {\mathsf {v_{\max }}} \) are two vertices of \(\varPhi _u\), \(\mathsf {s_{\max }}\) is the largest similarity between u and vertices of \(\varPhi _u\) that are in the range \([{\mathsf {v_{\min }}},{\mathsf {v_{\max }}} ]\), and c is the number of vertices of \(\varPhi _u\) that are in \([{\mathsf {v_{\min }}},{\mathsf {v_{\max }}} ]\); here, the comparison between vertices is based on their ids. To obtain \(\varGamma (u)\), we go through each segment \(\textsf{seg}\) of \(\varPhi _u\) that satisfies \({\textsf{seg}}.{\mathsf {s_{\max }}} \ge \varepsilon \), and compute the similarity between u and each \(v \in [{\textsf{seg}}.{\mathsf {v_{\min }}},{\textsf{seg}}.{\mathsf {v_{\max }}} ]\); note that, segments with \({\mathsf {s_{\max }}} < \varepsilon \) are skipped. Furthermore, we also use the index to speed up vertex reduction by first pruning vertices based on upper bounds of \(\varGamma (u)\), which can be efficiently obtained based on the index without computing similarities.

Our main contributions are summarized as follows:

  • We formulate the concept of similar-biclique, which can be used to detect interesting dense subgraphs from a bipartite graph. To the best of our knowledge, this is the first work investigating structural similarity between vertices in dense bipartite subgraph mining.

  • We develop a backtracking algorithm \(\textsf{MSBE}\) to enumerate all maximal similar-bicliques in a bipartite graph. Vertex reduction and optimization techniques are proposed.

  • We design a novel index structure to facilitate the computation of similar neighbors and propose a two-phase algorithm for efficient vertex reduction based on the index.

  • We propose effective and efficient index construction algorithms by investigating two different strategies. We also parallelize our index construction algorithms by utilizing multiple CPU cores.

  • We propose index maintenance algorithms to handle dynamic graph updates.

  • Extensive empirical studies on 17 real bipartite graphs as well as case studies are conducted to demonstrate the efficiency of our algorithms and the effectiveness of our similar-biclique model. Our algorithm is up to 6 orders of magnitude faster than \(\textsf{ooMBEA}\).

A preliminary version of this work has been published in [52]. Compared with [52], we in this work have proposed a more efficient algorithm \(\textsf{TPA}\) for index construction, parallelized our index construction algorithms, proposed index maintenance algorithms to handle dynamic graph updates, and conducted new experimental studies to evaluate the efficiency and effectiveness of our techniques.

Organization. Sect. 2 reviews the related works. Section 3 provides preliminaries including the definition of similar-biclique and the problem statement. Section 4 introduces a baseline algorithm and our \(\textsf{MSBE}\) algorithm. Section 5 presents our index structure, index-based algorithms, index construction algorithms, as well as parallelization techniques for index construction. Section 6 introduces our index maintenance algorithms. Section 7 reports experimental results. Section 8 concludes the work.

2 Related works

We categorize the related works as follows.

Maximal biclique enumeration. The problem of enumerating all maximal bicliques has been widely studied. The existing studies can be classified into two categories, depending on whether the input graph is bipartite or not. When the input graph is bipartite, the existing algorithms [1, 43, 56] generally enumerate subsets of vertices from one side, and then, the intersection of their neighbors forms the other side of the biclique. Besides, frequent item-set mining techniques have also been utilized to enumerate maximal bicliques [30, 31, 49, 55], as these two problems are highly related to each other. The state-of-the-art algorithm for maximal biclique enumeration over bipartite graphs is \(\textsf{ooMBEA}\) proposed in [8], which is compared in our experiments. There are also studies that aim at enumerating all maximal (non-induced) bicliques from a general graph, i.e., the input graph is not bipartite. For example, it is studied from a theoretical viewpoint in [14], consensus algorithms are proposed in [3], and a divide-and-conquer algorithm is proposed in [34]. However, these algorithms are generally slower than the algorithms that specifically handle bipartite graphs. Moreover, none of the existing studies on maximal biclique enumeration take into account the structural similarity between vertices.

Maximal clique enumeration. The problem of enumerating all maximal cliques in a unipartite graph has also been extensively studied. The existing algorithms generally follow the backtracking framework of Bron and Kerbosch [5] with optimization techniques being proposed in [7, 9, 15, 38, 47]. However, these algorithms cannot handle bipartite graphs.

Dense bipartite subgraph identification. Other models have also been proposed for dense bipartite subgraph identification, such as quasi-biclique [35], k-biplex [54], \((\alpha ,\beta )\)-core [24], k-bitruss [59], and k-wing [44]. Quasi-biclique and k-biplex relax the biclique model by allowing each vertex in one side of the result to miss a certain number of neighbors from the other side. On the other hand, \((\alpha ,\beta )\)-core requires each vertex from one side to be connected to a certain number of vertices from the other side, and k-bitruss and k-wing require each edge to be involved in a certain number of (2, 2)-bicliques. None of these models consider similarity between vertices, and our case study in Sect. 7.3 demonstrates that similar-biclique outperforms k-biplex and \((\alpha ,\beta )\)-core in anomaly detection. Note that it is also possible to integrate similarity constraint into these dense bipartite subgraph models, in a similar way to similar-bicliques. However, this would require thorough studies, from problem hardness analysis to algorithm design, for each model. For example, although \((\alpha ,\beta )\)-core can be computed in polynomial time, computing similar-\((\alpha ,\beta )\)-cores may require an exponential time as the number of maximal similar-\((\alpha ,\beta )\)-cores could be exponential. Thus, we leave these to our future studies.

Structural vertex similarity. Structural vertex similarity refers to similarity measures between pairs of vertices that are computed based solely on the topology of the graph [28]. They are usually categorized as local-topology-based and global-topology-based similarity measures. For the former, the similarity between two vertices is computed based on their neighbors, i.e., locally. Examples include Jaccard similarity [19], cosine similarity [42], Dice’s coefficient [11], hub depressed/promoted index [36], and Adamic–Adar index [2]. For the latter, the global structure information is utilized to derive the similarity between two vertices. Examples include Katz [22], SimRank [20], C-Rank [53], P-Rank [58], and MatchSim [33]. As global-topology-based measures usually need to access the entire graph to compute the similarity, they are computationally more expensive than the local-topology-based measures. Thus, we adopt local-topology-based measures in this work.

3 Preliminary

We consider an unweighted and undirected bipartite graph \(G=(V_L,V_R,E)\), where \(V_L\) and \(V_R\) denote the two disjoint vertex sets (i.e., L-side vertices and R-side vertices) and \(E\subseteq V_L\times V_R\) denotes the edge set. Without loss of generality, we assume that vertices of \(V_L\) take (integer) ids from \(\{1,2,\ldots ,|V_L|\}\), and vertices of \(V_R\) take ids from \(\{1+|V_L|,2+|V_L|,\ldots ,|V_R|+|V_L|\}\). For any vertex \(v\in V_L\) (resp. \(v\in V_R\)), we say it is an L-side vertex (resp. R-side vertex). For any vertex subset \(C \subseteq V_L\cup V_R\), we use \(C_L\) and \(C_R\) to denote the subsets of vertices of C that are from \(V_L\) and \(V_R\), respectively, i.e., \(C_L=C\cap V_L\) and \(C_R=C\cap V_R\). We call the set of neighbors of u in G the structural neighbors of u, denoted \(N_G(u) = \{v\mid (u,v) \in E\}\), and denote the structural degree of u by \(d_G(u) = |N_G(u)|\). Besides structural neighbor and structural degree, we will also define similar neighbor and similar degree based on structural similarity.

Definition 1

(Structural Similarity) Given two vertices u and v in G, their structural similarity is defined as \(\textstyle {\textsf{sim}} (u,v) = \frac{|N_G(u)\cap N_G(v)|}{|N_G(u)\cup N_G(v)|}. \)

The structural similarity \({\textsf{sim}} (u,v)\) is between 0 and 1. It measures the Jaccard similarity between the set of structural neighbors of u and that of v. Given a similarity threshold \(\varepsilon > 0\), we say u and v are similar if \({\textsf{sim}} (u,v)\ge \varepsilon \). The set of vertices that are similar to u is called the similar neighbors of u, denoted \(\varGamma _{G,\varepsilon }(u)\), i.e., \(\varGamma _{G,\varepsilon }(u)=\{v \in V_L\cup V_R \mid {\textsf{sim}} (u,v)\ge \varepsilon \}\). Accordingly, denote \(\delta _{G,\varepsilon }(u) = |\varGamma _{G,\varepsilon }(u)|\) the similar degree of u. Note that as the structural similarity between vertices from different sides is always 0, similar neighbors of vertex u only contain vertices from the same side as u. For presentation simplicity, we call structural similarity simply as similarity and omit the subscripts G and/or \(\varepsilon \) from the notations.

Definition 2

(Similar-Biclique) Given a bipartite graph \(G=(V_L,V_R,E)\) and a similarity threshold \(\varepsilon > 0\), a vertex subset \(C \subseteq V_L\cup V_R\) is a similar-biclique if

  • C is a biclique, i.e., \(C_L\times C_R \subseteq E\), and

  • all vertices from the L-side are similar to each other, i.e., \({\textsf{sim}} (u,v) \ge \varepsilon , \forall u,v \in C_L\).

We also denote C as \(\langle C_L, C_R\rangle \). A similar-biclique is maximal if it is not a subset of any larger similar-biclique.

Note that for presentation simplicity, the similarity constraint is assumed to be considered for the L-side vertices. To apply the similarity constraint for R-side vertices in applications, we can simply swap the roles of \(V_L\) and \(V_R\) in G.

Fig. 2
figure 2

Maximal biclique versus maximal similar-biclique

Example 1

Figure 2a shows a bipartite graph G with \(V_L=\{v_0,\ldots ,v_3\}\) and \(V_R=\{v_4,\ldots ,v_7\}\), in which the subgraph C in the gray area is a maximal biclique. However, \(v_1\) and \(v_2\) in the L-side of C are not similar to each other for \(\varepsilon = 0.6\), i.e., \({\textsf{sim}} (v_1,v_2)=0.5\). Subgraphs \(C_1\) and \(C_2\) as shown in Fig. 2b are two maximal similar-bicliques, in each of which all vertices on the L-side are similar to each other.

Problem 1

(Maximal Similar-Biclique Enumeration) Given a bipartite graph \(G=(V_L,V_R,E)\), a similarity threshold \(\varepsilon > 0\), and a size constraint \(\tau > 0\), we study the problem of enumerating all maximal similar-bicliques C in G that satisfy the size constraint \(\tau \) (i.e., \(|C_L|\ge \tau \) and \(|C_R|\ge \tau \)).

The size constraint \(\tau \) is adopted here to avoid generating too small or too skewed similar-bicliques (i.e., with very few or no vertices in one side). For presentation simplicity, we assume the same size constraint \(\tau \) for both sides. Note that the techniques that we are going to present in this work can be straightforwardly extended to handle different size constraints on the two sides.

Theorem 1

The problem of enumerating all maximal similar-bicliques is #P-complete.

Proof

The #P-completeness of our problem directly follows from the facts that (1) the problem of enumerating all maximal bicliques is #P-complete [23, 26], and (2) it is a special case of our problem, i.e., by setting \(\varepsilon = \frac{1}{|V_R|}\). Note that, for this small \(\varepsilon \), two vertices of \(V_L\) are similar if and only if they have at least one common neighbors in \(V_R\). Thus, every maximal similar-biclique is also a maximal biclique, and vice versa. \(\square \)

Remark about Structural Similarity. In Definition 1, we use the Jaccard similarity to measure the structural similarity since it has been widely used and shown great success in many graph analysis tasks, such as structural graph clustering [6, 48], link prediction [32, 39], and local graph sparsification [45]. Nevertheless, other local-topology-based similarity measures reviewed in Sect. 2, such as cosine similarity \(\frac{|N_G(u)\cap N_G(v)|}{\sqrt{d_G(u)\times d_G(v)}}\) and hub promoted index \(\frac{|N_G(u)\cap N_G(v)|}{\min \{d_G(u),d_G(v)\}}\), can be easily plugged into our model and algorithms. We will point out the changes that need to be made to adopt these measures, when presenting our algorithms.

4 Our algorithms

In this section, we propose an \(\textsf{MSBE}\) algorithm to enumerate all maximal similar-bicliques. Before that, we first in Sect. 4.1 present a baseline algorithm based on the existing maximal biclique enumeration algorithms.

4.1 A baseline algorithm

It is easy to observe that maximal similar-bicliques must be contained in maximal bicliques, since every similar-biclique is also a biclique. Thus, a naive approach is first enumerating all maximal bicliques by invoking one of the existing maximal biclique enumeration algorithms, and then post-processing the detected maximal bicliques to obtain maximal similar-bicliques. Specifically, for each maximal biclique, we extract maximal similar-bicliques by imposing the similarity constraint on L-side vertices, and then eliminate all similar-bicliques that are either duplicates or non-maximal. We omit the details, since our empirical study in Sect. 7 shows that enumerating all maximal bicliques by the state-of-the-art algorithm \(\textsf{ooMBEA}\) [8] is already time consuming for large graphs.

4.2 Our \(\textsf{MSBE}\) algorithm

According to the definition of similar-biclique, if we build a similarity graph \(G_s\) for \(V_L\) where two vertices of \(V_L\) are connected by an edge if their similarity is at least \(\varepsilon \), then for every (maximal) similar-biclique C, its L-side vertices \(C_L\) form a clique in \(G_s\). Moreover, once the L-side vertices \(C_L\) of a maximal similar-biclique are determined, the R-side vertices \(C_R\) can be easily obtained as the set of common neighbors of \(C_L\), i.e., \(C_R = \bigcap _{u\in C_L} N(u)\). In view of this, we propose to adopt the general backtracking framework of the Bron–Kerbosch algorithm [5] to enumerate all maximal similar-bicliques. However, there are two issues. (1) The similarity graph \(G_s\) is not available in the input. (2) The set of L-side vertices \(C_L\) of a maximal similar-biclique C is not necessarily a maximal clique in \(G_s\), though \(C_L\) is a clique in \(G_s\). This is because, a too large \(C_L\) may result in a too small \(C_R\), violating the size constraint \(\tau \) on \(C_R\).

Algorithm 1
figure a

\({\textsf{MSBE}} (G=(V_L, V_R, E), \varepsilon , \tau )\)

We propose techniques to address the above issues, and the pseudocode of our algorithm is shown in Algorithm 1, denoted \(\textsf{MSBE}\). We first prune unpromising vertices by invoking \(\textsf{VReduce}\) (Lines 1–2), which will be introduced shortly in Algorithm 2; a vertex u is pruned if . For each remaining vertex \(u\in V_L\) with , we enumerate all maximal similar-bicliques containing u (Lines 3–10). To do so, we iteratively grow a partial solution \(\langle C_L, C_R\rangle \) where \(C_L\) is initialized as \(\{u\}\) (Line 4). Besides \(C_L\) and \(C_R\), we also maintain \(P_L\) and \(Q_L\) in a similar fashion to the Bron–Kerbosch algorithm [5], such that each vertex of \(P_L\cup Q_L\) is similar to all vertices of \(C_L\). Specifically, \(P_L\) is a set of candidate vertices that are used to grow \(C_L\), and \(Q_L\) is a set of previously considered candidate vertices that are used for checking the maximality of \(\langle C_L, C_R\rangle \). \(P_L\) and \(Q_L\) are initialized at Lines 5–9, after which we invoke the procedure \(\textsf{Enum}\) for enumeration (Line 10); to avoid duplicates, the similar neighbors of u with larger vertex id than u are put in \(P_L\), and those with smaller vertex id are put in \(Q_L\).

In \(\textsf{Enum}\), if the current similar-biclique \(\langle C_L,C_R\rangle \) is maximal, we report it (Lines 11–12); note that \(\langle C_L,C_R\rangle \) is maximal if and only if there is no vertex \(u\in P_L\cup Q_L\) such that \(N(u)\supseteq C_R\). Next, \(\textsf{Enum}\) iteratively adds a vertex of \(P_L\) to \(C_L\), updates the corresponding \(C_R\), \(P_L\) and \(Q_L\), and recursively invokes itself to enumerate more similar-bicliques (Lines 14–17). After processing \(u\in P_L\), we remove u from \(P_L\) and add it to \(Q_L\) (Line 18).

Vertex Reduction. As a similar-biclique needs to have at least \(\tau \) vertices on each side, each vertex u in a similar-biclique C must have at least \(\tau \) structural neighbors in C (i.e., \(|N_C(u)| \ge \tau \)). Furthermore, as all L-side vertices in a similar-biclique C are similar to each other, each L-side vertex u must have at least \(\tau -1\) similar neighbors in C (i.e., \(|\varGamma _C(u)| \ge \tau -1\)). As a result, we can exclude all vertices that violate either of these two conditions from being considered in the enumeration procedure \(\textsf{Enum}\), i.e., mark them as deleted; we call this process as vertex reduction.

Algorithm 2
figure b

\({\textsf{VReduce}} (G, \varepsilon , \tau , {\textsf{del}} (\cdot ))\)

We propose an algorithm \(\textsf{VReduce}\) to conduct vertex reduction, whose pseudocode is shown in Algorithm 2. Firstly, we obtain the structural degree d(u) for each vertex \(u \in V_L\cup V_R\) (Line 1), and obtain the similar degree \(\delta (u)\) for each L-side vertex \(u \in V_L\) (Line 2). Then, as long as there is a non-deleted vertex u with \(d(u)<\tau \) or a non-deleted L-side vertex u with \(\delta (u)<\tau -1\), we mark u as deleted (Line 8), decrease the structure degree of u’s structural neighbors by 1 (Line 4), and also decrease the similar degree of u’s similar neighbors by 1 if u is an L-side vertex (Lines 5–7).

Compute Similar Neighbors \(\varGamma (u)\). One fundamental operation in both Algorithm 1 and Algorithm 2 is computing \(\varGamma (u)\) for an L-side vertex u; note that \(\varGamma (u)\) is not stored with the graph G. A straightforward method to collect \(\varGamma (u)\) is computing \({\textsf{sim}} (u,v)\) for each vertex \(v\in V_L\). The time complexity would be O(|E|), as it needs to visit every edge of G. This is inefficient, by noting that Algorithm 1 and Algorithm 2 need to compute \(\varGamma (u)\) for many vertices u and multiple times.

Algorithm 3
figure c

\({\textsf{SimNei}} (G, u,\varepsilon , {\textsf{del}} (\cdot ))\)

We propose a more efficient algorithm in Algorithm 3, denoted \(\textsf{SimNei}\). Instead of blindly computing \({\textsf{sim}} (u,v)\) for each \(v \in V_L\), we only compute \({\textsf{sim}} (u,v)\) for those v with \({\textsf{sim}} (u,v) > 0\). Our main idea is to first compute the number of common neighbors c(v) between u and v for each 2-hop structural neighbor v of u (Lines 3–4). Then, \({\textsf{sim}} (u,v)\) can be calculated as \(\frac{c(v)}{d(u) + d(v) - c(v)}\) (Line 6).Footnote 1 Note that, in our implementation, to make the time complexity of \(\textsf{SimNei}\) to be independent of \(|V_L|\) which may be large, we only initialize \(c(\cdot )\) once at the beginning of the entire algorithm execution (e.g., in \(\textsf{MSBE}\)), and after using \(c(\cdot )\) at Line 4–6 of Algorithm 3 we reset those updated \(c(\cdot )\) to be 0. In addition, we also collect at Line 4 the set of vertices with non-zero \(c(\cdot )\) into a set S, such that Line 5 as well as the resetting of \(c(\cdot )\) can be conducted in O(|S|) time. As a result, the time complexity of \(\textsf{SimNei}\) is \(O(\sum _{v \in N(u)} d(v))\), which is lower than O(|E|).

Optimizations for \(\textsf{Enum}\). We further propose two optimization techniques to speed up the \(\textsf{Enum}\) procedure. Recall that, an instance of \(\textsf{Enum}\) is represented by \((C_L, C_R, P_L, Q_L)\) where \(C_R = \bigcap _{u\in C_L} N(u)\) and \(|C_R| \ge \tau \),Footnote 2 and aims to enumerate all maximal similar-bicliques \(C^*\) satisfying \(C_L \subset C_L^* \subseteq C_L\cup P_L\). Firstly, an enumeration instance can be terminated once we know that it will not generate any maximal similar-biclique. That is, by including any subset of vertices of \(P_L\) into \(C_L\), the resulting similar-biclique would not be maximal. This is formulated in the lemma below.

Lemma 1

(Early Termination) If there exists a vertex \(u\in Q_L\) such that u is similar to all vertices of \(P_L\) and \(N(u) \supseteq C_R\), then there is no maximal similar-biclique \(C^*\) with \(C_L \subset C^*_L \subseteq C_L\cup P_L\) and thus we can terminate this enumeration instance.

Proof

Suppose there is such a maximal similar-biclique \(C^*\) with \(C_L \subset C_L^* \subseteq C_L\cup P_L\). Then, we must have \(C_R^* \subseteq C_R\) and thus \(N(u)\supseteq C_R \supseteq C_R^*\). Since \(u \in Q_L\) is similar to all vertices of \(P_L\) (and also note that all vertices of \(Q_L\), including u, are similar to all vertices of \(C_L\) according to the construction of \(Q_L\)), u is similar to all vertices of \(C_L^*\). Consequently, \(C^*\cup \{u\}\) is also a similar-biclique, contradicting that \(C^*\) is a maximal similar-biclique. \(\square \)

Secondly, we can reduce the number of instances generated at Line 17 of Algorithm 1, based on the concept of dominating set.

Definition 3

(Dominating Set) Given an instance \((C_L, C_R, P_L, Q_L)\) of \(\textsf{Enum}\), for two distinct vertices \(u,v\in P_L\cup Q_L\), we say that u dominates v if \({\textsf{sim}} (u,v)\ge \varepsilon \) and \(N_{C_R}(u)\supseteq N_{C_R}(v)\), where \(N_{C_R}(u) = N(u)\cap C_R\). The dominating set of u, denoted \({\textsf{DomSet}} (u)\), is the subset of vertices of \(P_L\) that are dominated by u, i.e., \({\textsf{DomSet}} (u)=\{v\in P_L\mid {\textsf{sim}} (u,v)\ge \varepsilon \wedge N_{C_R}(u)\supseteq N_{C_R}(v)\}\).

Note that, a vertex does not dominate itself.

Lemma 2

Given an instance \((C_L, C_R, P_L, Q_L)\) of \(\textsf{Enum}\) and a vertex \(u^*\in P_L\cup Q_L\), any maximal similar-biclique \(C^*\) with \(C_L \subset C^*_L \subseteq C_L\cup P_L\) must contain a vertex of \(P_L\backslash {\textsf{DomSet}} (u^*)\).

Proof

Suppose there is such a maximal similar-biclique \(C^*\) with \(C_L \subset C_L^* \subseteq C_L\cup P_L\) that contains no vertex of \(P_L\backslash {\textsf{DomSet}} (u^*)\). That is, \(C^*_L \backslash C_L \subseteq {\textsf{DomSet}} (u^*)\). Then, it is easy to verify that \(C^*\cup \{u^*\}\) is also a similar-biclique, contradicting the maximality of \(C^*\). \(\square \)

Fig. 3
figure 3

Example of domination

Example 2

Consider the instance in Fig. 3 where \(Q_L = \emptyset \). For \(\varepsilon = 0.6\), \(v_1\in P_L\) is similar to both \(v_2\) and \(v_3\). Moreover, we can see that \(N_{C_R}(v_1) \supseteq N_{C_R}(v_2)\) and \(N_{C_R}(v_1) \supseteq N_{C_R}(v_3)\). Thus, \({\textsf{DomSet}} (v_1) = \{v_2,v_3\}\), and we know that every maximal similar-biclique \(C^*\) with \(C_L \subset C_L^* \subseteq C_L\cup P_L\) must contain \(v_1\) since \(P_L\backslash {\textsf{DomSet}} (v_1)\) \(= \{v_1\}\).

To apply Lemma 2, we revise Line 13 of Algorithm 1 as follows: we first choose a vertex \(u^*\) from \(P_L\cup Q_L\) before the “for loop,” and then replace “\(u\in P_L\)” with “\(u\in P_L \backslash {\textsf{DomSet}} (u^*)\)” in the “for loop” statement. This means that we do not generate enumeration instances, at Line 17 of Algorithm 1, for vertices \(u \in {\textsf{DomSet}} (u^*)\). To maximize the benefit of Lemma 2, \(u^*\) is chosen as the one that minimizes \(|P_L \backslash {\textsf{DomSet}} (u^*)|\) among all vertices of \(P_L \cup Q_L\). If \(u^*\) dominates a substantial number of vertices in \(P_L\) (i.e., when the cardinality of \({\textsf{DomSet}} (u^*)\) is large), applying Lemma 2 can significantly reduce the overall running time.

Theorem 2

The time complexity of Algorithm 1 is \( {{\mathcal {O}}}(|V_L|\cdot |E|\cdot 2^{|V_L|})\).

Proof

We first prove that the time complexity of \(\textsf{VReduce}\) (i.e., Algorithm 2) is \( {{\mathcal {O}}}(\sum _{v \in V_R} (d(v))^2)\). In Algorithm 2, Line 1 runs in \( {{\mathcal {O}}}(|E|)\) time. Line 2 runs in \( {{\mathcal {O}}}(\sum _{v\in V_R} (d(v))^2)\) time, since computing \(\varGamma (u)\) by Algorithm 3 takes time \( {{\mathcal {O}}}(\sum _{v\in N(u)} d(v))\) as discussed above, by noting that

$$\begin{aligned}&\sum _{u \in V_L} \sum _{v \in N(u)} d(v) = \sum _{(u,v) \in E} d(v) \\ =&\sum _{v \in V_R} \sum _{u \in N(v)} d(v) = \sum _{v \in V_R} (d(v))^2 \end{aligned}$$

Since each vertex \(u \in V_L\cup V_R\) is removed at most once at Lines 4–8, the total cost of running Line 4 for all deleted vertices is \( {{\mathcal {O}}}(|E|)\), and the total cost of running Lines 6–7 for all deleted vertices is \( {{\mathcal {O}}}(\sum _{v\in V_R} (d(v))^2)\), the same as that of Line 2. In addition, we use a queue to store the vertices that should be deleted (i.e., satisfying the conditions at Line 3) such that finding a vertex at Line 3 takes constant time. Thus, the time complexity of \(\textsf{VReduce}\) is \( {{\mathcal {O}}}(\sum _{v\in V_R} (d(v))^2)\).

Now, we prove that the time complexity of \(\textsf{MSBE}\) (Algorithm 1) is \( {{\mathcal {O}}}(|V_L|\cdot |E|\cdot 2^{|V_L|})\). Firstly, it is easy to see that the time complexity of Algorithm 1 excluding Line 10 (i.e., the recursion) is \( {{\mathcal {O}}}(\sum _{v\in V_R} (d(v))^2)\), by following the analysis of the time complexity of \(\textsf{VReduce}\). Secondly, invoking \(\textsf{Enum}\) with input \((C_L,C_R,P_L,Q_L)\) takes time \( {{\mathcal {O}}}(|E|\cdot 2^{|P_L|})\), since the recursion builds a complete binary search tree with each instance \((C_L,C_R\), \(P_L,Q_L)\) has two children: one including u into \(C_L\), one excluding u from \(C_L\). The time for generating the first child (Lines 14–17) is \( {{\mathcal {O}}}(\sum _{v\in N(u)} d(v)) \subseteq {{\mathcal {O}}}(|E|)\), and the time for generating the second child is \( {{\mathcal {O}}}(1)\) (Line 18). In addition, the total number of leaf instances is \(2^{P_L}\). Thus, each invocation to \(\textsf{Enum}\) at Line 10 of Algorithm 1 takes time \( {{\mathcal {O}}}(|E|\cdot 2^{|P_L|}) \subseteq {{\mathcal {O}}}(|E|\cdot 2^{|V_L|})\), and the total time complexity follows. \(\square \)

Discussions. \(\textsf{MSBE}\) is different from both maximal clique enumeration algorithms for unipartite graphs and maximal biclique enumeration algorithms for bipartite graphs, as follows. Firstly, \(\textsf{MSBE}\) needs to compute the similar neighbors for L-side vertices, which are not required by any of the existing algorithms. Secondly, compared with maximal clique enumeration algorithms, \(\textsf{MSBE}\) needs to consider common structural neighbors \(C_R\) of \(C_L\), in addition to common similar neighbors. Thirdly, our optimization techniques for \(\textsf{Enum}\) are also different from the existing ones.

In \(\textsf{MSBE}\), we need to obtain the similar neighbors \(\varGamma (\cdot )\) of an L-side vertex multiple times, e.g., at Lines 6 and 15 of Algorithm 1 and Lines 2 and 6 of Algorithm 2. We can either invoke \(\textsf{SimNei}\) to compute \(\varGamma (u)\) every time when it is needed or store \(\varGamma (u)\) in main memory after it is computed for the first time and then directly retrieve it for all subsequent requests. We use \(\textsf{MSBE}\) to denote the algorithm that uses the first strategy, and \(\mathsf {mat\text {-}MSBE}\) the algorithm that uses the second strategy. (Here, \(\textsf{mat}\) stands for materialization.)

5 Speeding up similar neighbor computation and vertex reduction

\(\textsf{MSBE}\) has the disadvantage of repeatedly computing the similar neighbors from scratch which is time consuming, while \(\mathsf {mat\text {-}MSBE}\) may demand an extremely large main memory to store the similar neighbors. For example, it would take more than 400GB main memory for the graph \(\textsf{bibsonomy}\) used in our experiments even for a moderate \(\varepsilon = 0.5\). In this section, we propose an offline-constructed index to speed up the computation of \(\varGamma (u)\) as well as vertex reduction. We give an overview of the index structure in Sect. 5.1, present our index-based algorithms in Sect. 5.2, discuss index construction in Sect. 5.3, and finally parallelize our index construction algorithm in Sect. 5.4.

5.1 Overview of index structure

Let \(\varPhi _u\) be the set of 2-hop structural neighbors of u, i.e., \(\varPhi _u = \bigcup _{v \in N(u)} N(v)\). Firstly, we have the following lemma.

Lemma 3

For any similarity threshold \(\varepsilon >0\), the set of similar neighbors of u is a subset of \(\varPhi _u\), i.e., \(\varGamma _\varepsilon (u)\subseteq \varPhi _u\).

Proof

The correctness of the lemma directly follows from the fact that any vertex \(v \notin \varPhi _u \cup \{u\}\) has no common neighbor with u and thus \({\textsf{sim}} (u,v) = 0\). \(\square \)

Based on Lemma 3, one possible indexing strategy is pre-computing and storing \({\textsf{sim}} (u,v)\) for each \(u \in V_L\cup V_R\) and each \(v \in \varPhi _u\).Footnote 3 However, the space complexity of this strategy would be \(O(|V_L|^2 + |V_R|^2)\), which is prohibitively high even for moderate-sized graphs since the space requirement is essentially the same as the case of \(\mathsf {mat\text {-}MSBE}\) when \(\varepsilon \) is very small. For example, even for a moderate-sized graph with \(10^6\) vertices, the storage space would be over 2TB.

Instead of storing \(\varPhi _u\) and the structural similarities in their raw format, we summarize them into a few segments.

Definition 4

(Segment) A segment, denoted \({\textsf{seg}} \), of \(\varPhi _u\) is a four-tuple \(\langle {\mathsf {v_{\min }}}, {\mathsf {v_{\max }}}, {\mathsf {s_{\max }}}, c\rangle \), where \({\mathsf {v_{\min }}} \le {\mathsf {v_{\max }}} \) are two vertices of \(\varPhi _u\),

$$\begin{aligned} {\mathsf {s_{\max }}} = \max _{v \in \varPhi _u: {\mathsf {v_{\min }}} \le v \le {\mathsf {v_{\max }}}} {\textsf{sim}} (u,v) \end{aligned}$$

and \(c = |\{v \in \varPhi _u \mid {\mathsf {v_{\min }}} \le v \le {\mathsf {v_{\max }}} \}|\). Here, vertex comparison is based on vertex id.

Given a segment \({\textsf{seg}} = \langle {\mathsf {v_{\min }}}, {\mathsf {v_{\max }}}, {\mathsf {s_{\max }}}, c\rangle \) of \(\varPhi _u\), we use \(V({\textsf{seg}})\) to denote \(\{v \in \varPhi _u \mid {\mathsf {v_{\min }}} \le v \le {\mathsf {v_{\max }}} \}\). It is immediate from the definition that \(c = |V({\textsf{seg}})|\) and

  • \({\mathsf {v_{\min }}} \) (resp. \({\mathsf {v_{\max }}} \)) is the smallest (resp. largest) vertex id in \(V({\textsf{seg}})\);

  • \({\mathsf {s_{\max }}} \) is the largest similarity between u and a vertex of \(V({\textsf{seg}})\), and thus \({\mathsf {s_{\max }}} \) provides an upper bound of \({\textsf{sim}} (u,v)\) for all \(v \in V({\textsf{seg}})\).

Thus, we say that \({\textsf{seg}} \) covers vertices \(V({\textsf{seg}})\). A set of segments \(\mathbb {S}_u = \{{\textsf{seg}} _1,\ldots ,{\textsf{seg}} _k\}\) covers \(\varPhi _u\) if

$$\begin{aligned} \bigcup _{{\textsf{seg}} \in \mathbb {S}_u} V({\textsf{seg}}) = \varPhi _u \end{aligned}$$

In this work, we only consider disjoint segments, i.e., \(V({\textsf{seg}} _i) \cap V({\textsf{seg}} _j) = \emptyset \) for \(i \ne j\). Our index structure, denoted \(\mathcal {I}\), covers \(\varPhi _u\) by a set of segments, for all \(u \in V_L\cup V_R\). That is, \(\mathcal {I}\) consists of \(\mathbb {S}_u\) such that \(\mathbb {S}_u\) covers \(\varPhi _u\), for all \(u \in V_L\cup V_R\).

Fig. 4
figure 4

Overview of index structure

Example 3

Figure 4 shows the 2-hop structural neighbors \(\varPhi _u\) of u, which are sorted in increasing order regarding vertex id. The decimal below each vertex is the similarity w.r.t. u. \(\varPhi _u\) is covered by three segments \({\textsf{seg}} _1\), \({\textsf{seg}} _2\), \({\textsf{seg}} _3\). Take \({\textsf{seg}} _1\) as an example, the two numbers in the first row (i.e., 1 and 3) represent \({\mathsf {v_{\min }}} \) and \({\mathsf {v_{\max }}} \), and the two numbers in the second row (i.e., 0.5 and 3) represent \({\mathsf {s_{\max }}} \) and c.

5.2 Index-based algorithms

In this subsection, we present index-based algorithms for similar neighbor computation and for vertex reduction.

Algorithm 4
figure d

\({\textsf{indexedSN}} (u,\varepsilon ,G,\mathcal {I}, {\textsf{del}} (\cdot ))\)

Index-based Similar Neighbor Computation. The pseudocode of using the index \(\mathcal {I}\) to efficiently obtain the similar neighbors \(\varGamma (u)\) for a vertex u is shown in Algorithm 4, denoted \(\textsf{indexedSN}\). We go through each segment \({\textsf{seg}} \in \mathbb {S}_u\) with \({\textsf{seg}}.{\mathsf {s_{\max }}} \ge \varepsilon \) (Line 2), and compute \({\textsf{sim}} (u,v)\) for each \(v \in [{\textsf{seg}}.{\mathsf {v_{\min }}}, {\textsf{seg}}.{\mathsf {v_{\max }}} ]\) (Line 3); recall that (1) \({\textsf{seg}}.{\mathsf {s_{\max }}} \) upper bounds \({\textsf{sim}} (u,v)\) for each \(v \in V({\textsf{seg}})\), and (2) \(V({\textsf{seg}})\) is not stored in the index structure \(\mathcal {I}\). As computing \({\textsf{sim}} (u,v)\) needs to intersect two sets N(u) and N(v) which is costly, we propose to first apply a filtering for the pair u and v based on an upper bound \({\textsf{ub}} (u,v)\) of \({\textsf{sim}} (u,v)\) (Line 5); if \({\textsf{ub}} (u,v) < \varepsilon \), then we have \({\textsf{sim}} (u,v) \le {\textsf{ub}} (u,v) < \varepsilon \) and thus \(v \notin \varGamma (u)\). For the similarity in Definition 1, it is easy to verify that \({\textsf{sim}} (u,v) \le \frac{\min \{d(u),d(v)\}}{\max \{d(u),d(v)\}}\); we set this as \({\textsf{ub}} (u,v)\), which can be calculated in constant time.Footnote 4\(\textsf{indexedSN}\) is expected to run faster than \(\textsf{SimNei}\) (Algorithm 3) as the former can skip an entire segment if its \({\mathsf {s_{\max }}} \) is smaller than \(\varepsilon \).

Index-based Two-Phase Vertex Reduction. Based on \(\textsf{indexedSN}\), we can speed up \(\textsf{VReduce}\) (Algorithm 2) by invoking \(\textsf{indexedSN}\) to compute \(\varGamma (u)\). However, this is still inefficient, as \(\textsf{VReduce}\) needs to compute \(\varGamma (u)\) for all \(u \in V_L\) (see Line 2 of Algorithm 2). We propose to utilize the index \(\mathcal {I}\) to first obtain an upper bound of the similar degree for vertex reduction, as proved in the lemma below.

Lemma 4

(Upper Bound of Similar Degree) Let \(\mathbb {S}_u\) be the set of segments that cover \(\varPhi _u\). Then, the similar degree \(\delta _\varepsilon (u)\) of u is upper bounded by

$$\begin{aligned} \sum _{{\textsf{seg}} \in \mathbb {S}_u:\,{\textsf{seg}}.{\mathsf {s_{\max }}} \ge \varepsilon } {\textsf{seg}}.c \end{aligned}$$

Proof

This lemma directly follows from the fact that \({\textsf{sim}} (u,v) < \varepsilon \) for all \(v \in \bigcup _{{\textsf{seg}} \in \mathbb {S}_u:\,{\textsf{seg}}.{\mathsf {s_{\max }}} < \varepsilon } V({\textsf{seg}})\). \(\square \)

Consider the part of the index in Fig. 4 and suppose \(\varepsilon = 0.4\). By scanning \(\mathbb {S}_u\), we obtain an upper bound of u’s similar degree as 6, i.e., \({\textsf{seg}} _1.c+{\textsf{seg}} _3.c=6\); \({\textsf{seg}} _2\) is omitted since its \({\mathsf {s_{\max }}} \) is only 0.2.

Algorithm 5
figure e

\({\textsf{indexedVR}} (G, \mathcal {I}, \varepsilon , \tau , {\textsf{del}} (\cdot ))\)

Furthermore, we also observe that the structural degree can be obtained efficiently. Thus, we propose a two-phase approach for vertex reduction, which first conducts vertex reduction by using structural degree and upper bound of similar degree in Phase-I, and then using structural degree and similar degree in Phase-II. The pseudocode of our two-phase vertex reduction is shown in Algorithm 5, denoted \(\textsf{indexedVR}\). In Phase-I, we first obtain the structural degree d(u) for each \(u \in V_L\cup V_R\) (Line 1), and an upper bound \(\bar{\delta }(u)\) of the similar degree for each vertex \(u\in V_L\) (Line 2). Then, as long as there is a non-deleted vertex \(u\in V_L\cup V_R\) satisfying \(d(u)<\tau \) or a non-deleted vertex \(u\in V_L\) satisfying \(\bar{\delta }(u)<\tau -1\) (Line 3), we mark u as deleted and update the structural degree of its structural neighbors (Lines 4–5); note that, we do not update \(\bar{\delta }(\cdot )\) in Phase-I. In Phase-II, we first compute a progressive similar degree, denoted \(\delta _p(\cdot )\), for each non-deleted L-side vertex, by invoking \(\textsf{progressiveSN}\) (Lines 7–8). Here, \(\delta _p(u)\) is a lower bound of u’s similar degree \(\delta (u)\), and it records the number of similar neighbors that have been computed for u; our computation of \(\delta _p(u)\) ensures that \(\delta _p(u) \ge \tau -1\) if and only if \(\delta (u) \ge \tau -1\). Then, as long as there is a non-deleted vertex \(u\in V_L\cup V_R\) satisfying \(d(u)<\tau \) or a non-deleted vertex \(u\in V_L\) satisfying \(\delta _p(u) < \tau -1\), we mark u as deleted (Line 18) and update the structural degree of its structural neighbors (Line 10). Furthermore, if u is an L-side vertex, we also obtain the set \(\varGamma (u)\) of similar neighbors of u (Line 12) and update the progressive similar degree \(\delta _p(v)\) to satisfy the invariant that \(\delta _p(v) \ge \tau -1\) if and only if \(\delta (v)\ge \tau -1\) for each \(v \in \varGamma (u)\) (Lines 13–17). Note that, in our implementation, we use a queue to maintain the vertices that satisfy the condition at Line 3 or Line 9; as a result, we do not need to loop through all non-deleted vertices to find the unpromising vertices.

In Algorithm 5, for an L-side vertex u, we compute \(\delta _p(u)\) instead of \(\delta (u)\). Our main motivation is that for an L-side vertex u satisfying \(d(u) \ge \tau \), we only need to compute \(\tau -1\) of its similar neighbors to certify that it is a promising vertex. That is, we stop the computation of \(\varGamma (u)\) once \(\delta _p(u) \ge \tau -1\); however, if some similar neighbors of u are later removed (i.e., marked as deleted), then we need to update \(\delta _p(u)\) by computing more similar neighbors of u (Lines 15–17 of Algorithm 5). As a result, for vertices with high similar degrees in the remaining graph (i.e., obtained by removing all unpromising vertices), we only need to compute a small portion of their similar neighbors to prevent them from being removed and thus save unnecessary similar neighbor computations. The pseudocode of computing \(\delta _p(u)\) is shown in Lines 19–26 of Algorithm 5, denoted \(\textsf{progressiveSN}\). It is invoked only when \(\delta _p(u) < \tau -1\) and there are still unchecked segments of \(\varPhi _u\). In \(\textsf{progressiveSN}\), we check the segments of \(\mathbb {S}_u\) one by one (Line 20–24) and stop once we have found enough similar neighbors for u (Line 25). We record the index of the first unchecked segment in \({\textsf{idx}} (u)\) (Line 8). Note that in Algorithm 5, a copy of \({\textsf{del}} (\cdot )\) is stored in \({\textsf{del2}} (\cdot )\) at Line 6. The usage of \({\textsf{del2}} (\cdot )\) is specific to the \(\textsf{progressiveSN}\) procedure. This distinction of \({\textsf{del}} (\cdot )\) and \({\textsf{del2}} (\cdot )\) is necessary because for each L-side vertex v, we compute its similar neighbors and thus similar degree \(\delta _p(v)\) progressively (Lines 8 and 16). When we remove an unqualified vertex u at Lines 10–18, we also decrement \(\delta _p(v)\) by one for each similar neighbor v of u (Line 14). Consequently, it is possible that u has not yet been considered in \(\delta _p(v)\) when u is removed, but \(\delta _p(v)\) is decremented by one due to the removal of u. As a remedy, \(\textsf{progressiveSN}\) considers all vertices that are alive when entering the while loop of Line 9, i.e., all vertices with \({\textsf{del2}} (\cdot )\) being \({\textsf{false}} \).

\(\textsf{indexedVR}\) is better than \(\textsf{VReduce}\) (Algorithm 2), since (1) Phase-I of \(\textsf{indexedVR}\) is lightweight but very effective at pruning vertices as demonstrated by our empirical studies, and (2) \(\textsf{indexedVR}\) uses \(\textsf{indexedSN}\) and \(\textsf{progressiveSN}\) to compute the similar neighbors.

Overall Algorithm. Our index-based \(\textsf{MSBE}\) improves on Algorithm 1 by replacing the invocation to \(\textsf{VReduce}\) at Line 2 with invoking \(\textsf{indexedVR}\) for vertex reduction, and invoking \(\textsf{indexedSN}\) to compute \(\varGamma (u)\) at Lines 6 and 15. Nevertheless, the time complexity of index-based \(\textsf{MSBE}\) remains \( {{\mathcal {O}}}(|V_L|\cdot |E|\cdot 2^{|V_L|})\) as proved in Theorem 2, by noting that the time complexity of \(\textsf{indexedSN}\) remains \( {{\mathcal {O}}}(|E|)\). Despite of having the same time complexity, our empirical studies in Sect. 7 show that the index-based approach can improve the efficiency of \(\textsf{MSBE}\) by several orders of magnitude.

5.3 Index construction

In this subsection, we present two algorithms to construct the index based on the ideas of largest gap and steady segment, respectively. Note that, the indexes are constructed offline, and once constructed, they can be used to process maximal similar-biclique enumeration queries with different \(\varepsilon \) and \(\tau \) values.

Largest Gap (LG) Index. Recall that, our index structure summarizes a subset of vertices of \(\varPhi _u\) and their similarities to a vertex u by four numbers \({\textsf{seg}} = \langle {\mathsf {v_{\min }}}, {\mathsf {v_{\max }}}, {\mathsf {s_{\max }}}, c\rangle \), where \({\mathsf {s_{\max }}} \) is an upper bound of the similarity between u and each \(v \in \varPhi _u\) such that \({\mathsf {v_{\min }}} \le v \le {\mathsf {v_{\max }}} \). To obtain the similar neighbors of u that are in the range \([{\mathsf {v_{\min }}},{\mathsf {v_{\max }}} ]\), we need to go through each vertex \(v \in [{\mathsf {v_{\min }}},{\mathsf {v_{\max }}} ]\) and test its similarity with u (e.g., see Line 3 of Algorithm 4) even if \(v \notin V({\textsf{seg}})\). We call a vertex v that is in the range \([{\mathsf {v_{\min }}},{\mathsf {v_{\max }}} ]\) but not in \(V({\textsf{seg}})\) a fake vertex.

Intuitively, we should minimize the number of fake vertices when constructing the index. We call the index built by this strategy the largest gap (LG) index. It is constructed as follows. Suppose we are going to cover \(\varPhi _u\) by k segments. This is equivalent to find \(k-1\) cut points in the sequence of vertices of \(\varPhi _u\) that are ordered in increasing vertex id order; denote the sequence of vertices as \(\{v_1,v_2,\ldots ,v_{|\varPhi _u|}\}\). Note that, the vertex ids are not consecutive, i.e., it is possible that \(v_2 - v_1 > 1\). We represent the \(k-1\) cut points by \(k-1\) index values \(\{\ell _1,\cdots ,\ell _{k-1}\}\) such that \(1< \ell _1< \ell _2< \cdots< \ell _{k-1} < |\varPhi _u|\), i.e., the i-th cut point is between vertex \(v_{\ell _i-1}\) and vertex \(v_{\ell _i}\). Define \(\ell _0 = 1\) and \(\ell _k = |\varPhi _u|+1\). Then, segment \({\textsf{seg}} _i\) covers vertices \(\{v_{\ell _{i-1}},\ldots ,v_{\ell _i-1}\}\), for \(1 \le i \le k\). Let \(f_1\) be the number of fake vertices if we cover \(\varPhi _u\) by only one segment, i.e., \(f_1 = v_{|\varPhi _u|}-v_1 + 1 - |\varPhi _u|\) where \(v_{|\varPhi _u|}-v_1 + 1\) is the total number of vertices in the range \([v_1,v_{|\varPhi _u|}]\). It is easy to verify that the number of fake vertices of covering \(\varPhi _u\) by k segments with the \(k-1\) cut points \(\{\ell _1,\ldots ,\ell _{k-1}\}\) is \(f_1 - \sum _{i=1}^{k-1} (v_{\ell _i} - v_{\ell _i-1} + 1)\). As \(f_1\) is a fixed number for \(\varPhi _u\), minimizing the number of fake vertices is equivalent to maximizing \(\sum _{i=1}^{k-1} (v_{\ell _i} - v_{\ell _i-1} + 1)\).

Definition 5

(Gap) Given the sequence of vertices \(\{v_1,v_2,\ldots ,v_{|\varPhi _u|}\}\) of \(\varPhi _u\) that are ordered in increasing vertex id order, the gap of vertex \(v_i\) for \(i > 1\) is defined as \(v_i-v_{i-1}+1\); the gap of \(v_1\) is defined as 0.

Thus, the LG index constructs k segments to cover \(\varPhi _u\), where the \(k-1\) cut points are the \(k-1\) vertices with the largest gaps.

Fig. 5
figure 5

Example of \(\textsf{consLG}\)

Example 4

Figure 5 shows that the three segments constructed by the largest gap strategy for the same \(\varPhi _u\) as Example 3. The two vertices with the largest gap are \(v_{100}\) and \(v_{50}\).

Choosing the number of segments to cover \(\varPhi _u\). It is easy to see that the more the number of segments, the fewer the number of fake vertices introduced by the segments. In the extreme case of covering \(\varPhi _u\) by \(|\varPhi _u|\) segments, there will be no fake vertices introduced. However, the space complexity would be too high to be practical as discussed in Sect. 5.1. Thus, we set the number of segments for covering \(\varPhi _u\) as \(\alpha \cdot \log |\varPhi _u|\) where \(\alpha \) is a user defined parameter, in viewing that a fixed number for different \(\varPhi _u\) will not work as \(|\varPhi _u|\) varies a lot across different vertices u.

Steady Segment (SS) Index. The LG index ignores the similarities (between u and different vertices) in a segment and thus may result in a very wide range of similarity values for a segment. This is not good for \(\textsf{indexedSN}\) and \(\textsf{progressiveSN}\), as they need to check all vertices covered by a segment \({\textsf{seg}} \) even if there is only one vertex in \({\textsf{seg}} \) whose similarity to u is no lower than \(\varepsilon \). Motivated by this, we aim to construct steady segments such that all similarities in a segment are close to each other.

Definition 6

(Steady Segment) Given a steady threshold \(0< \gamma < 1\), a segment \({\textsf{seg}} = \langle {\mathsf {v_{\min }}},{\mathsf {v_{\max }}},{\mathsf {s_{\max }}},c\rangle \) of \(\varPhi _u\) is steady if

$$\begin{aligned} \max _{v \in V({\textsf{seg}})} {\textsf{sim}} (u,v) - \min _{v\in V({\textsf{seg}})} {\textsf{sim}} (u,v) \le \gamma \end{aligned}$$

The first term, \(\max _{v \in V({\textsf{seg}})} {\textsf{sim}} (u,v)\), is exactly \({\textsf{seg}}.{\mathsf {s_{\max }}} \). For ease of presentation, we denote the second term, \(\min _{v\in V({\textsf{seg}})} {\textsf{sim}} (u,v)\), by \({\textsf{seg}}.{\mathsf {s_{\min }}} \), the smallest similarity value. A segment \({\textsf{seg}} \) is steady if \({\textsf{seg}}.{\mathsf {s_{\max }}}- {\textsf{seg}}.{\mathsf {s_{\min }}} \le \gamma \). The main advantage of a steady segment is that if \({\textsf{seg}} \) is steady and satisfies \({\textsf{seg}}.{\mathsf {s_{\max }}} \ge \varepsilon \), then it is likely that many vertices of \(V({\textsf{seg}})\) have similarity values to u no lower than \(\varepsilon \), and thus, most of the computation will not be wasted.

Ideally, we would like to find the minimum number of steady segments to cover \(\varPhi _u\). However, the number of steady segments required could be very large. For example, if the steady threshold \(\gamma \) is very close to 0 and all vertices of \(\varPhi _u\) have different similarity values to u, then the number of required steady segments to cover \(\varPhi _u\) is \(|\varPhi _u|\). Thus, we instead construct a fixed number of steady segments to cover as many vertices of \(\varPhi _u\) as possible and then cover the remaining uncovered vertices of \(\varPhi _u\) by as few segments as possible by ignoring the difference between the similarity values.

Given \(\gamma \) and k, our problem is to find k steady segments to cover as many vertices of \(\varPhi _u\) as possible. We first construct, for each vertex \(v \in \varPhi _u\), a maximal steady segment \({\textsf{seg}} _v\) that starts at v (i.e., \({\textsf{seg}} _v.{\mathsf {v_{\min }}} = v\)), and then select k segments \(\mathbb {S}^*\) from \(\{{\textsf{seg}} _v \mid v\in \varPhi _u\}\) such that \(\left|\bigcup _{v \in \mathbb {S}^*} V({\textsf{seg}} _v)\right|\) is maximized. This is an instance of the maximum k-coverage problem which is NP-hard [40]. We select the k segments in a greedy manner. That is, the k segments are selected one-by-one. Let S be the starting vertices of the currently selected segments. Then, the next segment to be added to S is \(\textstyle \arg \max _{v \in \varPhi _u} \left|\bigcup _{v' \in S\cup \{v\}} V({\textsf{seg}} _{v'}) \right|. \) As this function is submodular, the greedy approach achieves an approximation ratio of \(1-\frac{1}{e}\) [17].

Algorithm 6
figure f

\({\textsf{consSS}} (G=(V_L,V_R,E), \alpha , \gamma )\)

The pseudocode is shown in Algorithm 6, denoted \(\textsf{consSS}\). For each vertex u, we first compute its 2-hop structural neighbors \(\varPhi _u\) and their similarities to u (Line 2) and then invoke \(\textsf{MaximalSteadySegments}\) to compute maximal steady segments that start at each vertex \(v_i \in \varPhi _u\) (Line 3). In \(\textsf{MaximalSteadySegments}\), the maximal steady segment \({\textsf{seg}} _{v_i}\) that starts at \(v_i\) is computed by iteratively trying to add the next vertex to the segment (Lines 19–27). Next, we iteratively add to \(\mathbb {S}_u\) the segment of \(\mathbb {C}\) that covers the largest number of uncovered vertices of \(\varPhi _u\) (Lines 4–12); as a result, after adding a segment into \(\mathbb {S}_u\), we also need to update the remaining segments of \(\mathbb {C}\) to be disjoint from the segments of \(\mathbb {S}_u\) (Lines 8–12). During this process, for time efficiency consideration, we do not maintain \({\textsf{seg}}.{\mathsf {s_{\max }}} \); instead, we compute \({\textsf{seg}}.{\mathsf {s_{\max }}} \) for each segment \({\textsf{seg}} \in \mathbb {S}_u\) later (Line 13). Finally, we create the minimum number of segments to cover all vertices of \(\varPhi _u\) that are not covered by \(\mathbb {S}_u\) (Lines 14–15).

Fig. 6
figure 6

Example of \(\textsf{consSS}\)

Example 5

Figure 6 shows the three steady segments constructed for the same \(\varPhi _u\) in Examples 3 and 4, where \(\gamma =0.1\).

Time complexity of \(\textsf{consSS}\). For each vertex \(u \in V_L\cup V_R\), Line 2 of Algorithm 6 takes \(O(\sum _{v \in N(u)} d(v))\) time, and Line 3 takes \(O(|\varPhi _u|^2)\) time. The while loop at Line 5 runs for at most \(\alpha \cdot \log |\varPhi _u|\) iterations, and each iteration takes \(O(|\varPhi _u|)\) time. Lines 13–15 take \(O(|\varPhi _u|)\) time. Thus, the total time complexity of \(\textsf{consSS}\) is

$$\begin{aligned} O\Big (\sum _{u\in V_L\cup V_R} \big (|\varPhi _u|^2 + \sum _{v\in N(u)} d(v) \big )\Big ) \end{aligned}$$

Computing all maximal steady segments in linear time. From the above time complexity analysis of \(\textsf{consSS}\), we can see that the time complexity of computing the initial maximal steady segments for all vertices dominates the total time complexity of \(\textsf{consSS}\). In view of this, we propose an efficient algorithm to compute the maximal steady segments for all vertices in linear time, i.e., \(O(|\varPhi _u|)\). The general idea is based on the following lemma.

Lemma 5

Let \(\{v_1,v_2,\ldots ,v_{|\varPhi _u|}\}\) be the vertices of \(\varPhi _u\) sorted in increasing vertex id order. For any \(1 \le i < |\varPhi _u|\), the largest vertex id in the maximal steady segment of \(v_{i+1}\) is no less than that of \(v_i\), i.e., \({\textsf{seg}} _{v_{i+1}}.{\mathsf {v_{\max }}} \ge {\textsf{seg}} _{v_i}.{\mathsf {v_{\max }}} \).

Proof

This is easy to see since the segment covering vertices from \(v_{i+1}\) to \({\textsf{seg}} _{v_i}.{\mathsf {v_{\max }}} \), which is a subset of \(V({\textsf{seg}} _{v_i})\), must be steady. \(\square \)

Fig. 7
figure 7

Computing maximal steady segments (\(\gamma =0.3\))

Following Lemma 5, we use two pointers, i and j, to compute the maximal steady segments for all vertices of \(\varPhi _u\). Here, i is the index of the start vertex and j is 1 plus the index of \({\textsf{seg}} _{v_i}.{\mathsf {v_{\max }}} \) in \(\varPhi _u\). We will increase i from 1 to \(|\varPhi _u|\); according to Lemma 5, j will not decrease. For initialization, both i and j are set as 1. To compute \({\textsf{seg}} _{v_i}\), we iteratively increase j by 1 until either \(j = |\varPhi _u|+1\) or the segment covering vertices \(\{v_i,\ldots ,v_j\}\) is not steady; when this process finishes, we set \({\textsf{seg}} _{v_i}.{\mathsf {v_{\max }}} = v_{j-1}\). After computing \({\textsf{seg}} _{v_i}\), we continue to compute the maximal steady segment for the next vertex by increasing i by 1. Note that we do not reset j after increasing i; this is correct according to Lemma 5. For example, for \(\varPhi _u\) in Fig. 7, when \(i=1\), we have \(j=3\) and thus \({\textsf{seg}} _{v_1}.{\mathsf {v_{\max }}} = v_{j-1} = v_2\); when \(i=2\), we have \(j=5\) and thus \({\textsf{seg}} _{v_2}.{\mathsf {v_{\max }}} = v_4\). Since both i and j are non-decreasing, the time complexity would be \(c\cdot |\varPhi _u|\) where c is the time complexity of checking whether the segment covering a given set of consecutive vertices is steady or not.

To check whether the segment covering a given set S of consecutive vertices of \(\varPhi _u\) is steady or not, all we need to obtain are \(\min _{v \in S} {\textsf{sim}} (u,v)\) and \(\max _{v \in S} {\textsf{sim}} (u,v)\). Since S consists of consecutive vertices of \(\varPhi _u\), we could use some advanced data structures (e.g., segment tree or the one presented in our conference version [52]) to obtain \(\min _{v \in S} {\textsf{sim}} (u,v)\) and \(\max _{v \in S} {\textsf{sim}} (u,v)\) in \(\log |\varPhi _u|\) time. In this work, we propose to use a simple data structure (specifically, a queue) to achieve this in amortized constant time. We focus our discussion on how to obtain \(\min _{v \in S} {\textsf{sim}} (u,v)\); note that \(\max _{v \in S} {\textsf{sim}} (u,v)\) can be obtained in a similar way. Our data structure stores vertices of S that are not min-dominated by other vertices of S.

Definition 7

(Min-Domination) Given a vertex u and a vertex subset \(S \subseteq \varPhi _u\), we say that \(v \in S\) min-dominates \(v' \in S\) if (1) v has a larger vertex id than \(v'\) (i.e., \(v > v'\)) and \({\textsf{sim}} (u,v) \le {\textsf{sim}} (u,v')\).

It is easy to see that \(\min _{v'\in S} {\textsf{sim}} (u,v')\) is equal to \({\textsf{sim}} (u,v)\) for v being the smallest vertex of S that is not min-dominated by other vertices of S. We store in a queue \(\mathcal {Q}_{\min }\) the list of vertices of S that are not min-dominated by other vertices of S; note that storing all the vertices (instead of just the smallest such vertex) that are not min-dominated will make our maintaining of \(\mathcal {Q}_{\min }\) easy. As vertices in \(\mathcal {Q}_{\min }\) are stored in increasing id order, their similarities to u are also strictly increasing. Suppose \(S = \{v_i,\ldots ,v_j\}\) and \(\mathcal {Q}_{\min }\) currently stores the non-min-dominated vertices of S. To maintain \(\mathcal {Q}_{\min }\) for \(S' = \{v_{i+1},\ldots ,v_{j'}\}\) where \(j' \ge j\), we only need to

  • remove \(v_i\) from \(\mathcal {Q}_{\min }\) if \(v_i \in \mathcal {Q}_{\min }\),

  • remove the longest suffix of \(\mathcal {Q}_{\min }\) that is min-dominated by a vertex of \(\{v_{j+1},\ldots ,v_{j'}\}\), and

  • add all vertices of \(\{v_{j+1},\ldots ,v_{j'}\}\) that are not min-dominated by any vertex of \(S'\).

We note that all these operations can be conducted in amortized constant time, which will be discussed shortly.

Algorithm 7
figure g

\({\textsf{TPA}} (G, u, \varPhi _u, \gamma )\)

The pseudocode of our linear-time algorithm for computing all maximal steady segments is shown in Algorithm 7. Same as \(\textsf{MaximalSteadySegments}\) in Algorithm 6, we first sort vertices of \(\varPhi _u\) into increasing vertex id order (Line 2); let \(\{v_1,v_2,\ldots ,v_{|\varPhi _u|}\}\) be the vertices in sorted order. We initialize the two pointers i and j to both be 1 (Line 3) and initialize two queues \(\mathcal {Q}_{\min }\) and \(\mathcal {Q}_{\max }\) (used for obtaining the minimum and the maximum similarity in a segment, respectively) by \(v_1\) (Line 4). Then, we iteratively compute the maximal steady segment for \(v_i, 1 \le i \le |\varPhi _u|\) (Lines 5–19). To compute the maximal steady segment for \(v_i\), we first pop the front element from \(\mathcal {Q}_{\min }\) and \(\mathcal {Q}_{\min }\) if it is smaller than \(v_i\) (Lines 6–7), and then keep increasing j if the segment covering vertices \(\{v_i,\ldots ,v_j\}\) is still steady (Lines 8–16). The testing of whether the segment covering vertices \(\{v_i,\ldots ,v_j\}\) is still steady or not is achieved by comparing \({\textsf{sim}} (u, \mathcal {Q}_{\max }.front()) - {\textsf{sim}} (u,\mathcal {Q}_{\min }.front())\) with \(\gamma \) (Line 8); as discussed above, \(\mathcal {Q}_{\min }.front()\) stores the vertex that has the lowest similarity to u among vertices of \(\{v_i,\ldots ,v_j\}\) while \(\mathcal {Q}_{\max }.front()\) stores the one with the highest similarity. Note that we consistently have \(j\ge i\) throughout the entire computation since the segment covering one vertex is trivially steady. After increasing j (Line 9) and j still being no larger than \(|\varPhi _u|\), we update \(\mathcal {Q}_{\min }\) and \(\mathcal {Q}_{\max }\) to store all the non-min-dominated and non-max-dominated vertices of \(\{v_i,\ldots ,v_j\}\), respectively (Lines 11–16). Specifically, to update \(\mathcal {Q}_{\min }\), we first remove from it all vertices that are min-dominated by \(v_j\) (Lines 11-12), and then push \(v_j\) to the back of \(\mathcal {Q}_{\min }\) (Line 13); \(\mathcal {Q}_{\max }\) is updated similarly. Once we reach a j such that either \(j > |\varPhi _u|\) (Line 10) or the segment covering vertices \(\{v_i,\ldots ,v_j\}\) is no longer steady, we set the maximal steady segment \({\textsf{seg}} _{v_i}\) of \(v_i\) to be \(\langle v_i,v_{j-1},{\textsf{null}},j-i\rangle \) (Line 17) and add it to \(\mathbb {C}\) (Line 18).

Lines 5–19 of Algorithm 7 run in \(O(|\varPhi _u|)\) time, since (1) j will increase for \(|\varPhi _u|\) times and (2) each vertex of \(\varPhi _u\) will be inserted into and popped from \(\mathcal {Q}_{\min }\) (resp. \(\mathcal {Q}_{\max }\)) at most once. Note that, in our implementation, we also store the similarities into \(\mathcal {Q}_{\min }\) and \(\mathcal {Q}_{\max }\) such that the similarity between u and any vertex at the front or back of \(\mathcal {Q}_{\min }\) and \(\mathcal {Q}_{\max }\) can be retrieved in constant time. Thus, the total time complexity of Algorithm 7 is dominated by Line 2 which takes \(O(|\varPhi _u| \log |\varPhi _u|)\) time. By invoking Algorithm 7 at Line 3 of Algorithm 6, the time complexity of Algorithm 6 becomes \(O\Big (\sum _{u \in V_L \cup V_R} \big (\alpha |\varPhi _u| \log |\varPhi _u| + \sum _{v \in N(u)} d(v)\big ) \Big )\).

Example 6

Let’s consider \(\varPhi _u\) in Fig. 7 and compute the maximal steady segments for its vertices with \(\gamma =0.3\); note that \((v_1,\ldots ,v_6) = (1,2,3,8,9,10)\). We initialize both i and j as 1, and initialize both \(\mathcal {Q}_{\min }\) and \(\mathcal {Q}_{\max }\) as \(\{v_1\}\). To compute the maximal steady segments for \(v_1\), we keep increasing j until the segment covering vertices \(\{v_i, \ldots , v_j\}\) is no longer steady. When j is increased to 2, we update \(\mathcal {Q}_{\min }\) as \(\{v_2\}\) (note that \(v_1\) is min-dominated by \(v_2\) and thus removed from \(\mathcal {Q}_{\min }\)) and update \(\mathcal {Q}_{\max }\) as \(\{v_1,v_2\}\); the segment covering \(\{v_1,v_2\}\) is steady as \({\textsf{sim}} (u,\mathcal {Q}_{\max }.front()) - {\textsf{sim}} (u,\mathcal {Q}_{\min }.front()) = 0.1 < \gamma \). When \(j=3\), we update \(\mathcal {Q}_{\min }\) as \(\{v_3\}\) and update \(\mathcal {Q}_{\max }\) as \(\{v_1,v_2,v_3\}\), as shown in Fig. 7a. The segment covering \(\{v_1,v_2,v_3\}\) is no longer steady since \({\textsf{sim}} (u,\mathcal {Q}_{\max }.front()) - {\textsf{sim}} (u,\mathcal {Q}_{\min }.front()) = 0.4 > \gamma \), and thus, \({\textsf{seg}} _{v_1} = \langle v_1,v_2,{\textsf{null}},2\rangle \).

Next, we increase i to 2 for computing the maximal steady segment for \(v_2\); we also remove \(v_1\) from \(\mathcal {Q}_{\max }\) since \(v_1 < v_i\). Same as illustrated above, we keep increasing j until the segment covering \(\{v_2,\ldots ,v_j\}\) is no longer steady. When j is still 3, \(\mathcal {Q}_{\min } = \{v_3\}\) and \(\mathcal {Q}_{\max } = \{v_2,v_3\}\) and thus the segment is steady. When \(j=4\), \(\mathcal {Q}_{\min } = \{v_3,v_4\}\) and \(\mathcal {Q}_{\max } = \{v_2,v_4\}\) and the segment is still steady. When \(j=5\), \(\mathcal {Q}_{\min } = \{v_3,v_4,v_5\}\) and \(\mathcal {Q}_{\max } = \{v_5\}\) as shown in Fig. 7b, and the segment is no longer steady. Thus, \({\textsf{seg}} _{v_2} = \langle v_2, v_4, {\textsf{null}}, 3\rangle \).

5.4 Parallelization

Our index construction algorithms can be easily parallelized, since processing \(\varPhi _u\) is totally independent from processing \(\varPhi _{u'}\) for \(u \ne u'\). Thus, we propose to further speed up the index construction with shared-memory parallelization (i.e., handling different vertices concurrently with multiple threads). Specifically, we use OpenMP to parallelize the “for loop” at Line 1 of Algorithm 6 in our parallel implementation.

6 Index maintenance

In this section, we propose algorithms to dynamically maintain our index structures that are proposed in Sect. 5.3, as real-world graphs are consistently changing with old edges being removed and/or new edges being added. Note that the algorithms introduced in this section can be applied to both LG index and SS index.

Here, we mainly focus on edge deletion and insertion, by first discussing the case of edge deletion in Sect. 6.1 and then the case of edge insertion in Sect. 6.2. Note that vertex deletion (resp. insertion) can be handled as a sequence of edge deletion (resp. insertion) followed by (resp. following) isolated vertex deletion (resp. insertion). When removing a vertex from a graph, we preserve its id and do not assign the id to other vertices. When inserting a new vertex into the graph, its id is set as 1 plus the current largest id.

6.1 Edge deletion

Let (uv) be the edge that is removed from G. Let \(\varPhi _u\) and \(\varPhi _v\) be the sets of 2-hop neighbors of u and v, respectively, in the graph G before removing the edge (uv). After removing edge (uv), among \(\{N(w) \mid w \in V_L\cup V_R\}\), only N(u) and N(v) change. Thus, only similarities in \(\{{\textsf{sim}} (u,w) \mid w \in \varPhi _u\}\cup \{{\textsf{sim}} (v,w') \mid w' \in \varPhi _v\}\) will change their value. For presentation simplicity, we assume that \(\varPhi _w\) does not change for any \(w \in V_L\cup V_R\); that is, even if the similarity \({\textsf{sim}} (w,w')\) decreases from a positive value to 0, we still consider \(w'\) to be in \(\varPhi _w\). Thus, we have the following lemma.

Lemma 6

When an edge (uv) is removed from G, only the part \(\{\mathbb {S}_w \mid w \in \varPhi _u\cup \varPhi _v\cup \{u,v\}\}\) of the index structure will change.

In the following, we only discuss the index maintenance for \(\mathbb {S}_u\) and \(\mathbb {S}_w\) where \(w \in \varPhi _u\); note that \(\mathbb {S}_v\) and \(\mathbb {S}_{w'}\) for \(w' \in \varPhi _v\) can be maintained similarly. For any vertex \(w\in \varPhi _u\), we also have \(u \in \varPhi _w\) and thus there must exist a segment \({\textsf{seg}} \in \mathbb {S}_w\) that covers u before the update, i.e., \({\textsf{seg}}.{\mathsf {v_{\min }}} \le u\le {\textsf{seg}}.{\mathsf {v_{\max }}} \). Consequently, we only need to update \({\textsf{seg}}.{\mathsf {s_{\max }}} \) as \(\max \{{\textsf{sim}} (u,w),{\textsf{seg}}.{\mathsf {s_{\max }}} \}\), where \({\textsf{sim}} (u,w)\) is the updated similarity between u and w. Note that if \({\textsf{sim}} (u,w) < {\textsf{seg}}.{\mathsf {s_{\max }}} \), it is also possible to decrease \({\textsf{seg}}.{\mathsf {s_{\max }}} \); but for updating efficiency consideration, we do not decrease \({\textsf{seg}}.{\mathsf {s_{\max }}} \). For the same reason, we also do not update (i.e., decrease) \({\textsf{seg}}.c\). For vertex u, \(\mathbb {S}_u\) can be updated in a similar way; that is, for each vertex \(w\in \varPhi _u\), we find the segment \({\textsf{seg}} ' \in \mathbb {S}_u\) that covers w and update \({\textsf{seg}} '.{\mathsf {s_{\max }}} \) as \(\max \{{\textsf{sim}} (u,w),{\textsf{seg}} '.{\mathsf {s_{\max }}} \}\). Note that, we only need to update one segment for \(\mathbb {S}_w\), but we may need to update multiple segments for \(\mathbb {S}_u\).

Optimization. In the above process, if \({\textsf{sim}} (u,w)\) becomes smaller after the update, then we actually do not update the segment that covers u or w. Thus, we can skip all such \(w \in \varPhi _u\) that \({\textsf{sim}} (u,w)\) is smaller after removing the edge (uv). Fortunately, this set of vertices can be easily characterized by the following lemma.

Lemma 7

After removing edge (uv) from the graph, the similarity \({\textsf{sim}} (u,w)\) for \(w \in \varPhi _u\) decreases if and only if \((w,v)\in E\), and it increases otherwise.

Proof

Let d(u) and d(w) be the degrees of u and w before the update, and C be the number of common neighbors of u and w before the update. Then, \({\textsf{sim}} (u,w) = \frac{C}{d(u)+d(w)-C}\) before the update. After removing (uv), d(w) remains unchanged while d(u) decreases by 1. We consider two cases:

  • If \((w,v)\in E\), the number of common neighbors between u and w will decrease by 1. Thus, the similarity between u and w after the update becomes \({\textsf{sim}} (u,w)=\frac{C-1}{d(u)+d(w)-C}\), which is smaller than their similarity before the update.

  • If \((w,v)\notin E\), the number of common neighbors between u and w will remain unchanged and \({\textsf{sim}} (u,w)=\frac{C}{d(u)+d(w)-C-1}\), which is larger than their similarity before the update.

This completes the proof. \(\square \)

Algorithm 8
figure h

\({\mathsf {EdgeDelete^*}} (G, u, v, \mathcal {I})\)

Following Lemma 7, when we process a 2-hop neighbor \(w\in \varPhi _u\) of u, we first check if \((w,v)\in E\). If \((w,v)\in E\), we know that the similarity between u and w must be decreasing and thus do nothing. The pseudocode of our algorithm for handling edge deletion is shown in Algorithm 8. We first invoke \({\textsf{UpdateDeletion}} \) to update \(\{\mathbb {S}_w \mid w \in \varPhi _u \cup \{u\}\}\) (Line 1), then invoke \({\textsf{UpdateDeletion}} \) again (but with different arguments) to update \(\{\mathbb {S}_w \mid w \in \varPhi _v \cup \{v\}\}\), and finally remove the edge (uv) from the graph. The pseudocode of the procedure \(\textsf{UpdateDeletion}\) is shown in Lines 4–11 of Algorithm 8. We first invoke a variant of \({\textsf{SimNei}} \) to compute u’s old 2-hop neighbors (i.e., before removing (uv)) and to compute the new (i.e., updated) similarity between u and each \(w \in \varPhi _u\) (Line 4); this can be done in a similar way to Algorithm 3, and we omit the details. Then, for each vertex \(w\in \varPhi _u\), if \((w,v)\in E\), we know from Lemma 7 that \({\textsf{sim}} (w,u)\) must be decreasing and we skip w (Lines 6–7); otherwise, we update the index \(\mathbb {S}_w\) and index \(\mathbb {S}_u\) (Lines 8–11). Specifically, we search for the segment in \(\mathbb {S}_w\) that covers u and update its \({\mathsf {s_{\max }}} \) (Lines 8–9). Then, we search for the segment in \(\mathbb {S}_u\) that covers w and update its \({\mathsf {s_{\max }}} \) (Lines 10–11).

6.2 Edge insertion

Let (uv) be the edge that is inserted into G. Let \(\varPhi _u\) and \(\varPhi _v\) be the sets of 2-hop neighbors of u and v, respectively, in the graph G after adding the edge (uv). After adding edge (uv), among \(\{N(w) \mid w \in V_L\cup V_R\}\), only N(u) and N(v) change. Thus, only similarities in \(\{{\textsf{sim}} (u,w) \mid w \in \varPhi _u\}\cup \{{\textsf{sim}} (v,w') \mid w' \in \varPhi _v\}\) will change their value. Thus, we have the following lemma.

Lemma 8

When an edge (uv) is inserted into G, only the part \(\{\mathbb {S}_w \mid w \in \varPhi _u\cup \varPhi _v\cup \{u,v\}\}\) of the index structure will change.

In the following, we only discuss the index maintenance for \(\mathbb {S}_u\) and \(\mathbb {S}_w\) where \(w \in \varPhi _u\); \(\mathbb {S}_v\) and \(\mathbb {S}_{w'}\) for \(w'\in \varPhi _v\) can be maintained similarly. To update \(\mathbb {S}_w\) for \(w \in \varPhi _u\), we first try to find the segment in \(\mathbb {S}_w\) that covers u (i.e., \({\textsf{seg}} \in \mathbb {S}_w\) s.t. \({\textsf{seg}}.{\mathsf {v_{\min }}} \le u\le {\textsf{seg}}.{\mathsf {v_{\max }}} \)). If there is such a segment \({\textsf{seg}} \), then we update \({\textsf{seg}}.{\mathsf {s_{\max }}} \) as \(\max \{{\textsf{sim}} (u,w),{\textsf{seg}}.{\mathsf {s_{\max }}} \}\) and increase \({\textsf{seg}}.c\) by 1 if the only common neighbor of u and w is v. Otherwise, there is no segment of \(\mathbb {S}_w\) covering u (this means that u is a new 2-hop neighbor of w), then we create a new segment for u and add it into \(\mathbb {S}_w\). The updating of \(\mathbb {S}_u\) is conducted in a similar way.

Optimization. Similar to the optimization in Sect. 6.1, we can skip all such \(w \in \varPhi _u\) that \({\textsf{sim}} (u,w)\) becomes smaller after adding the edge (uv). This set of vertices is characterized by the following lemma.

Lemma 9

After adding edge (uv) into the graph, the similarity \({\textsf{sim}} (u,w)\) for \(w \in \varPhi _u\) increases if and only if \((w,v)\in E\), and it decreases otherwise.

Proof

This can be proved in a similar way to the proof of Lemma 7. \(\square \)

Algorithm 9
figure i

\({\mathsf {EdgeInsert\text {-}Opt}} (G, u, v, \mathcal {I})\)

Following Lemma 9, we only need to update the index for \(w \in \varPhi _u\) such that \((w,v) \in E\). Thus, we can design an algorithm \(\mathsf {EdgeInsert^*}\) in a similar way to \(\mathsf {EdgeDelete^*}\) (Algorithm 8); that is, similar to Lines 6–7 of Algorithm 8, we skip the processing of w if \((w,v) \notin E\). We omit the pseudocode of \(\mathsf {EdgeInsert^*}\). This can significantly improve the updating time. Nevertheless, we can do even better, by noting that \(\{w \in \varPhi _u \mid (w,v) \in E\}\) is the same as N(v). Thus, instead of first generating \(\varPhi _u\) and then skipping non-neighbors of v from \(\varPhi _u\), we only process neighbors of v. The pseudocode of our further optimized algorithm is shown in Algorithm 9, denoted \(\mathsf {EdgeInsert\text {-}Opt}\). We first insert the edge (uv) into the graph, then invoke \({\textsf{UpdateInsertion}} \) to update \(\{\mathbb {S}_w \mid w \in \varPhi _u\cup \{u\}\}\), and finally invoke \({\textsf{UpdateInsertion}} \) again (but with different arguments) to update \(\{\mathbb {S}_w \mid w \in \varPhi _v\cup \{v\}\}\). The pseudocode of the procedure \({\textsf{UpdateInsertion}} \) is shown in Lines 4–17 of Algorithm 9. In \(\textsf{UpdateInsertion}\), for each vertex \(w\in N(v)\) (Line 4), we first compute \({\textsf{sim}} (u,w)\) (Line 5) by computing the number of common neighbors of u and w, denoted \({\textsf{ComNei}} (u,w)\), and then update the index \(\mathbb {S}_w\) (Lines 6–11) and the index \(\mathbb {S}_u\) (Lines 12–17). Specifically, to update \(\mathbb {S}_w\) by u and \({\textsf{sim}} (u,w)\), we search for the segment \({\textsf{seg}} _i\) in \(\mathbb {S}_w\) that covers u and update \({\textsf{seg}} _i.{\mathsf {s_{\max }}} \) if necessary (Lines 6–8). If the number of common neighbors between u and w is 1, we know that u is w’s new 2-hop neighbor and we thus increase \({\textsf{seg}} _i.c\) by 1 (Line 8); note that, even if \(\mathbb {S}_w\) has a segment covering u, it is still possible that u is w’s new 2-hop neighbor because segments contain fake vertices. If \(\mathbb {S}_w\) has no segment covering u (this means that u is w’s new 2-hop neighbor), we create a new segment for u and add it to \(\mathbb {S}_w\) (Lines 9–11). Update \(\mathbb {S}_u\) by w and \({\textsf{sim}} (u,w)\) is conducted similarly.

Table 1 Statistics of graphs

\(\mathsf {EdgeInsert^*}\) vs \(\mathsf {EdgeInsert\text {-}Opt}\). Both \(\mathsf {EdgeInsert^*}\) and \(\mathsf {EdgeInsert\text {-}Opt}\) have their advantages and disadvantages. \(\mathsf {EdgeInsert^*}\) needs to construct the full set \(\varPhi _u\) of u’s 2-hop neighbors and compute the similarity between u and each vertex \(w \in \varPhi _u\). The set \(\varPhi _u\) could be large compared with N(v) that is processed by \(\mathsf {EdgeInsert\text {-}Opt}\). Nevertheless, the similarity between u and each \(w \in N(v)\) is potentially computed more efficiently in \(\mathsf {EdgeInsert^*}\), since it invokes \({\textsf{SimNei}} \) to compute all the similarities at the same time which enables computation sharing among the similarity computation of different vertices. In contrast, \(\mathsf {EdgeInsert\text {-}Opt}\) needs to conduct set intersection to compute \(N(u)\cap N(w)\) which is used in calculating \({\textsf{sim}} (u,w)\). Our empirical study shows that \(\mathsf {EdgeInsert\text {-}Opt}\) performs better in practice.

6.3 Index reconstruction

To achieve high updating efficiency, our index maintenance strategy discussed above only maintains a relaxed version of our index. For example, \({\textsf{seg}}.{\mathsf {s_{\max }}} \) is only guaranteed to be an upper bound (i.e., can be larger than) \(\max _{v \in V({\textsf{seg}})} {\textsf{sim}} (u,v)\); this is because we didn’t update \({\textsf{seg}}.{\mathsf {s_{\max }}} \) when the similarity between u and a vertex covered by \({\textsf{seg}} \) becomes smaller. Also, after inserting an edge (uv), when we cannot find a segment in \(\mathbb {S}_u\) that covers u’s new 2-hop neighbor w, we create a new segment for w and insert this segment into \(\mathbb {S}_u\). As a result, the effectiveness of using our index to speed up query processing may deteriorate when a large number of updates has occurred. In view of this, we propose to reconstruct the index from scratch when the number of updates exceeds a threshold. To achieve this, we record the number of updates that have occurred, including vertex insertion/deletion and edge insertion/deletion. We reconstruct the entire index structure from scratch once this number is larger than \(\zeta \times |E|\), where \(\zeta \) is the parameter to control the frequency of index rebuilding and |E| is the number of edges in the bipartite graph for which the index was last time reconstructed. Note that, when we reconstruct the index from scratch, we also reassign the vertex id to make sure that vertices of \(V_L\) take (integer) ids from \(\{1,2,\ldots ,|V_L|\}\), and vertices of \(V_R\) take ids from \(\{1+|V_L|,2+|V_L|,\ldots ,|V_R|+|V_L|\}\).

7 Experiments

In this section, we evaluate the efficiency of our algorithms as well as the effectiveness of our similar-biclique model.

Algorithms. We compare the following algorithms.

  • \(\textsf{ooMBEA}\): the state-of-the-art algorithm proposed in [8] for enumerating all maximal bicliques.

  • \(\textsf{MSBE}\): our Algorithm 1 equipped with all the optimizations in Sect. 4.2.

  • \(\mathsf {mat\text {-}MSBE}\): the materialized version of \(\textsf{MSBE}\), as discussed at the end of Sect. 4.2.

  • \(\mathsf {LG\text {-}MSBE}\) and \(\mathsf {SS\text {-}MSBE}\): our index-based algorithms that use the largest gap and steady segment index, respectively.

The source code of our algorithms can be found at https://github.com/kyaocs/Similar-Biclique-Idx. The source code of \(\textsf{ooMBEA}\) is obtained from the authors of [8].

All our algorithms are implemented in C++ and run in main memory. All experiments, except the one on evaluating our parallel index construction algorithm, are conducted on a machine with an Intel(R) 3.2GHz CPU and 64GB main memory running Ubuntu 18.04.5. We set a timeout of 10 hours for running an algorithm on a graph.

Datasets. We evaluate the algorithms on 17 real bipartite graphs, all of which are publicly available on KONECT .Footnote 5 Statistics of the graphs are shown in Table 1, where the graphs are listed in increasing order regarding the number of edges.

Query Parameters. A maximal similar-biclique enumeration query consists of two parameters, \(\varepsilon \) and \(\tau \). \(\varepsilon \) is chosen from \(\{0.4, 0.5, 0.6, 0.7\), \(0.8\}\), and is set as 0.5 by default. \(\tau \) is chosen from \(\{3, 4, 5, 6, 7\}\), and is set as 3 by default. In addition, we also have parameters \(\alpha \) and \(\gamma \) in index construction; we set \(\alpha = 1\) and \(\gamma = 0.3\) by default.

Fig. 8
figure 8

Running time on all graphs (\(\varepsilon =0.5\), \(\tau =3\))

7.1 Evaluating query efficiency

In this subsection, we evaluate the query efficiency of our algorithms. Note that, we also implemented a version of \(\textsf{MSBE}\) without the optimizations of \(\textsf{Enum}\) proposed in Sect. 4.2; it is omitted from the experiments since it times out in almost all the testings.

Running time on all graphs. The running time of the five algorithms on all graphs with default \(\varepsilon \) and \(\tau \) is illustrated in Fig. 8. We can see that \(\mathsf {mat\text {-}MSBE}\) slightly improves upon \(\textsf{MSBE}\) when it is feasible to store the similar neighbors of all vertices in main memory. However, \(\mathsf {mat\text {-}MSBE}\) remains notably slow, primarily attributed to the substantial computational cost involved in computing similar neighbors for L-side vertices and constructing the similarity graph \(G_s\). Additionally, \(\mathsf {mat\text {-}MSBE}\) encounters out-of-memory issues on BS, CU, and DI, as marked by “oom” in Fig. 8; for example, the memory consumption on BS would be over 400GB. Note that the memory consumption of \(\mathsf {mat\text {-}MSBE}\) mainly depends on the structure, rather than the size, of the input graph, and thus \(\mathsf {mat\text {-}MSBE}\) does not run out-of-memory on other larger graphs. Our two index-based algorithms, \(\mathsf {LG\text {-}MSBE}\) and \(\mathsf {SS\text {-}MSBE}\), are the fastest and they outperform the other two index-free algorithms by up to 5 orders of magnitude. \(\mathsf {SS\text {-}MSBE}\) is generally faster than \(\mathsf {LG\text {-}MSBE}\). Compared with the state-of-the-art maximal biclique enumeration algorithm \(\textsf{ooMBEA}\), \(\mathsf {SS\text {-}MSBE}\) is up to 6 orders of magnitude faster. Thus, we exclude \(\textsf{ooMBEA}\) from our remaining evaluations.

Fig. 9
figure 9

Running time by varying \(\varepsilon \) and \(\tau \)

Running time by varying \(\varepsilon \) and \(\tau \). The running time of our four algorithms on IM and FL by varying \(\varepsilon \) and \(\tau \) are shown in Fig. 9. We can see that the running time of \(\mathsf {LG\text {-}MSBE}\) and \(\mathsf {SS\text {-}MSBE}\) decreases when either \(\varepsilon \) or \(\tau \) increases. This is because, more vertices will be pruned by \(\textsf{indexedVR}\) when either \(\varepsilon \) or \(\tau \) increases, and thus the enumeration process of \(\mathsf {LG\text {-}MSBE}\) and \(\mathsf {SS\text {-}MSBE}\) run faster. Also, \(\textsf{indexedVR}\) runs faster when \(\varepsilon \) or \(\tau \) increases, as can be seen from Fig. 10. In contrast, the running time of \(\textsf{MSBE}\) and \(\mathsf {mat\text {-}MSBE}\) is not so sensitive to \(\varepsilon \) or \(\tau \), as the dominating part of these two algorithms is computing similar neighbors for vertices.

Fig. 10
figure 10

Efficiency of \(\textsf{indexedVR}\) (\(\tau =3\))

Table 2 Index size and construction time

Efficiency of \(\textsf{indexedVR}\). In this experiment, we evaluate the efficiency of \(\textsf{indexedVR}\) for our two index structures. Recall that \(\textsf{indexedVR}\) (Algorithm 5) has two phases. Thus, we separately report the results of each phase. The running time on IM and FL are shown in Fig. 10a and b. We can see that the two index structures take almost the same time for the first phase, while the second phase of SS index-based \(\textsf{indexedVR}\) is much faster than LG index-based. This can partially be explained by the number of vertices that need to be pruned in the second phase, as reported in Fig. 10c and d. We remark that, for a fixed \(\varepsilon \) and \(\tau \), the total number of pruned vertices by different indexes are the same, and also the same as that pruned by the index-free approach \(\textsf{VReduce}\). Thus, from the number of vertices that are pruned in Phase-II as shown in Fig. 10, we can conclude that SS index prunes much more vertices than LG index in Phase-I. For example, for dataset FL and \(\varepsilon =0.4\), SS index prunes 467, 329 vertices in Phase-I and 9, 691 vertices in Phase-II, while LG index prunes 449, 195 vertices in Phase-I and 27, 825 vertices in Phase-II. As the second phase dominates the running time, SS index is superior.

Evaluate the optimizations for \(\textsf{Enum}\). In this experiment, we evaluate the performance of our two optimization techniques (i.e., Lemma 1 and Lemma 2) on the \(\textsf{Enum}\) procedure. The results are reported in Fig. 11. Specifically, we use our fastest algorithm \(\mathsf {SS\text {-}MSBE}\) to evaluate these two optimization techniques. Note that \(\mathsf {SS\text {-}MSBE}\) applies both optimizations. For this testing, we also implement \(\mathsf {SS\text {-}MSBE\text {-}v1}\) that adopts neither of the two optimizations, \(\mathsf {SS\text {-}MSBE\text {-}v2}\) that adopts only Lemma 1, and \(\mathsf {SS\text {-}MSBE\text {-}v3}\) that adopts only Lemma 2. From Fig. 11, we can see that \(\mathsf {SS\text {-}MSBE\text {-}v1}\) cannot finish in a reasonable time due to lacking of these optimization techniques, and both \(\mathsf {SS\text {-}MSBE\text {-}v2}\) and \(\mathsf {SS\text {-}MSBE\text {-}v3}\) run significantly faster than \(\mathsf {SS\text {-}MSBE\text {-}v1}\) demonstrating that both Lemma 1 and Lemma 2 significantly improved the performance. We can also observe that the improvements of \(\mathsf {SS\text {-}MSBE}\) over \(\mathsf {SS\text {-}MSBE\text {-}v2}\) and over \(\mathsf {SS\text {-}MSBE\text {-}v3}\) are not that significant. This is because the additional improvements brought by Lemma 2 over Lemma 1 and brought by Lemma 1 over Lemma 2 are not that significant. Nevertheless, \(\mathsf {SS\text {-}MSBE}\) consistently runs faster than \(\mathsf {SS\text {-}MSBE\text {-}v2}\) and \(\mathsf {SS\text {-}MSBE\text {-}v3}\); thus we apply both Lemmas in our algorithm.

Fig. 11
figure 11

Evaluation of the optimizations for \(\textsf{Enum}\)

7.2 Evaluating indexing techniques

Index size and construction time on all graphs. The size of the two indexes on all graphs are shown in the fourth column and last column of Table 2. As a comparison, we also report the graph size in the second column of Table 2. We can see that in most cases, the sizes of the two indexes are similar to each other and are at the same level as the graph size, and thus they are affordable to be stored in main memory.

The running time of our index construction algorithms \(\textsf{consLG}\), \(\textsf{consSS}\), \(\mathsf {consSS^*}\) and \(\mathsf {consSS\text {-}Opt}\) are reported in the third, fifth, sixth and seventh columns of Table 2, respectively. Here, \(\mathsf {consSS^*}\) uses a similarity tree to compute all the maximal steady segments at Line 3 of Algorithm 6; the details can be found in our preliminary version [52]. \(\mathsf {consSS\text {-}Opt}\) invokes \(\textsf{TPA}\) (Algorithm 7) to compute all the maximal steady segments. We can see that \(\textsf{consLG}\) runs the fastest due to its simplicity. By applying our novel optimization technique, \(\mathsf {consSS\text {-}Opt}\) is much faster than \(\textsf{consSS}\) and \(\mathsf {consSS^*}\) and is only slightly slower than \(\textsf{consLG}\). We omit \(\mathsf {consSS^*}\) from all the remaining testings.

Fig. 12
figure 12

Index performance by varying \(\alpha \)

Index performance by varying \(\alpha \). In this experiment, we evaluate the effect of \(\alpha \) on the index size, index construction time and efficiency of \(\textsf{MSBE}\). The results are shown in Fig. 12. Recall that \(\alpha \) controls the number of segments constructed for \(\varPhi _u\). As expected, the index size and index construction time increase along with the increasing of \(\alpha \), as shown in Fig. 12a and b. When \(\alpha \) is no larger than 1, the index size is at most at the same level as the graph size, but when \(\alpha \) reaches 100, the index size can be much larger than the graph size. As shown in Fig. 12c and d, the running time of both \(\mathsf {LG\text {-}MSBE}\) and \(\mathsf {SS\text {-}MSBE}\) decreases when \(\alpha \) increases. This is because the more the number of segments, the fewer the number of fake vertices. To strike a balance between index size and efficiency of \(\textsf{MSBE}\), we recommend to set \(\alpha \in [0.1, 10]\).

Fig. 13
figure 13

Efficiency of \(\mathsf {SS\text {-}MSBE}\) by varying \(\gamma \)

Efficiency of \(\mathsf {SS\text {-}MSBE}\) by varying \(\gamma \). In this experiment, we evaluate the performance of \(\mathsf {SS\text {-}MSBE}\) for different \(\gamma \) values. Note that the index size and index construction time of \(\mathsf {consSS\text {-}Opt}\) are almost not affected by \(\gamma \); thus, we omit these results. This is because \(\mathsf {consSS\text {-}Opt}\) selects a fixed number of steady segments (i.e., \(\alpha \log |\varPhi _u|\)) to cover as many vertices of \(\varPhi _u\) as possible, and then, it covers all remaining uncovered vertices of \(\varPhi _u\) by using the fewest number of disjoint segments. Thus, the total number of segments generated for \(\varPhi _u\) is at most \(2\alpha \log |\varPhi _u|+1\), which is independent of \(\gamma \). Figure 13 shows the running time of \(\mathsf {SS\text {-}MSBE}\) by varying \(\gamma \) from 0.1 to 0.9. We can see that when \(\gamma \) is small (e.g., \(\gamma <0.3\)), the performance of \(\mathsf {SS\text {-}MSBE}\) is not good. The main reason is that when \(\gamma \) is small, a steady segment will cover fewer vertices due to the tighter constraint. As a result, more vertices need to be covered by the ordinary segments, which then results in introducing more fake vertices. Also, when \(\gamma \) is large, the performance of \(\mathsf {SS\text {-}MSBE}\) becomes worse. This is because for large \(\gamma \) (e.g., \(\gamma =1\)), a steady segment is no longer steady and degenerates to the ordinary segment. This motivates us to introduce steady segment. We recommend the value of \(\gamma \) to be in [0.3, 0.5].

Fig. 14
figure 14

Speeding up \(\mathsf {consSS\text {-}Opt}\) by using multiple CPU cores

Parallel index construction. In this experiment, we evaluate the parallelized version of our index construction algorithm \(\mathsf {consSS\text {-}Opt}\) by varying the number of CPU cores from 1 to 16 on our largest datasets DE and OR. This experiment was conducted on a machine with an Intel(R) 2.6GHz CPU and 16 cores. As shown in Fig. 14, \(\mathsf {consSS\text {-}Opt}\) has a near-linear speedup by using multiple CPU cores. For example, on OR, the running time of \(\mathsf {consSS\text {-}Opt}\) decreases from 21, 101 seconds (around 6 hours) to 2, 084 seconds (around 0.5 hour) when we increase the core number from 1 to 16. This demonstrates that the index construction can be easily parallelized, and our index can be constructed more efficiently with the help of multiple CPU cores which makes our index more attractive.

Fig. 15
figure 15

Running time of index maintenance

Efficiency of index maintenance. In this experiment, we evaluate the performance of our index maintenance algorithms. For edge deletion, we randomly delete 1, 000 edges from the graph one by one and report the average processing time. For comparison, we also implement a version of \(\mathsf {EdgeDelete^*}\) that does not performance the optimization (i.e., Lemma 7), denoted \(\textsf{EdgeDelete}\). The average processing time of \(\mathsf {EdgeDelete^*}\) and \(\textsf{EdgeDelete}\) for handling edge deletions is reported in Fig. 15a. We can see that the running time of \(\mathsf {EdgeDelete^*}\) is at millisecond-level. For example, when deleting an edge from DB, the maintenance of the index can be completed in less than 0.1 ms. However, the effectiveness of optimization technique is not obvious, i.e., \(\mathsf {EdgeDelete^*}\) performs similarly to \(\textsf{EdgeDelete}\). This mainly because the number of u’s 2-hop neighbors that are adjacent to v is much smaller than the total number of u’s 2-hop neighbors; thus, the pruning at Lines 6–7 of Algorithm 8 is not very effective.

For edge insertion, we randomly insert 1, 000 new edges into the graph one by one and report the average processing time. We compare \(\mathsf {EdgeInsert\text {-}Opt}\) with \(\mathsf {EdgeInsert^*}\) and its non-optimized version \(\textsf{EdgeInsert}\). The results are reported in Figure 16. We can see that \(\mathsf {EdgeInsert^*}\) runs significantly faster than \(\textsf{EdgeInsert}\) due to the optimization technique described in Lemma 9. Nevertheless, \(\mathsf {EdgeInsert\text {-}Opt}\) runs the fastest, since it only processes the neighbors of u and v for the newly inserted edge (uv).

Fig. 16
figure 16

Index size and query efficiency w.r.t the number of updates

Index size and query efficiency w.r.t. the number of updates. In this experiment, we evaluate the index size and query efficiency w.r.t. the number of updates. In our index maintenance algorithms, for time efficiency consideration, we do not update a segment if one or more similarities in the segment decrease, and we also do not enforce the steady requirement of a segment. As a result, it is expected that the query efficiency will deteriorate when the index has been updated by a lot of edge insertions and deletions. The results for edge deletion on IM and FL are shown in Fig. 16a and b, respectively; here, the query time is the running time of \(\mathsf {SS\text {-}MSBE}\) under the default setting (i.e., \(\varepsilon =0.5\) and \(\tau =3\)). We can see that the index size remains steady; this is because deleting edges will not bring new segments to our index. However, the query time increases when the number of updates is over \(10^5\). The results for edge insertion on IM and FL are shown in Fig. 16c and d, respectively. We can see that both the index size and query time increase significantly when the number of inserted edges is over \(10^5\). This is because new segments are also created for handling edge insertions. From this experiment, we recommend to reconstruct the index from scratch when the number of updates has reached \(0.1\times |E|\).

7.3 Effectiveness evaluations

Fig. 17
figure 17

Average Jaccard similarity

Average Jaccard similarity. We compare the average Jaccard similarity between L-side vertices in a maximal (similar-)biclique. Specifically, for each maximal (similar-)biclique C, we compute the average of the Jaccard similarity between all pairs of vertices from \(C_L\), and then the average result of all maximal (similar-)bicliques is reported in Fig. 17. We can see that vertices in a similar-biclique are much more similar to each other than in a biclique.

Fig. 18
figure 18

Number of maximal similar-bicliques

Number of maximal similar-bicliques. The number of maximal similar-bicliques for different \(\varepsilon \) and \(\tau \) values are shown in Fig. 18. We can see that the number of maximal similar-bicliques decreases with the increase of either \(\varepsilon \) or \(\tau \), which is as expected. This is because the number of similar neighbors for a vertex decreases with the increase of \(\varepsilon \). Thus, more and more similar-bicliques disappear under a larger \(\varepsilon \). It is worth mentioning that even under a high \(\varepsilon \) value of 0.7, there is still quite a few similar-bicliques, which exhibit a high level of consistence among members from the same side.

Fig. 19
figure 19

Case study 1: anomaly detection

Case study 1: anomaly detection. In this case study, we compare similar-biclique with other dense bipartite subgraph models, biclique, \((\alpha ,\beta )\)-core [24] and k-biplex [54], on anomaly detection in e-commerce applications. As mentioned in the Introduction, to improve the ranking of certain products, e-business owners may employ a set of fraudulent users to purchase a set of designated products. The fraudsters will also purchase other honest products trying to look “normal”; this is called “camouflage” in the literature. We consider a camouflage attack in the same way as [18] on “Amazon Review Data” (Magazine Subscriptions) ,Footnote 6 which contains 65, 546 reviews on 2, 316 magazines by 53, 617 users, by injecting 100 fraudulent users and 100 fraudulent products with various edge densities. The amount of camouflage (i.e., edges linking to honest products) added per fraudulent user is equal to the amount of fraudulent edges for that user. We adopt F-score, \(\frac{2\times \text {precision}\times \text {recall}}{\text {precision}+\text {recall}}\), to evaluate the accuracy of detecting suspicious users and products. We apply the size constraint \(\tau \) to all the models, where \(\alpha =\beta =\tau \) for the \((\alpha ,\beta )\)-core model; for our similar-biclique model, \(\varepsilon \) is set as 0.2. The results by varying \(\tau \) and varying the density of the injected subgraph are shown in Fig. 19. We can see that similar-biclique always achieves the highest accuracy. This is due to the similarity constraint imposed on users by similar-biclique, which naturally captures the reality that fraudulent users usually display a high level of synchronized behavior with each other. In contrast, biclique, 1-biplex, and \((\alpha ,\beta )\)-core all have a low precision and thus low F-score.

Fig. 20
figure 20

Case study 2: similar-bicliques in Unicode(\(\tau =2\))

Case study 2: interesting pattern detection on Unicode . We also conduct a case study on the Unicode dataset [25] to illustrate the hierarchical structure of similar-bicliques by varying the similarity threshold \(\varepsilon \). Unicode captures the languages that are spoken in a country. The three similar-bicliques detected for \(\varepsilon = 0.7\), 0.4, 0.01 are reported in Fig. 20, where the entire result corresponds to \(\varepsilon =0.01\); the similarity constraint is imposed on the countries and \(\tau =2\). We have the following observations. Firstly, the five countries in the similar-biclique for \(\varepsilon =0.7\) are all located in the Caribbean Sea Area with English and Spanish being their main language (around \(90\%\) population speak English and Spanish). Secondly, more countries from Latin America, e.g., Argentina and Chile, are included in the similar-biclique for \(\varepsilon =0.4\), and the newly added four countries speak more diverse languages. For example, in Sint Maarten, besides English and Spanish, around \(8\%\) population speak Virgin Islands Creole English and \(4\%\) population speak Dutch .Footnote 7 Lastly, when \(\varepsilon \) is 0.01, similar-biclique degenerates to biclique, and more countries are included, e.g., America and Germany. This demonstrates that similar-biclique can detect interesting patterns.

Table 3 Case study 3: similar-bicliques in DBLP

Case study 3: research group identification in DBLP . Our similar-biclique model also supports the case that not all vertices in a side share a common neighbor. In this case study, we show the similar-bicliques in a researcher-write-paper bipartite graph DBLP Footnote 8 by using different size constraints \(\tau _L\) and \(\tau _R\) on the two sides. The results for \(\varepsilon =0.6\), \(\tau _L=6\), and \(\tau _R=0\) are illustrated in Table 3; thus, researchers in a similar-biclique are not necessarily co-authors of the same paper. We find that the detected similar-bicliques corresponds to research groups in different institutes.

8 Conclusion

In this work, we formulated the notion of similar-biclique, and proposed algorithms as well as optimization techniques to enumerate all similar-bicliques in a bipartite graph. Besides, index structures are also designed to speed up the computation. We also proposed effective and efficient index construction algorithms by investigating two different strategies. In addition, we proposed index maintenance algorithms to handle dynamic graph updates. Extensive empirical studies on real bipartite graphs demonstrated the effectiveness of our similar-biclique model and the efficiency of our algorithms. Case studies show that the similar-biclique model can be used to detect anomalies as well as interesting dense subgraph patterns. Our work initiates the study of integrating similarity constraint into dense bipartite subgraph mining, by taking the biclique model. For future studies, it will be interesting to integrate similarity constraint into other dense bipartite subgraph models, such as quasi-biclique, k-biplex, \((\alpha ,\beta )\)-core, k-bitruss, and k-wing. We believe that our proposed index structures will also be useful for these extensions.