CN114528439B - Method and device for maximal clique enumeration based on a distributed system - Google Patents
- Publication number
- CN114528439B CN114528439B CN202011324463.5A CN202011324463A CN114528439B CN 114528439 B CN114528439 B CN 114528439B CN 202011324463 A CN202011324463 A CN 202011324463A CN 114528439 B CN114528439 B CN 114528439B
- Authority
- CN
- China
- Prior art keywords
- vertex
- node
- copy
- information
- current
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
- G06F16/9027—Trees
- G06F16/903—Querying
- G06F16/9038—Presentation of query results
- G06F16/907—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
Abstract
The invention provides a method and a device for maximal clique enumeration based on a distributed system. The method comprises the following steps: renumbering the vertices of an undirected unweighted graph based on their attributes or positions, so that vertices with similar attributes or positions receive adjacent numbers; dividing the renumbered vertex set into a plurality of consecutive blocks by number; assigning the consecutive blocks to a plurality of nodes, one block per node; determining the ID value of every vertex in the undirected unweighted graph; transferring information between the node holding a primary copy and the nodes holding its mirror copies, so that each node obtains, for each of its primary copies, a first neighbor list containing those neighbors whose IDs are larger than the primary copy's ID; and having each node obtain the adjacency list of each primary-copy vertex and, based on the obtained adjacency-list information, enumerate maximal cliques using a search tree. The invention takes vertex locality into account, reduces information transfer, and improves search efficiency.
Description
Technical Field
The invention relates to the technical field of graph computation, in particular to maximal clique enumeration on graphs, and more particularly to a method and a device for maximal clique enumeration based on a distributed system.
Background
In recent years, with the development of information technology, all kinds of big data have become ubiquitous in practical applications, such as social networks, Web networks, and biological networks. These big data systems can be represented as graphs. For example, a Web network is a graph of Web pages interconnected by hyperlinks; in a social network graph, people serve as vertices and their relationships as edges; in a biological network graph, biochemical molecules serve as vertices and the reactions between them as edges. Extracting the implicit dense substructures from the graphs of these networks is a fundamental problem in network analysis, for example: mining social circles from social networks, finding key websites in Web networks, and finding protein complexes in biological networks. Many models have been proposed in the field of network graph mining to extract dense subgraphs, among which a classical one is the maximal clique model: a maximal clique is a complete subgraph (every two vertices connected by an edge) that is not contained in any larger complete subgraph.
As the graphs to be processed grow ever larger and more complex, single-machine algorithms are no longer sufficient for current graph processing requirements. There are three main types of algorithms for the maximal clique enumeration problem: the first are serial in-memory algorithms, the second are external-memory algorithms, and the third are distributed parallel algorithms. Since in-memory and external-memory algorithms are comparatively slow, distributed parallel algorithms are mostly adopted for large graphs at present. Many existing distributed parallel algorithms require the complete graph to be stored on every machine and may suffer from load imbalance due to differences in vertex degrees. For this situation, Yanyan Xu et al. proposed a MapReduce-based distributed algorithm, in which one machine is required to act as the master and the others as slaves. When the algorithm executes, each vertex is first given a distinct ID value; the ID values can be defined in various ways, for example by degree or by degeneracy ordering. In any maximal clique there must be a vertex whose ID value is the smallest in the clique; based on this observation, the algorithm searches from each vertex for the maximal cliques in which that vertex has the smallest ID value.
The algorithm is split into two phases. The first is the data distribution phase (map phase), which processes the vertex IDs and adjacency-list information in order to obtain, for each vertex v, the neighbors of v whose ID values are larger than that of v. The master node distributes tasks to the slave nodes, and each slave node writes all the information it obtains into a pre-designated intermediate file or to local disk, awaiting the next operation. The second is the search phase (reduce phase), in which the maximal clique search is performed on the nodes: the slave nodes first wait for the master node to distribute tasks, then each remotely reads the vertex information stored in the first phase and searches using it. Specifically, maximal cliques are found by a search tree whose initial information for each vertex v is the ID of v, its adjacency list, and the neighbors of v with IDs larger than v. Each vertex avoids repeated search by searching only vertices with IDs larger than its own; at the same time, the algorithm records, in a suitable data structure, the already-searched vertices that could still be added to the current clique, in order to judge whether the current clique is maximal.
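The two-phase scheme described above can be sketched in a single process. This is a minimal illustration, not the patent's implementation: vertex IDs are assumed to equal the vertex labels, the "map" phase emits for each vertex v its larger-ID neighbors, and the "reduce" phase enumerates the maximal cliques in which v is the smallest-ID member, using an already-searched set to check maximality.

```python
def map_phase(adj):
    """Map stage: emit, for each vertex v, its neighbors with larger IDs
    (here IDs are assumed to equal the vertex labels)."""
    return {v: {u for u in nbrs if u > v} for v, nbrs in adj.items()}

def reduce_phase(v, adj):
    """Reduce stage: enumerate the maximal cliques in which v has the smallest ID."""
    cliques = []

    def expand(clique, cand, fini):
        # cand: not-yet-searched vertices that can extend `clique`
        # fini: already-searched vertices that could still extend it
        #       (if fini is non-empty, `clique` is not maximal)
        if not cand and not fini:
            cliques.append(clique)
            return
        for u in sorted(cand):
            expand(clique | {u}, cand & adj[u], fini & adj[u])
            cand = cand - {u}   # u has now been searched from this level
            fini = fini | {u}

    expand({v},
           {u for u in adj[v] if u > v},   # only larger-ID neighbors are searched
           {u for u in adj[v] if u < v})   # smaller-ID neighbors guard maximality
    return cliques

def enumerate_maximal_cliques(adj):
    map_phase(adj)                  # data distribution stage (output unused here)
    out = []
    for v in adj:                   # one search task per vertex
        out.extend(reduce_phase(v, adj))
    return out
```

On a 4-vertex graph with edges (0,1), (1,2), (2,3), (0,3), (0,2), this yields exactly the two maximal cliques {0,1,2} and {0,2,3}, each reported once, from the vertex with the smallest label.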
During the operation of the whole algorithm, the master node is responsible for scheduling and simply divides the vertex tasks among the slave nodes. The slave nodes cannot interact with each other directly; information interaction occurs only between the master node and the slave nodes. A master node is therefore needed to schedule and manage resources, which wastes resources when the cluster running the algorithm is small, and the existence of the master node greatly increases the maintenance cost of the system.
In addition, the MapReduce-based distributed algorithm requires a large number of disk I/O operations and remote copies, which hinders real-time computation and often brings huge communication cost, consuming much time in communication and placing high demands on bandwidth. The execution efficiency of the algorithm needs to be further improved.
Disclosure of Invention
In view of the problems in the prior art, embodiments of the invention provide a method and a device for maximal clique enumeration based on a distributed system, which greatly improve the execution efficiency of maximal clique enumeration and reduce network overhead.
The technical scheme of the invention is as follows:
A method of maximal clique enumeration based on a distributed system, the method comprising the steps of:
renumbering the vertices of an undirected unweighted graph based on their attributes or positions, so that vertices with similar attributes or positions receive adjacent numbers;
dividing the renumbered vertex set into a plurality of consecutive numbering intervals by number;
assigning the vertex sets corresponding to the consecutive intervals to the nodes, so that each node corresponds to one numbering interval;
determining the ID values of all vertices in the undirected unweighted graph;
during graph computation, transferring information between the node where a primary copy resides and the nodes where its corresponding mirror copies reside, so that each node obtains, based on the transferred information, a first neighbor list for each of its primary copies, the first neighbor list containing those neighbors whose IDs are larger than the ID of the current primary copy;
each node obtaining the adjacency list of each primary-copy vertex and, based on the obtained adjacency-list information, enumerating maximal cliques using a search tree.
In an embodiment of the present invention, transferring information between the node where a primary copy resides and the nodes where the corresponding mirror copies reside, so that each node obtains a first neighbor list for each primary copy based on the transferred information, includes: determining whether the current active edge set is in a sparse state or a dense state; when the active edge set is in a sparse state, the node where the primary copy resides pushes the primary-copy information to the nodes where the mirror copies reside, and each node where a mirror copy resides updates, based on the received primary-copy information, the first neighbor lists of the primary copies it maintains; and when the active edge set is in a dense state, each node where mirror copies reside aggregates the first neighbor lists of all mirror copies on that node and transfers the generated neighbor lists to the corresponding primary copies.
In an embodiment of the present invention, information is transferred between the node where a primary copy resides and the nodes where the corresponding mirror copies reside using the Message Passing Interface (MPI).
In an embodiment of the present invention, transferring information between the node where the primary copy resides and the nodes where the corresponding mirror copies reside using MPI includes: using dedicated threads to transfer the information between those nodes via MPI. Each node obtaining the adjacency list of each vertex and enumerating maximal cliques using a search tree based on the obtained adjacency-list information includes: for any vertex on a node, if the node has already obtained the complete information needed to enumerate from that vertex, performing maximal clique enumeration for that vertex.
In one embodiment of the present invention, enumerating maximal cliques using a search tree based on the obtained adjacency-list information includes the steps of:
for each primary copy owned by each node, selecting a vertex from a candidate vertex set, wherein the candidate set contains vertices that have not yet been searched from the current primary copy, are neighbors of the current primary copy, and have IDs larger than that of the current primary copy;
if the selected vertex is connected by an edge to every vertex in the clique currently being enumerated, adding the selected vertex to the clique currently being enumerated;
intersecting the candidate vertex set with the set of those neighbors of the selected vertex that are both neighbors of the current primary copy and have IDs larger than the current primary copy, and using the result of the intersection as the candidate set for the next level of the search;
intersecting the second vertex set, namely the already-searched vertices that could still be added to the clique currently being enumerated, with the neighbors of the selected vertex, and updating the second vertex set based on the result of the intersection.
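For one selected vertex u, the set updates in the steps above can be sketched as a single expansion step. The names `cand` (candidate set) and `searched` (second vertex set) are assumptions for illustration, vertex IDs are taken to be the labels themselves, and the searched set is intersected only with the neighbors of u so that the maximality check stays valid:

```python
def select_and_update(v, u, clique, cand, searched, adj):
    """One expansion step of the search tree rooted at primary copy v.

    Illustrative sketch: u must come from `cand`, i.e. u is an unsearched
    larger-ID neighbor of v adjacent to every vertex of `clique`.
    Returns the child search-tree node (clique', cand', searched').
    """
    assert u in cand
    adj_gt_v = {w for w in adj[v] if w > v}   # larger-ID neighbors of v
    new_clique = clique | {u}                 # u extends the current clique
    # next-level candidates: remaining candidates that are neighbors of u
    # and still larger-ID neighbors of the primary copy v
    new_cand = cand & adj[u] & adj_gt_v
    # second vertex set: already-searched vertices still adjacent to u
    new_searched = searched & adj[u]
    return new_clique, new_cand, new_searched
```

When `new_cand` and `new_searched` both come back empty, the child clique is maximal; a non-empty `new_searched` with an empty `new_cand` means the branch is abandoned as non-maximal.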
In an embodiment of the present invention, the method further includes: pruning operations are performed using pivot(s), the steps comprising: selecting a vertex as a principal component; and removing the vertex set which is the neighbor of the current main copy and has the ID larger than the current main copy from the neighbor of the selected vertex in the candidate vertex set.
In an embodiment of the present invention, the selecting a vertex as the principal element includes: selecting the vertexes in the current candidate vertex set, wherein the vertexes in the candidate vertex set are not only neighbors of the current primary copy, but also vertex sets with ID larger than the current primary copy; and (3) greedy coloring is carried out on the selected set, and a vertex u corresponding to the intersection with the greatest coloring is selected as a principal element.
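A minimal sketch of the pivot choice and pruning described above. The greedy coloring visits candidates in ascending label order and assigns the smallest free color; the exact traversal order and tie-breaking are assumptions, since the patent does not fix them:

```python
def greedy_colors(verts, adj):
    """Greedily color the induced subgraph on `verts`: smallest free color first."""
    color = {}
    for v in sorted(verts):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def choose_pivot(cand, adj):
    """Pick as pivot a candidate with the largest greedy color
    (ties broken by the larger label, an assumed convention)."""
    color = greedy_colors(cand, adj)
    return max(cand, key=lambda v: (color[v], v))

def pruned_candidates(cand, pivot, adj):
    """Pivot pruning: neighbors of the pivot are removed from the candidates
    to expand at this level, shrinking the search tree."""
    return cand - adj[pivot]
```

On a triangle {0,1,2} the colors are 0, 1, 2, the pivot is vertex 2, and only vertex 2 itself remains to be expanded at this level.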
In an embodiment of the present invention, the method further includes: the recording of vertices is performed using bitmaps.
In another aspect of the present invention, there is also provided a device for maximal clique enumeration based on a distributed system, the device comprising a processor and a memory, the memory storing computer instructions, the processor being configured to execute the computer instructions stored in the memory, and the device implementing the steps of the method described above when the computer instructions are executed by the processor.
In another aspect of the invention, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as described above.
The maximal clique enumeration method and system based on a distributed system fully consider information such as vertex locality, eliminate a large amount of information transfer, improve the efficiency of the distributed clique enumeration algorithm, and can rapidly compute results from graph data; moreover, the algorithm needs no extra scheduling server to distribute tasks, and the graph is partitioned automatically while being read.
Furthermore, the method and system can use a dual-mode engine to obtain, for each vertex v, the list of neighbors with IDs larger than v, thereby greatly reducing data contention when updating primary copies and reducing the overhead of maintaining correctness under concurrent modification.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the above-described specific ones, and that the above and other objects that can be achieved with the present invention will be more clearly understood from the following detailed description.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate and together with the description serve to explain the application.
Fig. 1 is a flow chart of a method for maximal clique enumeration based on a distributed system according to an embodiment of the present invention.
FIG. 2 is a flow chart of a method for obtaining a neighbor list of each vertex by a node based on a dual mode engine for information transfer between nodes in an embodiment of the invention.
FIG. 3 is a schematic diagram of information transfer between nodes using a dual mode engine in an embodiment of the present invention.
Fig. 4 is an example of an undirected unweighted graph with 8 vertices.
Detailed Description
The present invention will be described in further detail with reference to the following embodiments and the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent. The exemplary embodiments of the present invention and the descriptions thereof are used herein to explain the present invention, but are not intended to limit the invention.
It should be noted here that, in order to avoid obscuring the present invention due to unnecessary details, only structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, while other details not greatly related to the present invention are omitted.
It should be emphasized that the term "comprises/comprising" when used herein is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
The existing MapReduce-based distributed algorithm not only requires a large number of disk I/O operations and remote copies, which hinders real-time computation, incurs high communication cost, and places high demands on bandwidth, but also requires a master node to schedule and manage resources; when the cluster running the algorithm is small, this wastes resources, and the presence of the master node greatly increases the maintenance cost of the system. Moreover, such algorithms do not fully consider and exploit vertex locality.
Processing large graph data is currently a popular problem, so efficient processing algorithms are crucial for graph data analysis. Aiming at the shortcomings of the existing MapReduce-based maximal clique enumeration algorithms, the invention provides a novel maximal clique enumeration method based on a distributed system. Preferably, for efficient information transfer, embodiments of the invention also use a dual-mode engine to transfer part of the information, taking different edge densities into account. Meanwhile, the invention fully considers node locality when designing the algorithm, maximizes the advantage of locality in storage and computation, and reduces information transfer as much as possible. During execution of the algorithm of the embodiments, every server has equal status, and no master server is needed to schedule resources.
Fig. 1 is a flow chart of a method for maximal clique enumeration based on a distributed system according to an embodiment of the present invention, for computing all maximal cliques of a given graph G = (V, E). As shown in Fig. 1, the method comprises the following steps:
Step S110, renumbering based on the attributes or positions of the vertices in the undirected unweighted graph, so that vertices with similar attributes or positions receive adjacent numbers.
Given an undirected unweighted graph, represented for example by G = (V, E), where V and E denote the sets of vertices and edges in the graph, respectively, each vertex v in the graph is assigned a unique ID(v); for example, ID(v) may be obtained simply from the degree of the vertex or from a degeneracy ordering, but the invention is not limited thereto. For any vertex v in V, its neighbors are characterized as follows:
(1) If vertex u and vertex v are connected by an edge in graph G, then vertex u is called a neighbor of vertex v, which can be expressed as: adj(v) = {u : (u, v) ∈ E}.
(2) If vertex u is a neighbor of vertex v and the ID value of u is less than the ID value of v, then u is called a neighbor of rank less than v. The neighbors of v with rank less than v are expressed as: adj(<v) = {u : u ∈ adj(v), ID(u) < ID(v)}.
(3) If vertex u is a neighbor of vertex v and the ID value of u is greater than the ID value of v, then u is called a neighbor of rank greater than v. The neighbors of v with rank greater than v are expressed as: adj(>v) = {u : u ∈ adj(v), ID(u) > ID(v)}.
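The three neighbor sets defined above translate directly into code. A small sketch, where `adj` maps each vertex to its neighbor set and `vid` maps each vertex to its ID value:

```python
def adj_sets(v, adj, vid):
    """Return (adj(v), adj(<v), adj(>v)) for vertex v.

    adj(v)  : all neighbors of v
    adj(<v) : neighbors whose ID is smaller than ID(v)
    adj(>v) : neighbors whose ID is larger than ID(v)
    """
    nbrs = adj[v]
    lt = {u for u in nbrs if vid[u] < vid[v]}
    gt = {u for u in nbrs if vid[u] > vid[v]}
    return nbrs, lt, gt
```

Note that adj(<v) and adj(>v) partition adj(v), since every vertex has a unique ID.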
In the present invention, if M(G) denotes the set of all maximal cliques in graph G, then the maximal cliques in M(G) in which vertex v has the smallest ID value are denoted M_v, with M_v = {C : C ∈ M(G), v = argmin_{u ∈ C} ID(u)}, where C is any maximal clique in M_v and "v = argmin_{u ∈ C} ID(u)" means that vertex v belongs to C and has the smallest ID value among all vertices of the maximal clique C, that is, ID(v) = min{ID(u) : u ∈ C}.
In an embodiment of the present invention, the ID value of each vertex in graph G may be computed before step S110, or in step S130 below.
The invention stores the graph information in a distributed manner. Many real-world large-scale graphs exhibit natural locality, such as locality in the semantic attributes or positions of their vertices. In the embodiments of the invention, the graph is therefore preprocessed with this locality in mind: the vertices of graph G are renumbered based on their attributes or positions, e.g. vertices with similar attributes or positions are identified by searching and the like and are reassigned adjacent numbers, so that vertices with similar attributes or positions end up numbered adjacently. After renumbering, adjacent vertices are likely to have very close numbers. A vertex attribute may be, for example, a local semantic attribute of the vertex, but the invention is not limited thereto. A vertex position may be, for example, the location of the vertex (edge or center) within the whole graph.
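The patent leaves the concrete renumbering procedure open ("by searching and the like"); a breadth-first traversal is one simple locality-preserving choice, since vertices that are close in the graph receive close numbers. A sketch under that assumption:

```python
from collections import deque

def bfs_renumber(adj):
    """Assign new consecutive numbers in BFS order, so that vertices that are
    close in the graph (one notion of 'similar position') receive adjacent
    numbers. Disconnected components are covered by restarting the BFS."""
    new_id, counter = {}, 0
    for start in sorted(adj):
        if start in new_id:
            continue
        new_id[start] = counter
        counter += 1
        q = deque([start])
        while q:
            v = q.popleft()
            for u in sorted(adj[v]):
                if u not in new_id:
                    new_id[u] = counter
                    counter += 1
                    q.append(u)
    return new_id
```

After this pass, the consecutive-interval partitioning of step S120 keeps neighboring vertices on the same node with high probability, which is what reduces cross-node information transfer later.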
Step S120, dividing the renumbered vertex set into a plurality of consecutive numbering intervals by number, and assigning the vertex set corresponding to each consecutive numbering interval to a node, so that each node corresponds to one numbering interval.
In this step, the renumbered vertex set is divided into a plurality of consecutive numbering intervals (or consecutive blocks) according to the new numbers obtained in step S110, so that the locality of the original graph is effectively preserved in the subgraph formed by each numbering interval.
After the vertex set V is divided into a plurality of consecutive numbering intervals, the vertex set corresponding to each interval is assigned to a node. In the embodiments of the invention, the number of vertices is not the only consideration when dividing the intervals; to ensure load balance, the number of edges of each vertex is also considered, so that all vertices are distributed evenly across the nodes. Such block partitioning enables system scalability at minimal cost: when the amount of data to be processed grows substantially, larger data can be handled simply by adding machines (nodes), and block partitioning still divides the overall task effectively. Because block partitioning assigns a range of consecutively numbered vertices to each node, the invention can complete a series of frequently invoked basic operations, such as querying which node owns a vertex, by recording only a few simple values on each node, e.g. the first vertex number of the assigned interval and the total number of assigned vertices.
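A sketch of the contiguous partitioning and the owner query described above. The edge-balancing heuristic used here (cut a new interval once the running degree sum reaches an equal share) is an assumption; the patent only states that edge counts are taken into account:

```python
import bisect

def partition_by_edges(degrees, k):
    """Split vertices 0..n-1 into k contiguous intervals with roughly equal
    total degree. Returns the start vertex of each interval."""
    total = sum(degrees)
    starts, acc = [0], 0
    for v, d in enumerate(degrees):
        acc += d
        # open the next interval once the current prefix holds its share of edges
        if len(starts) < k and acc >= total * len(starts) / k:
            starts.append(v + 1)
    return starts

def owner(starts, v):
    """Vertex-hosting query: only the recorded interval starts are needed."""
    return bisect.bisect_right(starts, v) - 1
```

With skewed degrees such as [10, 1, 1, 1, 1] and k = 2, the first interval holds only the single heavy vertex, illustrating how edge counts rather than vertex counts drive the split.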
When each node stores vertex information, a vertex appears in one of two forms: a primary copy or a mirror copy. If vertex a is assigned to node A (i.e., node A owns vertex a), then on node A this vertex is called the primary copy, and node A is responsible for maintaining the state and data of primary-copy vertex a; if there is an edge between vertex a and another vertex b assigned to another node B, then node B holds a mirror copy of vertex a. Similarly, vertex b is a primary copy on node B and a mirror copy on node A. In the existing scheme, each slave node (worker) reads the data to be processed but does not interact with other slave nodes; intermediate results are all written to middleware (or local disk), and the slave nodes then re-read those results to perform the next search operation.
In the embodiment of the invention, the advantage of locality is maximized in storage and calculation, and information transmission can be reduced as much as possible.
Step S130, determining the ID value of each vertex in the undirected unweighted graph.
In other embodiments of the present invention, the step of determining vertex IDs may also be performed before step S110, i.e., the ID values of the vertices are determined before the vertices are renumbered and partitioned.
Step S140, during graph computation, transferring information between the node where a primary copy resides and the nodes where the corresponding mirror copies reside, so that each node obtains a first neighbor list for each of its primary copies based on the transferred information.
For an undirected graph G = (V, E), the task of searching for maximal cliques is divided into a plurality of subtasks distributed to the nodes, so that each node searches for maximal cliques from its own primary copies. However, since graph G is not stored completely on any single node, each node holds only part of the graph information, and interaction between different nodes is therefore necessary to transfer part of the information.
The invention does not perform information interaction by remote reads, thereby avoiding slow disk I/O. In an example of the invention, information can be transferred between nodes directly using the Message Passing Interface (MPI). Owing to vertex locality, most of the information each node needs is in fact already stored on its own machine and need not be fetched from other nodes; only a small part of the required information must be transmitted, which avoids a large amount of information transfer. Therefore, before searching, it is necessary to compute which information each node needs. The information required by each node is described below.
Because each node is to perform a maximum clique search based on its own master copy, i.e., all of its own master copies will be the smallest ID in the enumerated maximum clique, in the embodiment of the present invention, the first neighbor list of each master copy will include a list of neighbors of the current master copy that have IDs greater than the current master copy's own ID. For example, for a primary replica vertex v, its first neighbor list can be represented as: adj (> v) = { u: u e adj (v), ID (u) > ID (v) }.
More specifically, in this step, through the transfer of information between nodes, either the mirror copy obtains the corresponding master copy's information and updates the first neighbor lists of the master copies on its own node accordingly, or the mirror copy assembles its first neighbor list on the node where it is located and passes it on to the corresponding master copy.
In this step, a dual mode engine may be employed to communicate information between nodes, with the nodes obtaining a first neighbor list of the owned vertices. The method specifically comprises the following steps:
step S141, determining whether the current active edge set is in a sparse state or a dense state.
The active edge set has two states, sparse and dense. At any given point in the graph computation process, whether the current active edge set is in the sparse or the dense state can be determined from the ratio of the active edge set size (i.e., the sum of the degrees of all active vertices) to the total edge number |E|. In the embodiment of the invention, different update transfer models are adopted for active edge sets in the two states.
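A sketch of this decision rule (the 0.05 threshold is an assumption for illustration; the patent only states that the ratio of active-edge mass to |E| is compared):

```python
def choose_mode(active_vertices, degree, total_edges, threshold=0.05):
    # Active edge set size = sum of the degrees of all active vertices.
    active_edge_size = sum(degree[v] for v in active_vertices)
    # Compare against a fraction of the total edge count |E| (threshold is illustrative).
    return "sparse" if active_edge_size < threshold * total_edges else "dense"
```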
In step S142, when the active edge set is in a sparse state, the primary copy of one node pushes information of the primary copy to other nodes where the corresponding mirror copy is located, and the node where the corresponding mirror copy is located updates the first neighbor list of the primary copy on the node where the mirror copy is located based on the received primary copy information.
Specifically, when the active edge set is in a sparse state, the distributed system is better suited to a push-based processing mode, i.e., push mode. In push mode, a node pushes the information of its own master copies (such as the master copy number and ID) to the other nodes where the corresponding mirror copies are located, and each receiving node updates the first neighbor lists of its own master copies based on the received information. In the sparse state, the number of active edges in the system is small. When a node receives the information pushed by the master copy corresponding to one of its mirror copies, it can update, based on the pushed information, the first neighbor lists of the master copies it owns, each list containing the neighbors of that master copy whose IDs are greater than its own. Because these neighbors are few, the overhead is low.
FIG. 3 is an example of information transfer between nodes using a dual-mode engine. As shown in FIG. 3, black indicates that a vertex is a mirror copy and white that it is a primary copy. The dashed lines represent information transfer between nodes, while the solid lines represent work within a node. In sparse mode, as shown in the left part of FIG. 3, the mirror copy v is located on the same node as the primary copies of two other, unnumbered vertices. The node holding the primary copy v pushes the vertex number vtx and vertex ID of the primary copy v, i.e., (vtx, ID(vtx)), as a message via SparseSignal to the node where the corresponding mirror copy v is located. After receiving the message, the node holding the mirror copy traverses the neighbor information stored locally via SparseSlot and compares IDs: if a surrounding primary copy's ID is smaller than the received ID(vtx), vtx is written into that primary copy's first neighbor list. In this example, the primary copy v sends its own number and ID value (vtx, ID(vtx)) as a message to its mirror copy v; after receiving it, the receiving node traverses the mirror copy's two neighbors, and wherever v's ID is greater than a neighbor's, v is added to that neighbor's (a primary copy's) first neighbor list, which contains exactly those of its neighbors whose IDs are greater than its own.
Step S143: when the active edge set is in a dense state, the node where a mirror copy is located compiles the mirror copy's partial first neighbor list from the vertices stored in that node and transmits the generated list to the corresponding master copy.
When the active edge set is in a dense state, the distributed system adopts a pull-based processing mode, i.e., pull mode, from which a dense active edge set obtains higher benefit. In pull mode, for each mirror copy the node collects information from its neighbors to update its state and data, and the master copy corresponding to the mirror copy then pulls the updated information to complete its own update. One primary copy is likely to have multiple mirror copies; if multiple mirror copies updated the primary copy concurrently, data contention would arise and data locks would be unavoidable, reducing efficiency to some extent. In pull mode, however, the number of mirror copies owned by one master copy is at most the number of nodes, so data contention during updating is greatly reduced, along with the overhead of maintaining correctness under concurrent modification.
In pull mode, as shown in the right part of FIG. 3, each mirror copy compiles via DenseSignal the list of vertices stored on its node whose ID values are greater than its own, then sends this list to its primary copy via DenseSlot; the collection of the lists obtained by the respective mirror copies from the nodes where they are located constitutes the first neighbor list adj(>v) of primary copy v. As shown in FIG. 3, the node where a mirror copy is located first updates the mirror copy by traversing its neighbors, i.e., collecting all neighbor vertex numbers whose IDs are greater than that of mirror copy v; after the collection is complete, it sends the partial information collected by the mirror copy to the node where the corresponding master copy is located, and the node holding the master copy consolidates the parts into the first neighbor list adj(>v) of the master copy.
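The two directions can be contrasted in a toy, single-process simulation (the partition, the `owner` map, and the function names are our illustrative assumptions; in the real system the messages travel over MPI):

```python
# Two "nodes": node 0 owns vertices 1 and 2, node 1 owns vertices 3 and 4.
owner = {1: 0, 2: 0, 3: 1, 4: 1}
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
ids = {v: v for v in adj}

def push_mode(active):
    # Sparse state: each master v signals (vtx, ID(vtx)); the receiving side
    # appends v to the first neighbor list of every local master with smaller ID.
    result = {v: set() for v in adj}
    for v in active:
        for u in adj[v]:              # message delivered to the node owning u
            if ids[v] > ids[u]:
                result[u].add(v)
    return result

def pull_mode(active):
    # Dense state: each mirror of v gathers larger-ID neighbors locally;
    # the master then merges the per-node partial lists.
    result = {v: set() for v in adj}
    for v in active:
        partial = {}
        for u in adj[v]:
            partial.setdefault(owner[u], set()).add(u)
        for part in partial.values():  # master-side merge
            result[v] |= {u for u in part if ids[u] > ids[v]}
    return result
```

Both directions produce the same first neighbor lists; only the communication pattern differs.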
The embodiment of the invention thus uses the dual-mode engine to count, for each vertex v, the neighbors whose ID values are greater than v's; the next step uses this result to obtain the adjacency lists of the vertices.
In step S150, each node obtains an adjacency list for each of its primary copy vertices and, based on the obtained adjacency list information, enumerates maximal cliques using the search tree.
For any maximal clique C in Mv, since v is the smallest ID in the clique, all vertices of C other than v, i.e., C\{v}, belong to adj(>v). To compute the maximal clique C, it must be known, for any two vertices u and w in the clique, whether the edge (u, w) belongs to the edge set E. Therefore, adj(u) is needed for every u ∈ adj(>v). In verifying maximality, the complete adj(u) and adj(v) information is needed, because some vertex w (w ∈ adj(u) and w ∈ adj(v), ID(w) < ID(v)) may be used to refute maximality. For example, if the superset {w, u, v} is a clique but the algorithm has no information about the vertex w, it may wrongly consider {v, u} a maximal clique.
After the first neighbor list adj(>v) of each primary copy on the node is obtained in the aforementioned step S140, the next step is to obtain the adjacency lists of these vertices. The adjacency lists required for Mv are:
{(u,adj(u)):u∈adj(>v)}∪{(v,adj(v))}。
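A sketch of assembling this required set on one node (names are illustrative; in the distributed setting the entries for remote vertices arrive via MPI):

```python
def required_adjacency_lists(adj, ids, v):
    # {(u, adj(u)) : u in adj(>v)} plus the pair (v, adj(v)) itself.
    adj_gt_v = {u for u in adj[v] if ids[u] > ids[v]}
    need = {u: adj[u] for u in adj_gt_v}
    need[v] = adj[v]
    return need

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}   # a triangle
ids = {v: v for v in adj}
```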
In the embodiment of the invention, because of the locality of vertex distribution, most of the vertices in a clique are likely located on the same server; for those vertices the adjacency list can be read directly, avoiding the cost of transmitting adjacency lists between servers. For the information transfer that cannot be avoided, the present invention may use the Message Passing Interface (MPI), with dedicated threads for sending and receiving adjacency lists. While information is being sent and received, the vertices that have already acquired all their adjacency list information can start the search and enumeration of maximal cliques, so the information collection work for other vertices overlaps with the enumeration work. Compared with the existing linear processing mode (first collecting all information, then enumerating sequentially), this scheduling hides part of the communication cost. In the embodiment of the invention, one process is created to receive information and another to send it. Once all the adjacency list information needed for a vertex has been obtained, the node where the vertex is located can immediately use it to start clique enumeration based on that vertex.
The enumeration of maximal cliques using the search tree based on the obtained adjacency list information can be realized with existing maximal clique enumeration methods; enumeration with a search tree is preferred. The search tree performs a simple tree enumeration explored by repeated recursive search.
Since for any clique C in Mv we have C\{v} ⊆ adj(>v), to enumerate the maximal cliques in Mv it is necessary to intersect, for each vertex u, the neighbors adj(u) of u with the neighbors adj(>v) of v whose ID values are greater than v's, i.e., to compute adj(u) ∩ adj(>v). In the embodiment of the invention, ADJ>v[u] denotes the neighbors of vertex u that are also neighbors of v with IDs greater than v's. In addition, to judge the maximality of a clique, adj(u) ∩ adj(v) must be computed; the invention uses ADJv[u] to denote the vertices that are neighbors of both v and u. The above information is passed to the search function as the initial search information.
In the search process, C denotes the clique currently being enumerated, "candidates" denotes the set of candidate vertices that can still be added to the current clique, and "prev" denotes the set of vertices that have already been searched and could be added to the current clique. Initially C contains only the vertex v and the candidate set is exactly adj(>v). The algorithm then repeats the following step until the candidates set is empty: take a vertex u from the candidates set and, if u is connected by an edge to every vertex in C, add u to the set C. When both the candidates set and the prev set are empty, the algorithm has found a maximal clique.
When the algorithm adds vertex u to the current clique C, the candidates set must be updated: it is intersected with ADJ>v[u]. At the same time, the prev set is updated by intersecting prev with ADJv[u], because if the current branch were contained in another maximal clique C', the vertices of C'\C would have to lie in ADJv[u]; only previously searched vertices adjacent to u can still refute the maximality of cliques extending C ∪ {u}.
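The recursion just described is, in essence, the classical Bron–Kerbosch scheme restricted to adj(>v). A compact sketch without the pivot optimization (the function names are ours, not the patent's):

```python
def enumerate_cliques(adj, C, cand, prev, out):
    # C: clique being enumerated; cand: candidates; prev: already-searched vertices.
    if not cand and not prev:
        out.append(C)                  # nothing extends C and nothing refutes it
        return
    for u in sorted(cand):
        enumerate_cliques(adj, C | {u}, cand & adj[u], prev & adj[u], out)
        cand = cand - {u}              # u is fully explored on this branch,
        prev = prev | {u}              # so it now serves only the maximality test

def cliques_rooted_at(adj, ids, v):
    # Maximal cliques in which v is the smallest-ID vertex (the set Mv).
    cand = {u for u in adj[v] if ids[u] > ids[v]}
    prev = {u for u in adj[v] if ids[u] < ids[v]}   # smaller-ID witnesses
    out = []
    enumerate_cliques(adj, {v}, cand, prev, out)
    return out

adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}  # 4-cycle plus chord 1-3
ids = {v: v for v in adj}
```

On this example the union over all v yields exactly the two maximal cliques {1, 2, 3} and {1, 3, 4}, each reported once, at the vertex of smallest ID.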
Since Mv contains all the maximal cliques having v as their smallest vertex, M(G) is the union of all Mv, i.e., M(G) = ∪v∈V Mv.
By the above steps, all the maximal cliques of the undirected unweighted graph are obtained.
For further optimization, the invention may also use a pivot for pruning. Pivot pruning selects a vertex up as the pivot and then removes ADJ>v[up] from the candidates set, greatly shrinking the candidates set and hence the search range. The choice of pivot plays an important role in the pruning effect, and the embodiment of the invention chooses it as follows: for each vertex u in the current candidates set, first take ADJ>v[u]; then greedily color each such set and select as this round's pivot the vertex u whose set uses the highest number of colors. Since the pivot pruning technique is already used in existing maximal clique search, its details are omitted.
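A sketch of this greedy-coloring pivot heuristic (function names are ours; the deterministic tie-break via sorting is an assumption, since the text picks arbitrarily among ties):

```python
def greedy_color_count(adj, vertices):
    # Greedily color the subgraph induced by `vertices`; return the number of colors.
    color = {}
    for v in sorted(vertices):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return len(set(color.values())) if color else 0

def choose_pivot(candidates, adj, adj_gt_v):
    # Pick the candidate u whose ADJ>v[u] needs the most greedy colors.
    return max(sorted(candidates), key=lambda u: greedy_color_count(adj, adj_gt_v[u]))

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: set(), 5: set()}
adj_gt_v = {"p": {1, 2, 3}, "q": {4, 5}}   # a triangle vs an independent pair
```

A triangle needs three colors while an independent pair needs one, so "p" wins the pivot here.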
The following describes a distributed-system-based maximal clique enumeration method according to embodiments of the present invention in conjunction with a specific example.
Given an undirected, unweighted graph G, the whole algorithm is divided into two steps: first obtain adj(>v) for all vertices v through the dual mode and obtain the corresponding adjacency lists; then perform the search using this information. The input of the algorithm is the graph G, and the output is all its maximal cliques.
Before executing the method of the invention, a cluster can be deployed with the environment needed to run the code, including working MPI and OpenMP installations: MPI is used for sending and receiving between machines, and OpenMP is used for multi-threaded operation on a single machine to accelerate computation. Since all machines in the cluster are peers, the same code can be run on each machine.
After the cluster deployment is completed, the following operations may be performed:
The first step: each node reads its part of the graph information, records its own master vertex interval and the adjacency list information of its master vertices, and performs graph preprocessing.
The same graph reading work is executed on every node (every machine): it first reads the full graph and records the degree of each vertex, which is used by the subsequent partitioning work to ensure load balance. Each node then numbers the vertices based on their attributes or positions, so that vertices with similar attributes or positions receive adjacent numbers, obtains the numbering interval corresponding to its partition, and records which vertices are located on the node. If a node contains multiple processors, the vertices in the node can be divided again into multiple contiguous numbering intervals. For example, on a node containing s processors, the vertex interval may be divided into s smaller intervals assigned one-to-one to the processors. Such an operation distributes the computation data to specific processors and further improves performance.
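One way to realize the degree-balanced split into contiguous intervals (a greedy sketch under our own assumptions; the patent does not fix a particular balancing rule):

```python
def split_by_degree(degrees, s):
    # Split vertices 0..n-1 into s contiguous intervals of roughly equal degree mass.
    total = sum(degrees)
    target = total / s
    intervals, start, acc = [], 0, 0
    for v, d in enumerate(degrees):
        acc += d
        remaining = len(degrees) - (v + 1)
        # Close an interval once the per-chunk target is reached, keeping at
        # least one vertex available for each interval still to be formed.
        if acc >= target and len(intervals) < s - 1 and remaining >= s - 1 - len(intervals):
            intervals.append((start, v + 1))
            start, acc = v + 1, 0
    intervals.append((start, len(degrees)))
    return intervals
```

Each interval is a half-open vertex range, so a heavy vertex simply ends its chunk early.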
The second step: each node calculates the ID values of its own vertices.
After graph reading is completed, vertex data distribution and collection are performed. Before that work is completed, the ID value of each vertex is calculated; it can be computed, for example, from the vertex's degree or from a degeneracy order. In the embodiment of the invention, each node can calculate the ID values of its vertices using OpenMP; multi-threaded parallelism speeds up the program and helps the algorithm compute vertex IDs faster. After the IDs are calculated, the ID values are gathered by each node.
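A minimal sketch of the degree-based ID assignment (the tie-break by smaller vertex name matches the rule later applied to the FIG. 4 example; the function name is ours):

```python
def assign_ids(adj):
    # Smaller degree -> smaller ID; equal degree -> smaller vertex name first.
    order = sorted(adj, key=lambda v: (len(adj[v]), v))
    return {v: i + 1 for i, v in enumerate(order)}

path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}   # a simple path a-b-c
```

For the path, the two endpoints get IDs 1 and 2, and the middle vertex, with the highest degree, gets ID 3.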
In other embodiments of the present invention, the step of calculating the vertex IDs may also be performed before splitting the vertices, i.e., the ID value of each vertex is determined before the vertices are re-split and renumbered.
The third step: each node then uses the dual-mode engine to find, for each vertex v, the list of neighbor vertices whose IDs are larger than v's, adopting push mode in the sparse state and pull mode in the dense state.
Through the dual-mode engine, each node obtains the ID information of all neighbor vertices of its own vertices and thus obtains adj(>v) for each vertex v.
Fifth step: transmitting adjacency list information between servers.
After obtaining the vertex list, two threads may be created, one to receive the adjacency list and one to send the adjacency list using the MPI command.
Sixth step: searching is performed by using the search tree. Wherein the fifth step and the sixth step overlap in time to some extent for different vertices.
The number of enumerations each node performs is related to the number of vertices it owns. For each vertex v at a node, a series of values such as ADJ>v[u], ADJv[u] and the candidate set are first calculated, and a recursive search is performed using them. During the search, the candidates set and the prev set determine whether a maximal clique has been found. While vertices remain in the candidates set, they are added in turn to the current clique, and ADJ, candidates, prev and the related information are updated and used in the next level of the search.
Since the data structures must be updated by frequent set intersection operations during the search, which is a very time-consuming part, and membership of a vertex must be determined quickly, the embodiment of the invention records vertices with a bitmap, saving both time and space.
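A bitmap membership structure of the kind referred to can be sketched as follows (class and method names are ours):

```python
class Bitmap:
    # One bit per vertex 0..n-1: O(1) membership test, about n/8 bytes of storage.
    def __init__(self, n):
        self.bits = bytearray((n + 7) // 8)

    def add(self, v):
        self.bits[v >> 3] |= 1 << (v & 7)

    def __contains__(self, v):
        return bool(self.bits[v >> 3] & (1 << (v & 7)))

bm = Bitmap(20)
bm.add(0)
bm.add(13)
```

After the two insertions, `0 in bm` and `13 in bm` hold while other vertices test negative.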
The details of the sixth step operation will be described in detail below using the vertex a in fig. 4 as an example:
Input: each vertex v and the adjacency list adj(>v) of vertex v. For the graph shown in FIG. 4, simply take the degree D of a vertex as the criterion for assigning IDs; as shown in FIG. 4, D(a) = 4, D(b) = 6, D(c) = 3, D(d) = 4, D(e) = 7, D(f) = 3, D(g) = 5, D(h) = 2. IDs are assigned on the principle that a smaller D gives a smaller ID (and, for equal D, the smaller vertex number gets the smaller ID). This gives ID(h) = 1, ID(c) = 2, ID(f) = 3, ID(a) = 4, ID(d) = 5, ID(g) = 6, ID(b) = 7, ID(e) = 8. From the figure, adj(a) = {d, g, b, e}; since the IDs of these neighbors are all larger than a's, adj(>a) = {d, g, b, e}. The process of searching for the maximal cliques is as follows:
(1) For each vertex u ∈ adj(>v), the node computes ADJ>v[u] and ADJv[u]. ADJv[u] denotes the neighbors of vertex u that are at the same time neighbors of vertex v. Taking vertex b as an example: the neighbors of b are a, d, g, e, f, c, of which d, g, e are also neighbors of vertex a, so ADJa[b] = {d, g, e}. ADJ>v[u] denotes the neighbors of vertex u that are also neighbors of vertex v and have IDs larger than v's; clearly ADJ>a[b] = {d, g, e}. The ADJ>a[u] and ADJa[u] of the other vertices are obtained in the same way.
(2) Define the initial clique C = {v}, the candidate set candidates, and the prev set used to judge whether a clique is maximal. At this point C = {a}, the candidate set is adj(>a), i.e., candidates = {d, g, b, e}, and the prev set is empty.
(3) If both the candidates set and the prev set are empty, a maximal clique has been found. In the first level of the recursion, the candidates set has size 4 and the prev set is empty, so the search does not stop.
(4) If candidates is not empty, a pivot vertex up is selected and the set U = candidates \ ADJ>v[up] is defined. In the first level of the recursion, the algorithm picks a pivot from candidates = {d, g, b, e}: first take ADJ>a[u] for each vertex u in candidates; then greedily color each set and select as this round's pivot the vertex u whose set uses the most colors. For example, greedily coloring the three vertices {b, g, e} of ADJ>a[d] uses three colors in total; in this example the color counts for the four vertices d, g, b, e are all equal, so vertex b is chosen as the pivot arbitrarily. By the definition U = candidates \ ADJ>v[up], U = {b}, because ADJ>a[b] = {d, g, e}.
(5) The vertices u in the set U are sorted in descending order of |ADJ>v[u]|. In this case only vertex b remains in the set U.
(6) For each vertex u in the set U, a new candidates set cand' is created and ADJ>v[u] is updated to ADJ'>v[u]. Since vertex u is about to be added to the clique, the new candidate vertices must all be neighbors of u. In this example u = b, so for vertex b the value of cand' is the intersection of candidates and ADJ>a[b], i.e., cand' = {d, g, e}. ADJ>v[u] is then updated: since the later search is limited to cand', data irrelevant to cand' can be excluded by intersecting with cand'. With the three vertices {d, g, e} in cand', ADJ>a[d] = {b, g, e} finally yields ADJ'>a[d] = {g, e}.
(7) A recursive search is performed using the newly obtained information: the algorithm returns to step (3) and the recursion depth increases by one.
(8) Vertex u is added to the prev set. When the search that began with vertex b in step (6) ends, the program returns to this step and adds b to the prev set.
From the calculation, the three maximal cliques in G are obtained: {a, b, d, e, g}, {b, c, e, f} and {e, g, h}. Denoting by Mv the set of maximal cliques with v as the minimum vertex, we obtain Mh = {e, g, h}, Mc = {b, c, e, f}, Ma = {a, b, d, e, g}.
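The whole example can be checked end to end. The FIG. 4 edge set is reconstructed here from the degrees and adjacency facts quoted in the text (treat the reconstruction as an assumption, since the figure itself is unavailable); a plain Bron–Kerbosch run then recovers exactly the three cliques listed:

```python
# FIG. 4 graph reconstructed from the stated degrees, adj(a), adj(b) and the cliques.
G = {
    "a": {"b", "d", "e", "g"},
    "b": {"a", "c", "d", "e", "f", "g"},
    "c": {"b", "e", "f"},
    "d": {"a", "b", "e", "g"},
    "e": {"a", "b", "c", "d", "f", "g", "h"},
    "f": {"b", "c", "e"},
    "g": {"a", "b", "d", "e", "h"},
    "h": {"e", "g"},
}

def bron_kerbosch(C, cand, prev, out):
    if not cand and not prev:
        out.append(C)              # maximal clique found
        return
    for u in sorted(cand):
        bron_kerbosch(C | {u}, cand & G[u], prev & G[u], out)
        cand = cand - {u}
        prev = prev | {u}

maximal = []
bron_kerbosch(set(), set(G), set(), maximal)
```

The degrees of the reconstruction match the text (D(a) = 4 through D(h) = 2), and `maximal` contains exactly {a, b, d, e, g}, {b, c, e, f} and {e, g, h}.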
The distributed maximal clique enumeration algorithm of the invention can be combined with a dual-mode engine to achieve parallelism both within and across nodes. The technology of the invention can be used in large graph data analysis, such as social network analysis and web mining.
The distributed-system-based maximal clique enumeration algorithm fully exploits information such as vertex locality and reduces a large amount of information transfer, with the following effects:
(1) Results can be computed rapidly as soon as the required graph data is provided;
(2) No dedicated dispatch server is needed for task allocation, and graph partitioning is performed automatically during graph reading;
(3) The invention fully considers the locality of the vertices, reduces the amount of information transmitted, and improves the efficiency of the distributed clique enumeration algorithm.
In accordance with the foregoing method, the present invention also provides a distributed-system-based device for maximal clique enumeration using a dual-mode engine. The device comprises a processor and a memory in which computer instructions are stored; the processor is adapted to execute the computer instructions stored in the memory, and when the computer instructions are executed by the processor the device implements the steps of the method described above.
The present invention also relates to a storage medium having stored thereon computer program code which, when executed, may implement various embodiments of the method of the present invention. The storage medium may be a tangible storage medium such as an optical disk, Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of tangible storage medium known in the art.
It should be understood that the invention is not limited to the particular arrangements and instrumentality described above and shown in the drawings. For the sake of brevity, a detailed description of known methods is omitted here. In the above embodiments, several specific steps are described and shown as examples. The method processes of the present invention are not limited to the specific steps described and shown, but various changes, modifications and additions, or the order between steps may be made by those skilled in the art after appreciating the spirit of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein can be implemented as hardware, software, or a combination of both. The particular implementation is hardware or software dependent on the specific application of the solution and the design constraints. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine readable medium or transmitted over transmission media or communication links by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium that can store or transfer information. Examples of machine-readable media include electronic circuitry, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio Frequency (RF) links, and the like. The code segments may be downloaded via computer networks such as the internet, intranets, etc.
It should also be noted that the exemplary embodiments mentioned in this disclosure describe some methods or systems based on a series of steps or devices. The present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, or may be performed in a different order from the order in the embodiments, or several steps may be performed simultaneously.
In this disclosure, features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations can be made to the embodiments of the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (9)
1. A method for maximal clique enumeration based on a distributed system, the method comprising the steps of:
numbering the vertices based on their attributes or positions in the undirected unweighted graph, so that vertices with similar attributes or positions have adjacent numbers;
Dividing the renumbered vertex set into a plurality of continuous numbered intervals according to the number, and distributing the vertex set corresponding to each continuous interval to each node so that each node corresponds to one numbered interval;
determining the ID value of each vertex based on the degree or degeneracy order of each vertex in the undirected unweighted graph;
In the process of graph calculation, information is transmitted between a node where a primary copy is located and a node where a corresponding mirror copy is located, so that each node obtains a first neighbor list of each primary copy based on the transmitted information, wherein the first neighbor list comprises a list of neighbors of which the ID is larger than the ID of the current primary copy;
each node obtains an adjacency list of each of its primary copy vertices and, based on the obtained adjacency list information, performs enumeration of maximal cliques using a search tree;
wherein the step of performing enumeration of maximal cliques using the search tree based on the obtained adjacency list information comprises the following steps:
for each master copy owned by each node, selecting a vertex from a candidate vertex set, wherein the candidate vertex set comprises vertexes which are not searched for the current master copy, are neighbors of the current master copy and have an ID larger than the current master copy;
if the selected vertex is connected by an edge to each vertex in the clique currently being enumerated, adding the selected vertex to the clique currently being enumerated;
performing an intersection operation between the candidate vertex set and the set of those neighbors of the selected vertex that are also neighbors of the current primary copy and have IDs larger than the current primary copy's, and taking the result of the intersection operation as the updated candidate vertex set;
performing an intersection operation between a second vertex set, which contains the vertices already searched that can be added to the clique currently being enumerated, and the set of those neighbors of the selected vertex that are also neighbors of the current primary copy and have IDs larger than the current primary copy's, and updating the second vertex set based on the result of the intersection operation.
2. The method according to claim 1, wherein the transferring information between the node where the primary copy is located and the node where the corresponding mirror copy is located, so that each node obtains a first neighbor list of each primary copy based on the transferred information, includes:
Determining whether the current active edge set is in a sparse state or a dense state;
in the case that the active edge set is in a sparse state, the node where the primary copy corresponding to a mirror copy is located pushes the primary copy information to the node where the mirror copy is located, and the node where the mirror copy is located updates the first neighbor lists of the primary copies on that node based on the received primary copy information;
And under the condition that the active edge set is in a dense state, the node where the mirror image copy is located counts all first neighbor lists of the mirror image copy in the node, and transmits the generated neighbor list to the corresponding master copy.
3. The method of claim 1, wherein the information transfer between the node where the master copy is located and the node where the corresponding mirror copy is located is performed using the Message Passing Interface (MPI).
4. The method of claim 3, wherein:
the information transfer between the node where the master copy is located and the node where the corresponding mirror copy is located using the Message Passing Interface (MPI) comprises: using dedicated threads to transfer information between the node where a master copy is located and the node where the corresponding mirror copy is located via MPI;
each node obtaining an adjacency list of each vertex and performing enumeration of maximal cliques using the search tree based on the obtained adjacency list information comprises: for any vertex in a node, if the node has already obtained the complete information needed to enumerate cliques for that vertex, performing maximal clique enumeration based on that vertex directly using the obtained information.
5. The method according to claim 1, wherein the method further comprises pruning with a pivot, comprising the following steps:
selecting a vertex as a principal component;
and removing the vertex set which is the neighbor of the current main copy and has the ID larger than the current main copy from the neighbor of the selected vertex in the candidate vertex set.
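The pivot pruning of claim 5 is the classic device from Bron-Kerbosch-style maximal clique search: only candidates outside the pivot's neighborhood need to be branched on, because any maximal clique containing a pivot neighbor is reached through some non-neighbor branch. A generic single-machine sketch (not the patent's distributed implementation):

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques with pivot pruning (Bron-Kerbosch style).
    adj maps each vertex to the set of its neighbors."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Choose a pivot with many neighbors in p: those neighbors are
        # pruned from the branching set.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques
```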
6. The method of claim 5, wherein selecting a vertex as the pivot comprises:
selecting, from the current candidate vertex set, the vertices that are both neighbors of the current master copy and have IDs larger than that of the current master copy; and
performing greedy coloring on the selected set, and selecting as the pivot the vertex whose neighborhood intersection covers the most colors.
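One plausible reading of claim 6's coloring-based pivot selection (an assumed interpretation, not the patent's exact procedure): greedily color the candidate set, then pick the candidate whose neighbors inside the set span the most distinct colors.

```python
def greedy_color(vertices, adj):
    """Assign each vertex the smallest color not used by its
    already-colored neighbors (sequential greedy coloring)."""
    color = {}
    for v in sorted(vertices):
        used = {color[u] for u in adj[v] & set(color)}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def select_pivot(candidates, adj):
    """Pick the candidate whose neighborhood inside the candidate set
    touches the most distinct colors (claim 6, as interpreted here)."""
    color = greedy_color(candidates, adj)
    return max(candidates,
               key=lambda v: len({color[u] for u in adj[v] & candidates}))
```

The intuition: a vertex whose neighborhood needs many colors sits in a dense region, so choosing it as pivot prunes more branches.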
7. The method according to claim 1, further comprising: recording vertices using a bitmap.
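Claim 7's bitmap bookkeeping can be sketched with Python's arbitrary-precision integers (an illustration under assumption; a native implementation would use fixed-width word arrays):

```python
class VertexBitmap:
    """Record vertex IDs as bits of one integer: O(1) set/test
    and compact storage compared with an explicit ID set."""
    def __init__(self):
        self.bits = 0

    def add(self, vid):
        self.bits |= 1 << vid

    def contains(self, vid):
        return (self.bits >> vid) & 1 == 1

    def count(self):
        return bin(self.bits).count("1")
```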
8. A distributed-system-based device for maximal clique enumeration, the device comprising a processor and a memory, wherein the memory stores computer instructions and the processor is adapted to execute the computer instructions stored in the memory, the device implementing the steps of the method according to any one of claims 1-7 when the computer instructions are executed by the processor.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011324463.5A CN114528439B (en) | 2020-11-23 | 2020-11-23 | Method and device for enumerating maximum groups based on distributed system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114528439A CN114528439A (en) | 2022-05-24 |
CN114528439B true CN114528439B (en) | 2024-06-14 |
Family
ID=81619629
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011324463.5A Active CN114528439B (en) | 2020-11-23 | 2020-11-23 | Method and device for enumerating maximum groups based on distributed system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114528439B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117312633B (en) * | 2023-11-07 | 2024-05-03 | 之江实验室 | Dynamic maximum group enumeration device and method based on FPGA with HBM |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101866355A (en) * | 2010-06-11 | 2010-10-20 | 北京邮电大学 | Social network partitioning method and system based on cloud computing |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050086384A1 (en) * | 2003-09-04 | 2005-04-21 | Johannes Ernst | System and method for replicating, integrating and synchronizing distributed information |
US8395622B2 (en) * | 2008-06-18 | 2013-03-12 | International Business Machines Corporation | Method for enumerating cliques |
CN102378375B (en) * | 2010-08-23 | 2014-05-07 | 华为技术有限公司 | Method and device for allocating communication resource |
CN105813235B (en) * | 2014-12-31 | 2019-04-16 | 中国电信股份有限公司 | The division method and system of mobile terminal client corporations |
US10055510B2 (en) * | 2015-11-04 | 2018-08-21 | International Business Machines Corporation | Method for detecting cliques in graphs |
CN106991195B (en) * | 2017-04-28 | 2020-08-11 | 南京大学 | Distributed subgraph enumeration method |
CN108712287B (en) * | 2018-05-22 | 2020-12-29 | 同济大学 | VANET community discovery method based on node similarity |
CN111737540B (en) * | 2020-05-27 | 2022-11-29 | 中国科学院计算技术研究所 | Graph data processing method and medium applied to distributed computing node cluster |
- 2020-11-23 CN CN202011324463.5A patent/CN114528439B/en active Active
Non-Patent Citations (1)
Title |
---|
Research on K-truss community search algorithms for temporal graphs; Li Ronghua et al.; Journal of Frontiers of Computer Science and Technology; 2019-12-25; pp. 1482-1488 *
Also Published As
Publication number | Publication date |
---|---|
CN114528439A (en) | 2022-05-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Dafir et al. | A survey on parallel clustering algorithms for big data | |
CN107679192B (en) | Multi-cluster cooperative data processing method, system, storage medium and equipment | |
Zhou et al. | Balanced parallel fp-growth with mapreduce | |
CN104820708A (en) | Cloud computing platform based big data clustering method and device | |
CN104809244A (en) | Data mining method and device in big data environment | |
CN107341210B (en) | C-DBSCAN-K clustering algorithm under Hadoop platform | |
CN102831613B (en) | Parallel fractural network evolution image segmentation method | |
Cao et al. | Multi-tactic distance-based outlier detection | |
CN111597230A (en) | Parallel density clustering mining method based on MapReduce | |
Ayall et al. | Graph computing systems and partitioning techniques: A survey | |
CN114528439B (en) | Method and device for enumerating maximum groups based on distributed system | |
CN107958266A (en) | It is a kind of based on MPI and be about to connection attribute carry out discretization method | |
Peng et al. | Vcolor: A practical vertex-cut based approach for coloring large graphs | |
CN110119268B (en) | Workflow optimization method based on artificial intelligence | |
Packiaraj et al. | Hypar-fca: a distributed framework based on hybrid partitioning for fca | |
CN112632118A (en) | Method, device, computing equipment and storage medium for querying data | |
Fu et al. | Research and application of DBSCAN algorithm based on Hadoop platform | |
CN109800231A (en) | A kind of real-time track co-movement motion pattern detection method based on Flink | |
Wu et al. | Mining skyline patterns from big data environments based on a spark framework | |
CN110135747B (en) | Flow customization method based on neural network | |
Ding et al. | Efficient k-dominant skyline query over incomplete data using MapReduce | |
Gupta et al. | Distributed Incremental Graph Analysis | |
Zhang et al. | Bring orders into uncertainty: enabling efficient uncertain graph processing via novel path sampling on multi-accelerator systems | |
Shao et al. | Large-scale Graph Analysis: System, Algorithm and Optimization | |
Sagharichian et al. | iPartition: a distributed partitioning algorithm for block-centric graph processing systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||