[go: up one dir, main page]
More Web Proxy on the site http://driver.im/

CN114970684A - Community detection method for extracting network core structure by combining VAE - Google Patents

Community detection method for extracting network core structure by combining VAE Download PDF

Info

Publication number
CN114970684A
CN114970684A CN202210483834.7A CN202210483834A CN114970684A CN 114970684 A CN114970684 A CN 114970684A CN 202210483834 A CN202210483834 A CN 202210483834A CN 114970684 A CN114970684 A CN 114970684A
Authority
CN
China
Prior art keywords
matrix
community
network
core structure
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210483834.7A
Other languages
Chinese (zh)
Inventor
费蓉
万雨欣
李爱民
王战敏
马梦阳
吴丽丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202210483834.7A priority Critical patent/CN114970684A/en
Publication of CN114970684A publication Critical patent/CN114970684A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a community detection method for extracting a network core structure by combining VAE. The method comprises the steps of searching core information in a network by using a k-tress algorithm in combination with topological structure information and core structure information in a real network, performing feature dimensionality reduction and feature extraction on a similarity matrix with a network core structure by using a variational automatic encoder model (VAE), extracting a mean value and a logarithmic variance in a training process, generating random numbers with different features according to samples with different features, and generating data according to the normally distributed random numbers. The invention uses the coding part to reduce the data dimension, minimizes the reconstruction error and the divergence loss to obtain the optimal solution. The existing algorithm mostly utilizes a classical system, has no obvious effect on a complex network, and has long experiment time and great difficulty. Compared with other methods, the method has better effect, and can obtain a complete and more accurate community division result.

Description

Community detection method for extracting network core structure by combining VAE
The technical field is as follows:
the invention relates to a novel method for carrying out community detection on a real network by combining a logic analysis technology and an automatic encoder model, belonging to the technical field of community detection and particularly combining a k-tress algorithm and a variational automatic encoder model (VAE).
The background art comprises the following steps:
with the continuous development of the internet and artificial intelligence technologies, social networks have appeared and gradually emerged in our daily lives, which can show real social groups consisting of people with the same interests, professions and abilities. At the heart of a social network is a user and the relationship between users. The nodes of a real social network can typically reach tens of millions or even billions, and such networks are often complex. Therefore, for the above situations, a method must be used to discover and identify these real complex network structures. The community detection technology is an extremely important method for understanding and exploring a plurality of real complex network structures in the real world, and after the 21 st century, the research and analysis of the community structures in the complex networks are concerned by a plurality of scholars, and a plurality of novel community detection algorithms also appear.
Most of traditional community discovery algorithms are researched aiming at the topological structure of a community network, but a network core structure often exists in the community network, and the core structure usually plays a key role in network structure analysis. The network core structure is represented as a compact and dense point set connected among nodes, and is closer to the definition of a community structure in which nodes in the community are connected densely and nodes outside the community are connected sparsely. Meanwhile, the characteristics of the core structure in the network can be better extracted by training by combining community information in a real network and a variational automatic encoder model. Therefore, aiming at the situations, the network topology structure and the core structure in the network are comprehensively considered for carrying out comprehensive analysis so as to obtain a better result and improve the existing community detection algorithm.
Disclosure of Invention
The invention provides a community detection method for extracting a network core structure by combining VAE, which is characterized in that a k-tress algorithm is used for finding a k-tress structure in a network, and the method specifically comprises the following steps:
step 1: carrying out data preprocessing on an original data set, establishing an adjacent matrix of a network diagram, and generating a similarity matrix of the adjacent matrix;
step 2: performing feature dimension reduction and feature extraction on the similarity matrix with the network core structure by using a variational automatic encoder model;
and step 3: the algorithm model based on the variational automatic encoder consists of a community information extraction module and a deep learning module, and is combined with community information to perform adjustment training under a given deep learning framework to obtain low-dimensional feature representation of a network structure;
and 4, step 4: and 3, clustering the low-dimensional feature matrix obtained in the step 3 by using a K-means algorithm to finally obtain a community division result.
Further, step 1 specifically includes:
1.1: acquiring data in an original data set, and establishing an original network diagram;
1.2: establishing an adjacency matrix for the original network diagram in the step 1.1;
1.3: calculating the support degree of each edge in the original network graph in the step 1.1;
1.4: calculating a k-tress core structure in the original network diagram in the step 1.3;
1.5: calculating the similarity of each node according to the adjacency matrix in the step 1.2 and the k-tress core structure in the step 1.4;
1.6: and (4) according to the adjacent matrix in the step 1.2 and the similarity of each node in the step 1.5, calculating and generating a similarity matrix.
Further, in step 1.2, given graph G ═ V, E, where V is the set of nodes in graph G and E is the set of edges in graph G, let n (u) be the set of adjacent nodes to node u, let matrix a ═ a ij ] n×n For the adjacency matrix of FIG. G, n represents the dimension of the matrix, when a ij When 1, it represents an edge e ij Exist when a ij When equal to 0, represents an edge e ij Is absent.
Further, step (ii)In step 1.5, for graph G ═ V, E, i, j ∈ V, E ij E, adjacency matrix A ═ a ij ] n×n Edge e ij Support degree of (e) ij ) Edge e ij The similarity of (c) defines:
Figure BDA0003628554920000031
further, in step 1.6, for graph G ═ (V, E), its similarity matrix is defined as X ═ X ij ] n×n Wherein x is ij =Sim(e ij ),i,j∈V,e ij ∈E。
Further, step 2 specifically includes:
2.1: establishing a variation automatic encoder model;
2.2: inputting the similarity matrix obtained in the step 1 into a coding part of a variational automatic coder model according to a formula H i =σ(W 1 X i +b 1 ) Calculating the forward propagation process to obtain a hidden layer representation H i
2.3: extracting each hidden layer representation H in step 2.2 i Mean and logarithmic variance of;
2.4: sampling the random parameter ε from a standard normal distribution N (0,1) i Update H i =μ i + ε i ×σ i Will update the H i As input to the decoding section;
2.5: h obtained in step 2.4 i Input into the decoding part of the variational automatic encoder model according to the formula Y i =σ(W 2 H i +b 2 ) Calculating the forward propagation process to obtain data and generating Y i
2.6: updating neural network parameter variables W through an error back propagation algorithm 1 ,W 2 ,b 1 ,b 2
2.7: when the training target required by the model is reached, continuing the next step, otherwise, repeatedly executing the step training step;
2.8: when the training times required by the model are reached, continuing the next step, otherwise, repeatedly executing the training step;
2.9: when the requirements of step 2.7 and step 2.8 are met, the matrix trained by the variational automatic encoder model is obtained.
Further, in step 2.6, the cross entropy loss function in the variational automatic encoder is defined as:
Figure BDA0003628554920000041
wherein V is the number of network graph vertexes, X i Is the ith column vector, Y, of the matrix in the input layer i Is the ith column vector of the matrix in the output layer;
the KL divergence function is defined as:
Figure BDA0003628554920000042
wherein, mu i Is taken as the average value of the values,
Figure BDA0003628554920000043
is variance, m is hidden layer H i Of (c) is calculated.
Further, step 3 specifically includes:
3.1 extracting a community information matrix C;
3.2, establishing a variational automatic encoder model, wherein the settings are the same as the settings in the step 2 except for the total loss function;
3.3, inputting the matrix obtained in the step 2.9 into the variational automatic encoder model in the step 3.2, and obtaining a low-dimensional feature matrix through training.
Further, in step 3.1, the definition of the community information matrix C:
for the network graph G, for the obtained community information matrix C ∈ R |V|×m Where m is the number of communities;
each row vector of matrix C represents the probability that a different vertex belongs to a certain community.
Further, in step 3.2, a layer and mean μ is added x The same dimension vector mu c Namely the average value of the community information distribution,
Figure BDA0003628554920000051
is the variance. The corner mark x represents adjacency information, and the corner mark c represents community information. After community information is added, the new KL divergence loss function is changed into:
Figure BDA0003628554920000052
wherein m is a hidden layer H i Of (c) is calculated.
The algorithm model in the invention can be applied to reveal the aggregation behavior of discovering a certain real network, namely discovering and analyzing the community structure in the real network. In a complex real network, feature extraction and dimension reduction can be carried out on a matrix established according to the network, clustering is carried out, and division of community structures is completed. In the social network which is most closely related to the internet, the social network platform can establish a network information matrix according to information uploaded by a user and various information such as age, sex, profession, hobby, use time and the like, and the division of different communities is completed through the algorithm model in the invention. The method can be generally used for social software friend circle analysis, professional introduction analysis, friend discovery and other related functions. In other application fields, such as biology, the method can be used for analyzing complex food chain network structures, and can also be used for feature extraction and dimension reduction of complex protein networks to complete the division of different protein structures.
Drawings
Fig. 1 is a main flow diagram of the present invention.
FIG. 2 is a detailed flow chart of specific steps of the present invention.
FIG. 3 is a diagram of a variation autoencoder model of the present invention.
Fig. 4 is a diagram of a small network topology structure, which specifically illustrates an example of a k-tress network core structure according to the present invention.
Fig. 5(a) -5 (C) illustrate the community partitioning effect on the Dolphins network, and the numbers on the nodes represent the community labels into which the nodes are partitioned. In this case, the graph (a) shows a real community division situation, the graph (B) shows a community division situation according to the present invention, and the graph (C) shows a community division situation of the FUA algorithm.
Fig. 6(a) -6 (C) are community partitioning effects on the Football network, and the numbers on the nodes represent the community labels into which the nodes are partitioned. In this case, the graph (a) shows a real community division situation, the graph (B) shows a community division situation according to the present invention, and the graph (C) shows a community division situation of the FUA algorithm.
Detailed Description
The technical solution of the present invention is further described below by way of examples with reference to the accompanying drawings.
Example (b):
the following describes a specific method of this embodiment with reference to flowcharts shown in fig. 1 and fig. 2.
The data set used by the invention is a data set which is published abroad, and the required data set is based on a real network community, so that the number of the data sets is huge. Directly inputting data into a neural network can result in inefficient neural network processing. Meanwhile, in a small simple topological structure, the community relation in the graph can be calculated by directly using a clustering algorithm, and the result is accurate. However, the adjacency matrix only records the relationships between adjacent nodes, and for any two nodes in the community that are not connected, they may exist in the same community. The relationship for describing the nodes in the community structure cannot only use the adjacency matrix, and if the used community detection algorithm is directly clustered on the basis of the adjacency matrix, the precision degree of community division is influenced.
Therefore, the invention provides a data preprocessing method for searching a network core structure based on a k-tress algorithm, the k-tress algorithm is used for finding the k-tress structure in a network, the definition of the structure is more consistent with the network core structure, meanwhile, a characteristic matrix with the network core structure is generated by combining a network topological structure, then the matrix is sent to a variation automatic encoder for training, the variation automatic encoder can well extract and reduce the dimension of the network core structure, and then secondary training is carried out by combining community information and the variation automatic encoder, so that the effect of community division is improved.
As shown in the figure, the community detection method for extracting a network core structure by combining with a VAE in this embodiment specifically includes the following steps:
step 1: and carrying out data preprocessing on the original data set, establishing an adjacency matrix of the network graph, and generating a similarity matrix of the adjacency matrix.
The specific steps of data preprocessing are as follows:
step 1.1: and acquiring data in the original data set, and establishing an original network diagram.
Step 1.2: an adjacency matrix is established for the original network graph in step 1.1.
Given graph G ═ V, E, V being the set of nodes in graph G, E being the set of edges in graph G, let n (u) be the set of adjacent nodes to node u, let matrix a ═ a ij ] n×n For the adjacency matrix of FIG. G, n represents the dimension of the matrix, when a ij When 1, it represents an edge e ij Exist when a ij When equal to 0, represents an edge e ij Is absent.
Step 1.3: and (4) calculating the support degree of each edge in the original network graph in the step 1.1.
For the graph G ═ V, E and its subgraph G ', the support degree of the edge E in the subgraph G ' is calculated, and sup (E, G ') represents the support degree of the edge E in the subgraph G ', that is, the number of triangles including the edge E in the subgraph G ' is abbreviated as sup (E).
Step 1.4: the k-tress core structure in the original network map in step 1.3 is computed.
For graph G ═ V, E, sub-graph G is defined k (k.gtoreq.2) is a k-tress structure in the graph G, and a subgraph G is defined k The support degree of any edge e is more than or equal to (k-2). Is marked as
Figure BDA0003628554920000082
Figure BDA0003628554920000083
Step 1.5: and calculating the similarity of each node according to the adjacency matrix in the step 1.2 and the k-tress core structure in the step 1.4.
For graph G ═ V, E), i, j ∈ V, E ij E, adjacency matrix A ═ a ij ] n×n Side e of ij Support degree of (e) ij ) Edge e ij The similarity of (c) defines:
Figure BDA0003628554920000081
step 1.6: and according to the similarity between the adjacency matrix in the step 1.2 and each node in the step 1.5, calculating a similarity matrix:
for graph G ═ (V, E), its similarity matrix is defined as X ═ X ij ] n×n Wherein x is ij =Sim(e ij ),i,j∈V,e ij ∈E。
Step 2: and performing feature dimension reduction and feature extraction on the similarity matrix with the network core structure by using a variational automatic encoder model.
And extracting the mean value and the logarithmic variance in the training process, generating random numbers with different characteristics according to samples with different characteristics, and generating data according to the random numbers with normal distribution. And the data dimension is reduced by utilizing the coding part, and the optimal solution is obtained by minimizing the reconstruction error and the divergence loss.
The specific steps of the transform automatic encoder module in step 2 are as follows:
step 2.1: and establishing a variational automatic encoder model.
With reference to fig. 3, a variational autoencoder model is first established. Then there are:
H i =σ(W 1 X i +b 1 ),Y i =σ(W 2 H i +b 2 )
wherein H i Is the ith column vector of the matrix in the hidden layer; σ is an activation function, such as Sigmoid function; x i Is the ith column vector, Y, of the matrix in the input layer i Is the ith column vector, i table, of the matrix in the output layerDenotes any one of numbers 1 to | V |, W 1 And W 2 Respectively are a weight matrix of the input layer and a weight matrix of the hidden layer; b 1 And b 2 Respectively, the bias vector of the input layer and the bias vector of the hidden layer. The former expression is the coding part, and the latter expression is the decoding part.
Step 2.2: inputting the similarity matrix obtained in the step 1 into a coding part of a variational automatic coder model according to a formula H i =σ(W 1 X i +b 1 ) Calculating the forward propagation process to obtain a hidden layer representation H i And i represents any one of numbers 1 to | V |.
Step 2.3: extracting each hidden layer representation H in step 2.2 i Mean and log variance of.
Setting the mean value of the parameters of standard normal distribution as mu and the logarithmic variance as log sigma 2 . Construction of neural network mu i =f(X i ) And
Figure BDA0003628554920000091
f is the activation function. Wherein, X i Representing the input layer vector, mu i Represents a slave hidden layer H i The average value vector of the extracted average value vector,
Figure BDA0003628554920000092
represents a slave hidden layer H i The extracted log variance vector, f represents an activation function such as a Sigmoid function, and i represents any one of numbers 1 to | V |.
Step 2.4: sampling the random parameter ε from a standard normal distribution N (0,1) i Update H i =μ ii ×σ i Will update the H i As input to the decoding section.
Wherein H i Represents the updated i-th column vector, mu, of the hidden layer i Representing a slave hidden layer H i Extracted ith column mean vector, σ i Represents a slave hidden layer H i Extracted variance vector, ε i Denotes a random parameter (how i is interpreted) sampled from a standard normal distribution N (0,1), i denotes any one of numbers 1 to | V |.
Step 2.5: h obtained in step 2.4 i Input into the decoding part of the variational automatic encoder model according to the formula Y i =σ(W 2 H i +b 2 ) Calculating the forward propagation process to obtain data and generating Y i ,Y i Represents the ith column vector of the output layer, i represents any number from 1 to | V |.
Step 2.6: updating neural network parameter variables W through an error back propagation algorithm 1 ,W 2 ,b 1 ,b 2
Cross entropy loss function L in variational automatic encoder 1 Is defined as:
Figure BDA0003628554920000101
wherein V is the number of network graph vertexes, X i Is the ith column vector, Y, of the matrix in the input layer i Is the ith column vector of the matrix in the output layer, i represents any number from 1 to | V |.
In order to harmonize the proportional degree of the two loss functions, the variational automatic encoder introduces the concept of KL divergence, and the essence of the KL divergence is the regular specification of a discriminant model. KL divergence function L 2 Is defined as:
Figure BDA0003628554920000102
wherein, mu i Is taken as the mean value of the average value,
Figure BDA0003628554920000103
is variance, m is hidden layer H i I represents any number of 1 to | V |.
Total Loss function Loss sum Is defined as:
Loss sum =L 1 +L 2
wherein L is 1 As a cross-entropy loss function, L 2 Is a KL divergence function.
Finally, by error reverse transmissionBroadcast algorithm update parameter variable W 1 ,W 2 ,b 1 ,b 2
Step 2.7: and when the training target required by the model is reached, continuing the next step, otherwise, repeatedly executing the step training step.
Step 2.8: and when the training times required by the model are reached, continuing the next step, otherwise, repeatedly executing the step training step.
Step 2.9: when the requirements of step 2.7 and step 2.8 are met, the matrix trained by the variational automatic encoder model is obtained.
The feature dimension reduction and extraction module performs feature dimension reduction and feature extraction on the similarity matrix with the network core structure by using a variational automatic encoder model, extracts a mean value and a logarithmic variance in a training process, generates random numbers with different characteristics according to samples with different characteristics, and generates data according to the random numbers in normal distribution. And the data dimension is reduced by utilizing the coding part, and the optimal solution is obtained by minimizing the reconstruction error and the divergence loss.
And step 3: the algorithm model based on the variational automatic encoder is composed of a community information extraction module and a deep learning module, adjustment training is carried out under a given deep learning framework by combining community information, low-dimensional feature representation of a network structure is obtained, and the method specifically comprises the following steps:
and 3.1, extracting a community information matrix C.
For the network graph G, for the obtained community information matrix C ∈ R |V|×m
Where m is the number of communities. Each row vector of matrix C represents the probability that a different vertex belongs to a certain community.
Step 3.2 establishes a variational autoencoder model, and the settings are the same as those in step 2 except for the total loss function.
The modification of the loss function is as follows:
adding a layer and the mean value mu x The same dimension vector mu c Namely the average value of the community information distribution,
Figure BDA0003628554920000121
is the variance. The corner mark x represents adjacency information, and the corner mark c represents community information. New KL divergence loss function after adding community information
Figure BDA0003628554920000122
The updating is as follows:
Figure BDA0003628554920000123
wherein m is a hidden layer H i I represents any number of 1 to | V |.
The total loss function is defined as:
Figure BDA0003628554920000124
wherein L is 1 As a cross-entropy loss function, L 2 Is the modified KL divergence function.
And 3.3, inputting the matrix obtained in the step 2.9 into the variational automatic encoder model in the step 3.2, and obtaining a low-dimensional characteristic matrix through training.
And 4, step 4: and 3, clustering the low-dimensional feature matrix obtained in the step 3 by using a K-means algorithm, setting the clustering number as the number of the types of the real community labels, and finally obtaining a community division result.
Examples
Fig. 4 is a diagram of a small network topology structure, which specifically illustrates an example of a k-tress network core structure according to the present invention.
Taking fig. 4 (a topology structure diagram of a small network) as an example, a core network structure (k-tress structure) is specifically described. And when the two nodes have connecting edges, the two nodes are connected by a straight line.
FIG. 4 is a 2-tress structure, where a sub-graph represented by a light color part in the graph is a 3-tress structure, that is, each edge in the sub-graph belongs to at least one triangle, and a sub-graph represented by a dark color part in the graph is a 4-tress structure and is also a 5-tress structure, that is, each edge in the sub-graph belongs to at least two or three triangles. Therefore, the k-tress structure is a layered and progressive subgraph structure, and network cores with different density degrees are depicted by different k values. The invention comprehensively utilizes the k-tress structure to search the network core structure so as to achieve better community division effect.
The invention selects the real network community data set for verification, thereby analyzing and verifying the feasibility and the accuracy of the proposed algorithm model.
The present invention uses ten real datasets: strike, Karate, Dolphins, Football, Email _ Eu, Polblogs, Cora, Citeser, Youtube, and DBLP.
Wherein, Strike, karte, Dolphins and Football belong to small datasets (ten and hundred orders of magnitude), Email _ Eu, Polblogs, Cora and Citeseer belong to medium datasets (thousand orders of magnitude), Youtube and DBLP belong to large datasets (ten thousand orders of magnitude).
The evaluation criteria are set to detect whether the final clustering result meets the expected criteria, and three evaluation indexes are selected in the text. Where Q is used to measure the quality of the final community partition, and NMI and F same The evaluation method is an evaluation method for analyzing according to a real community set.
Given a set of clustering results C ═ { C ═ C 1 ,C 2 ,C 3 …C m I.e. communities found by the algorithm. Set of real communities C '═ C' 1 ,C' 2 ,C' 3 ,…C' n And f, namely, the real existing communities, m is the number of community partition and clustering results, and n is the number of real communities.
The modularity Q is used to measure the quality of the final community partition, which can evaluate the goodness of a community partition. Is defined as:
Figure BDA0003628554920000141
where d is the total number of edges in graph C', m is the number of community partition clustering results,
Figure BDA0003628554920000142
is a drawing C i The total number of the inner edges in (b),
Figure BDA0003628554920000143
is a drawing C i Total number of middle outer edges.
The NMI method is English called Normalized Mutual Information, which means normalization of Mutual Information. NMI is defined as:
Figure BDA0003628554920000144
wherein X is a confusion matrix and X ∈ R (n×n) In the confusion matrix X, rows represent the real community set C', and columns represent the clustering result set C. X ij Representative graph C i And C' j I.e. the overlapping nodes of the two graphs. The more similar the algorithm finds a community to a truly existing community, the closer the NMI is to 1. Conversely, the closer the NMI is to 0.
F same The method comprises the steps of calculating the intersection of communities discovered by each algorithm and the communities in reality, and calculating the mean value of the obtained result. Is defined as:
Figure BDA0003628554920000145
where V is the total number of nodes in the graph C', m is the number of community partition clustering results, and n is the number of real communities.
As described above, the present invention provides a community detection method for extracting a network core structure in combination with a VAE, based on the practical point of view, by combining a network topology in a real network and a core structure in the network, and the method mainly includes the following modules:
1. a data preprocessing module: because the data set used by the invention is a data set disclosed abroad and the data set is needed by experiments to analyze the network topology structure and the core structure in the network in the real network, the core structure in the real network can be well searched by using the k-tress algorithm. The data set used therefore needs to be preprocessed.
2. The feature dimension reduction and extraction module comprises: and performing feature dimensionality reduction and feature extraction on the similarity matrix with the network core structure by using a variational automatic encoder model, extracting a mean value and a logarithmic variance in a training process, generating random numbers with different characteristics according to samples with different characteristics, and generating data according to the random numbers in normal distribution. And the data dimension is reduced by utilizing the coding part, and the optimal solution is obtained by minimizing the reconstruction error and the divergence loss.
3. The community information combining module: in order to effectively combine community information with a variational automatic encoder, an algorithm model based on the variational automatic encoder consists of a community information extraction module and a deep learning module. And adjusting and training under a given deep learning framework by combining community information to obtain the low-dimensional feature representation of the network structure.
4. A clustering module: based on the modules 1, 2 and 3, clustering operation is carried out on the low-dimensional feature matrix obtained after the variational automatic encoder model is trained by utilizing a K-means clustering algorithm, and finally community structure division is completed.
The algorithm model in the invention can be applied to reveal the aggregation behavior of discovering a certain real network, namely discovering and analyzing the community structure in the real network. In a complex real network, feature extraction and dimension reduction can be carried out on a matrix established according to the network, clustering is carried out, and division of community structures is completed. In the social network which is most closely related to the internet, the social network platform can establish a network information matrix according to information uploaded by a user and various information such as age, sex, profession, hobby, use time and the like, and the division of different communities is completed through the algorithm model in the invention. The method can be generally used for social software friend circle analysis, professional introduction analysis, friend discovery and other related functions. In other application fields, such as biology, the method can be used for analyzing the complex food chain network structure, and can also be used for feature extraction and dimension reduction of the complex protein network to complete the division of different protein structures.
Most of traditional community discovery algorithms are researched aiming at the topological structure of a community network, but the core structure in the community network can greatly influence the accuracy of a community detection algorithm, and the final community division can be influenced by a method for reducing the dimension of a characteristic matrix and a clustering process. The invention comprehensively considers the factors, realizes the community detection method for extracting the network core structure by combining the VAE, has better experimental effect on three evaluation indexes compared with other methods through a plurality of experiments, and can obtain a complete and more accurate community division result. The method has the advantages that the method has a good community division effect in the medium-sized, small-sized and dense complex real network, has a certain community division effect in the medium-sized, large-sized and sparse complex real network, and is higher in the accuracy of community structure division compared with other traditional algorithms.
Examples
The experimental effect of the invention can be illustrated by the following simulation experiment:
the experimental environment of the invention is based on a Windows 10 operating system, the software environment is Python language, and the hardware environment is Intel Core i7-9700K CPU and 48GB memory capacity.
Dolphins and Football data sets were selected for comparative experiments using the present invention and the FUA algorithm. The FUA algorithm is a widely used community detection algorithm based on modularity, and the optimization goal of the algorithm is to maximize the modularity of the entire community network.
FIGS. 5(A) - (C) illustrate the effect of community partitioning on a Dolphins network, where the numbers on a node represent the community labels into which the node is partitioned. In this case, the graph (a) shows a real community division situation, the graph (B) shows a community division situation according to the present invention, and the graph (C) shows a community division situation of the FUA algorithm.
As shown in fig. 5, the Dolphins network has 62 nodes, 159 edges, and 2-class community labels in total, the numbers on the nodes represent the community labels into which the nodes are divided, graph (a) shows the real community division, graph (b) shows the community division according to the present invention, and graph (c) shows the community division of the FUA algorithm. It can be found that in the Dolphins network, the community division condition of the invention is completely consistent with the real community division condition, while the FUA algorithm divides the network into 5 communities in error, and the division accuracy is low and is greatly different from the real community division condition.
Fig. 6(a) - (C) are the community partitioning effect on the Football network, and the numbers on the nodes represent the community labels into which the nodes are partitioned. Among them, the graph (a) shows the real community division, the graph (B) shows the community division of the present invention, and the graph (C) shows the community division of the FUA algorithm.
As shown in fig. 6, the Football network has 180 nodes, 788 edges and 12 types of community labels, the numbers on the nodes represent the community labels into which the nodes are divided, the graph (a) represents the real community division condition, the graph (b) represents the community division condition of the invention, and the graph (c) represents the community division condition of the FUA algorithm. It can be found that, on the Football network, the community division condition of the present invention only differs from the real community division condition by one node (the node in the square frame in the graph (b), the real community label is "0", and the community label divided by the present invention is "5"), while the FUA algorithm divides the network into 9 communities in error, and the division accuracy is low, which is greatly different from the real community division condition.
Therefore, compared with the traditional algorithm, the community division effect of the invention is more accurate and is closer to a real network.

Claims (10)

1. A community detection method for extracting a network core structure in combination with VAE is characterized in that a k-tress algorithm is used for finding a k-tress structure in a network, and the method specifically comprises the following steps:
step 1: carrying out data preprocessing on an original data set, establishing an adjacent matrix of a network diagram, and generating a similarity matrix of the adjacent matrix;
step 2: performing feature dimension reduction and feature extraction on the similarity matrix generated in the step (1) by using a variation automatic encoder model, wherein the variation automatic encoder model consists of a community information extraction module and a deep learning module;
and step 3: adjusting and training the variational automatic encoder model under a given deep learning framework by combining community information, and obtaining a low-dimensional characteristic matrix of a network structure;
and 4, step 4: and 3, clustering the low-dimensional feature matrix obtained in the step 3 by using a K-means algorithm to finally obtain a community division result.
2. The community detection method for extracting network core structure in combination with VAE according to claim 1,
the step 1 specifically comprises the following steps:
1.1: acquiring data in an original data set, and establishing an original network diagram;
1.2: establishing an adjacency matrix for the original network diagram in the step 1.1;
1.3: calculating the support degree of each edge in the original network graph in the step 1.1;
1.4: calculating a k-tress core structure in the original network diagram in the step 1.3;
1.5: calculating the similarity of each node according to the adjacency matrix in the step 1.2 and the k-tress core structure in the step 1.4;
1.6: and (4) according to the adjacent matrix in the step 1.2 and the similarity of each node in the step 1.5, calculating and generating a similarity matrix.
3. The method for community detection of extracted network core structure in combination with VAE according to claim 2,
in step 1.2, given graph G ═ V, E, where V is the set of nodes in graph G and E is the set of edges in graph G, let n (u) be the set of adjacent nodes to node u, let matrix a ═ a ij ] n×n For the adjacency matrix of FIG. G, n represents the dimension of the matrix, when a ij When 1, it represents an edge e ij Exist when a ij When equal to 0, represents an edge e ij Is absent.
4. The method for community detection of extracted network core structure in combination with VAE according to claim 3,
in step 1.5, for graph G ═ (V, E), i,j∈V,e ij e, adjacency matrix A ═ a ij ] n×n Edge e ij Support degree of (e) ij ) Edge e ij The similarity of (c) defines:
Figure FDA0003628554910000021
5. the method for community detection of extracted network core structure in combination with VAE according to claim 4,
in step 1.6, for graph G ═ (V, E), its similarity matrix is defined as X ═ X ij ] n×n Wherein x is ij =Sim(e ij ),i,j∈V,e ij ∈E。
6. The method for community detection of extracted network core structure in combination with VAE according to claim 1,
the step 2 specifically comprises the following steps:
2.1: establishing a variation automatic encoder model;
2.2: inputting the similarity matrix obtained in the step 1 into a coding part of a variational automatic coder model according to a formula H i =σ(W 1 X i +b 1 ) Calculating the forward propagation process to obtain a hidden layer representation H i I represents any one of numbers 1 to | V |;
2.3: extracting each hidden layer representation H in step 2.2 i Mean and logarithmic variance of;
2.4: sampling the random parameter ε from a standard normal distribution N (0,1) i Update H i =μ ii ×σ i Will update the H i As input to the decoding section;
2.5: h obtained in step 2.4 i Input into the decoding part of the variational automatic encoder model according to the formula Y i =σ(W 2 H i +b 2 ) Calculating the forward propagation process to obtain data and generating Y i ,Y i Represents the ith column vector of the output layer, i represents any number from 1 to | V |;
2.6: updating neural network parameter variables W through an error back propagation algorithm 1 ,W 2 ,b 1 ,b 2
2.7: when the training target required by the model is reached, continuing the next step, otherwise, repeatedly executing the step training step;
2.8: when the training times required by the model are reached, continuing the next step, otherwise, repeatedly executing the training step;
2.9: and when the requirements of the step 2.7 and the step 2.8 are met, obtaining the low-dimensional feature matrix trained by the variational automatic encoder model.
7. The method for community detection of extracted network core structure in combination with VAF according to claim 6,
in step 2.6, the cross entropy loss function in the variational automatic encoder is defined as:
Figure FDA0003628554910000031
wherein V is the number of network graph vertexes, X i Is the ith column vector, Y, of the matrix in the input layer i Is the ith column vector of the matrix in the output layer, i represents any number from 1 to | V |;
KL divergence function L 2 Is defined as:
Figure FDA0003628554910000041
wherein, mu i Is taken as the mean value of the average value,
Figure FDA0003628554910000042
is variance, m is hidden layer H i I represents any number of 1 to | V |.
8. The method for community detection of extracted network core structure in combination with VAE according to claim 1,
the step 3 specifically comprises:
3.1 extracting a community information matrix C;
3.2, establishing a variational automatic encoder model, wherein the settings are the same as those in the step 2 except for the total loss function;
3.3, inputting the matrix obtained in the step 2.9 into the variational automatic encoder model in the step 3.2, and obtaining a low-dimensional feature matrix through training.
9. The method for community detection of abstracted network core architecture in combination with VAE according to claim 8,
in step 3.1, defining a community information matrix C:
for the network graph G, for the obtained community information matrix C ∈ R |V|×m Where m is the number of communities and each row vector of matrix C represents the probability that a different vertex belongs to a community.
10. The method for community detection of abstracted network core architecture in combination with VAE according to claim 9,
in step 3.2, add one layer and mean μ x Vectors μ of the same dimension c Namely the average value of the community information distribution,
Figure FDA0003628554910000051
is the variance. The corner mark x represents adjacency information, and the corner mark c represents community information. After community information is added, the new KL divergence loss function is changed into:
Figure FDA0003628554910000052
wherein m is a hidden layer H i I represents any number of 1 to | V |.
CN202210483834.7A 2022-05-05 2022-05-05 Community detection method for extracting network core structure by combining VAE Pending CN114970684A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210483834.7A CN114970684A (en) 2022-05-05 2022-05-05 Community detection method for extracting network core structure by combining VAE

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210483834.7A CN114970684A (en) 2022-05-05 2022-05-05 Community detection method for extracting network core structure by combining VAE

Publications (1)

Publication Number Publication Date
CN114970684A true CN114970684A (en) 2022-08-30

Family

ID=82980867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210483834.7A Pending CN114970684A (en) 2022-05-05 2022-05-05 Community detection method for extracting network core structure by combining VAE

Country Status (1)

Country Link
CN (1) CN114970684A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115964626A (en) * 2022-10-27 2023-04-14 河南大学 Community detection method based on dynamic multi-scale feature fusion network

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115964626A (en) * 2022-10-27 2023-04-14 河南大学 Community detection method based on dynamic multi-scale feature fusion network

Similar Documents

Publication Publication Date Title
CN109918528A (en) A kind of compact Hash code learning method based on semanteme protection
CN107562812A (en) A kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space
CN111667022A (en) User data processing method and device, computer equipment and storage medium
CN112308115B (en) Multi-label image deep learning classification method and equipment
CN111414461A (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN111222847B (en) Open source community developer recommendation method based on deep learning and unsupervised clustering
CN112732921B (en) False user comment detection method and system
CN115688024B (en) Network abnormal user prediction method based on user content characteristics and behavior characteristics
Mohammadi et al. Improving linear discriminant analysis with artificial immune system-based evolutionary algorithms
Chen et al. Binarized neural architecture search for efficient object recognition
CN116386899A (en) Graph learning-based medicine disease association relation prediction method and related equipment
CN117349743A (en) Data classification method and system of hypergraph neural network based on multi-mode data
Li et al. Time series clustering based on normal cloud model and complex network
CN114897085A (en) Clustering method based on closed subgraph link prediction and computer equipment
CN114970684A (en) Community detection method for extracting network core structure by combining VAE
Ahan et al. Social network analysis using data segmentation and neural networks
CN115238075B (en) Text sentiment classification method based on hypergraph pooling
CN115936014B (en) Medical entity code matching method, system, computer equipment and storage medium
Li et al. Evaluating BERT on cloud-edge time series forecasting and sentiment analysis via prompt learning
CN114840717B (en) Graph data-oriented mining method and device, electronic equipment and readable storage medium
CN111767825B (en) Face attribute invariant robustness face recognition method and system
CN115801152A (en) WiFi action identification method based on hierarchical transform model
Zhao et al. PENet: A phenotype encoding network for automatic extraction and representation of morphological discriminative features
CN114818681A (en) Entity identification method and system, computer readable storage medium and terminal
Phuc et al. Using SOM based graph clustering for extracting main ideas from documents

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination