CN113095948A - Multi-source heterogeneous network user alignment method based on graph neural network

Info

Publication number: CN113095948A
Application number: CN202110315041.XA
Authority: CN
Inventors: 刘晓明; 尹伟; 沈超; 周亚东; 管晓宏
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2021-03-24
Filing date: 2021-03-24
Publication date: 2021-07-09
Anticipated expiration: 2041-03-24
Also published as: CN113095948B

Abstract

The invention discloses a multi-source heterogeneous network user alignment method based on a graph neural network. The method comprises: extracting attribute features of users through a text processing algorithm and reducing the dimension of the user attribute feature vectors with a dimension reduction algorithm; obtaining the network topological structure with a random walk algorithm, taking the dimension-reduced attribute features and the network structure as the input of a graph neural network, and learning identity features that contain both user attribute and structure information; calculating the similarity of user names and social roles across networks to find candidate user pairs; and aligning the identities of users across the multi-source heterogeneous networks with a neural network model. The method can be used to align user identities across social networks and has important applications in fields such as social network analysis and user profile completion. The algorithm has low computational complexity and high scalability, can align user identities in complex networks, and is well suited to real data.

Description

Multi-source heterogeneous network user alignment method based on graph neural network

Technical Field

The invention belongs to the field of social media data mining, and particularly relates to a multi-source heterogeneous network user alignment method based on a graph neural network.

Background

In recent years, users have registered on many online social networks, such as Twitter, Instagram, and LinkedIn. Because their functions differ, online social network platforms attract users for different purposes, such as looking up or sharing information and maintaining social relationships. For example, a user may use Twitter to post opinions about political events while sharing leisure activities on Instagram. To make better use of the services provided by each social network, users tend to join multiple online social networks, and it has become increasingly common for a user to hold accounts (also referred to as user identities) on several of them. According to a social media study, 93% of Instagram users also use Facebook, while 53% of Twitter users also use Instagram.

Aligning user identities across multi-source heterogeneous networks is of great significance. As registering accounts on multiple social media sites has become increasingly common, new opportunities and challenges have arisen for various data mining and learning tasks. First, when a user holds accounts on multiple social media websites, it becomes possible to understand the user's interests more fully and to provide better suggestions or services. Since users use different online social networks for different purposes, analyzing a user identity on a single social media site may not fully reveal the user's personality and interests. If, however, a person's user identities can be linked across multiple data sets so that their data on these social media websites are collected and analyzed together, a more comprehensive understanding of the user can be obtained and better services can be provided. Second, users with accounts on multiple social media websites make it possible to integrate patterns across online social websites and to solve problems that cannot be solved with data from a single website. One example is the data sparsity problem in many predictive tasks: a newly established social media service may not have enough historical data to make recommendations to a user. If these users can be identified on other, mature social media websites, knowledge can be transferred from the mature social media to the new one, mitigating data sparsity. Finally, users who hold accounts on multiple social media websites also help in analyzing user migration patterns and in guiding Web development. A user's migration from one social network to another typically reflects the user experience offered by each. User identity alignment across different social media sites therefore provides a great opportunity to study user migration behavior.

Although some platforms provide a function for unifying user IDs, many people do not fill in correct information for security reasons. How to align a user's personal identities across various social networks is therefore a research focus.

A graph neural network is a deep-learning-based method for processing graph-domain information; it can capture information such as the relationships between vertices in a graph and the network topological structure. Because of their good performance and interpretability, graph neural networks have recently become a widely used method of graph analysis.

Disclosure of Invention

To overcome the defects of the prior art and to address problems such as user anonymity, content heterogeneity, and network diversity in social networks, the invention aims to provide a multi-source heterogeneous network user alignment method based on a graph neural network.

The invention is realized by the following technical scheme.

A multi-source heterogeneous network user alignment method based on a graph neural network comprises the following steps:

step 1, preprocessing an online social network, defining friend or follow relations between users of the social network, and constructing an undirected and unweighted social network topological graph;

step 2, extracting attribute characteristics of the user through a text processing algorithm, and reducing dimensions through a principal component analysis method;

step 3, carrying out random walk sampling on all nodes of the social network topological graph to obtain the correlation of the nodes among the social networks;

step 4, taking the node attribute features after dimensionality reduction as input, training a graph neural network according to social network structure information obtained by random walk, and extracting identity features containing user attributes and structure information for each node in the topological graph through the graph neural network;

step 5, calculating the similarity of user names and social roles across the social networks, and finding candidate user pairs according to empirical thresholds;

step 6, concatenating the obtained identity features of user pairs across the social networks as the input of a neural network, training a multilayer neural network model on users whose identities are known to be aligned, and aligning the identities of the users in the candidate user pairs.

With respect to the above technical solutions, the present invention has a further preferable solution:

preferably, users of the social network are represented as nodes, friend or follow relations among the users are represented as edges, and an undirected and unweighted topological graph is constructed; the undirected topological graph is represented as G = (V, E), where V is the set of nodes in the social network and E is the set of edges in the topological graph G.

The specific method of the step 2 comprises the following steps:

step 2.1, preprocessing the user attribute information, removing fixed labels and punctuation marks, and separating the words with spaces; removing the stop words contained in the stop-word list provided by NLTK to finally obtain a bag-of-words document;

step 2.2, calculating to obtain an attribute feature vector of each user by using a TF-IDF algorithm;

step 2.3, reducing the dimension of the attribute features of the users by using a principal component analysis method.

The specific method of the step 3 comprises the following steps:

step 3.1, performing multiple random walks with fixed length on each node in the social network topological graph G, and recording a traversal node sequence;

step 3.2, sampling a fixed number of first-order neighbors and second-order neighbors for each node in the social network topological graph G; if the actual number of neighbors of a node is less than the required number of samples, sampling with replacement is used, and otherwise sampling without replacement is used.

In the step 3.1, a random walk algorithm is executed on each node in the social network topological graph G; each time, one node is selected with equal probability from the neighbors of the current node as the next hop; the walk length is t and the number of random walks is γ; the correlations between the nodes are finally obtained. The preset walk length is t = 5, and the preset number of random walks for each node is γ = 50.

In the step 3.2, because features of neighbors within two hops are aggregated, the sampling number of the first-order neighbors is k1 and the sampling number of the second-order neighbors is k2. The preset first-order neighbor sample number is k1 = 25, and the preset second-order neighbor sample number is k2 = 15.

The specific method of the step 4 comprises the following steps:

step 4.1, inputting the dimension-reduced attribute features into the two-layer graph neural network GraphSAGE, and performing the k-th aggregation on a target node in a mean aggregation manner: averaging each dimension of the feature vectors of the sampled neighbors of the target node, concatenating the result with the (k-1)-th layer vector of the target node, and applying a nonlinear transformation whose activation function is ReLU;

step 4.2, constructing a loss function by using an unsupervised training mode;

step 4.3, after training is finished, obtaining the user identity feature of each node; for a node newly added to the social network, taking the attribute information and neighbor information of the node as input and quickly computing its identity feature with the graph neural network.

The specific method of the step 5 comprises the following steps:

step 5.1, calculating the similarity of the user names by using the edit distance, namely the minimum number of edit operations required to change one character string into another;

step 5.2, calculating the score of each node in the network by using the PageRank algorithm, sorting the nodes by score, and expressing the social status of each user as a percentage;

step 5.3, selecting the nodes whose user name edit distance is within 5 and whose social ranking is within 25% as candidate user pairs.

The specific method of the step 6 comprises the following steps:

step 6.1, constructing a multilayer neural network model, concatenating the identity features of user pairs across the social networks as the input of the neural network model, and using a two-dimensional vector as the output, wherein [1,0] represents that the two users across the social networks are the same person and [0,1] represents that the two users are not the same person; the activation function is softmax, and a loss function L is constructed;

step 6.2, taking user pairs whose identities are known to be aligned as positive samples, and constructing negative samples with unaligned identities by random sampling, wherein the ratio of positive samples to negative samples is 1:1; fitting the model over multiple rounds of training by using a stochastic gradient descent method;

step 6.3, inputting the identity features of the candidate user pairs into the trained neural network, judging whether each candidate user pair is the same user, and thereby aligning the two social networks.

Due to the adoption of the technical scheme, the invention has the following beneficial effects:

(1) The invention extracts user identity features from the user name, attribute information, and social network structure information of each user, thereby aligning user identities across networks; all the required data are public and easy to obtain, so the method is applicable to various online social networks.

(2) The invention acquires the structural information of the network by random walk and trains the graph neural network by an unsupervised method, learning node attribute information and structural information end to end at the same time; the identity features of a user are obtained by aggregating the identity features of the user's multi-hop friends, which fully takes the actual characteristics of social networks into account.

(3) The graph neural network used by the invention learns inductively: when a new user joins the social network, the user's identity features can be obtained directly by aggregating the identity features of the user's friends, which saves computation cost and avoids retraining the network.

(4) The alignment model is a multilayer neural network that can learn the specific relationship between two networks, and it is applicable to any social network.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention:

FIG. 1 is a schematic flow chart of a multi-source heterogeneous network user alignment method based on a graph neural network according to the present invention;

FIG. 2 is a schematic diagram of the node neighbor sampling manner in the graph neural network;

FIG. 3 is a schematic diagram of the node aggregation manner in the graph neural network.

Detailed Description

The present invention will now be described in detail with reference to the drawings and specific embodiments, wherein the exemplary embodiments and descriptions of the present invention are provided to explain the present invention without limiting the invention thereto.

As shown in fig. 1, the multi-source heterogeneous network user alignment method based on the graph neural network provided by the embodiment of the present invention includes the following steps:

step S1: multi-source heterogeneous social network data pre-processing

The online social network data set in this embodiment is taken from https://www.aminer.cn/cosnet and comprises two parts, a social network data set and an academic data set; the user alignment data in the social network data set were collected by Perito et al. through the Google Profiles service. Two social networks in the social network data set, Twitter and MySpace, are selected for the experiment. The Twitter data set includes 40,171,624 users and 1,468,365,182 edges and contains 28,199 detailed user profiles; the MySpace data set includes 854,498 users and 6,489,736 edges and contains 9,993 detailed user profiles.

The online social network is preprocessed, friend or follow relations between users of the social network are defined, and an undirected and unweighted social network topological graph is constructed. Users of the social network are represented as nodes and friend or follow relations among the users as edges, giving an undirected, unweighted topological graph G = (V, E), where V is the set of nodes in the social network and E is the set of edges in the topological graph G.
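The graph construction in step S1 can be sketched as follows. This is a minimal illustration rather than the embodiment's actual preprocessing code: the function name `build_graph`, the adjacency-set representation, and the sample user names are assumptions introduced here.

```python
from collections import defaultdict


def build_graph(relations):
    """Build an undirected, unweighted graph as a mapping node -> set of neighbors.

    `relations` is an iterable of (user_a, user_b) friend/follow pairs.
    Direction and repeated pairs are discarded; self-loops are ignored.
    """
    adjacency = defaultdict(set)
    for a, b in relations:
        if a == b:
            continue  # a user following themselves carries no structure
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


# Hypothetical follow relations: alice and bob follow each other, bob follows carol.
relations = [("alice", "bob"), ("bob", "alice"), ("bob", "carol")]
graph = build_graph(relations)

nodes = set(graph)  # V
edges = {frozenset((u, v)) for u in graph for v in graph[u]}  # E, one entry per undirected edge
```

Storing neighbors in sets makes the mutual follow between the first two users collapse into a single undirected edge, which is what "undirected and unweighted" requires.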

Step S2: extracting attribute features of the users through a text processing algorithm and reducing their dimension through a principal component analysis method for subsequent input. This specifically comprises the following steps:

step S201: the user attribute information is preprocessed: fixed labels and punctuation marks are removed, and the words are separated by spaces. For example, if the user name is "Black-Hawk", the processing result is the two words "Black" and "Hawk"; if the address filled in by the user is "Georgia Tech, Atlanta, Georgia", the processing result is the four words "Georgia", "Tech", "Atlanta", and "Georgia". In addition, the stop words contained in the stop-word list provided by NLTK are removed, and the bag-of-words document is finally obtained.

Step S202: the attribute feature vector of each user is calculated by using the TF-IDF algorithm.

Step S203: the dimension of the attribute features of the users is reduced by using a principal component analysis method, so that the attribute features of each user form a 64-dimensional vector.
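Steps S201 and S202 can be sketched with the standard library alone. The sketch below is illustrative only: `STOP_WORDS` is a tiny stand-in for the NLTK stop-word list, lower-casing the tokens is an added assumption, and the TF-IDF variant (term frequency times log(N/df)) is one common definition that the text does not pin down. The PCA reduction of step S203 is not shown; in practice it could be done with an existing implementation such as scikit-learn's `PCA(n_components=64)`.

```python
import math
import re
from collections import Counter

# Tiny stand-in for the NLTK English stop-word list (assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "at"}


def tokenize(text):
    """Lower-case, split on anything that is not a letter or digit, drop stop words."""
    words = re.split(r"[^0-9A-Za-z]+", text.lower())
    return [w for w in words if w and w not in STOP_WORDS]


def tfidf(documents):
    """Return (vocabulary, vectors): one TF-IDF vector per tokenized document."""
    vocabulary = sorted({w for doc in documents for w in doc})
    n_docs = len(documents)
    doc_freq = Counter(w for doc in documents for w in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty profiles
        vectors.append([
            (counts[w] / total) * math.log(n_docs / doc_freq[w])
            for w in vocabulary
        ])
    return vocabulary, vectors


# The two attribute strings used as examples in step S201.
profiles = ["Black-Hawk", "Georgia Tech, Atlanta, Georgia"]
docs = [tokenize(p) for p in profiles]
vocab, vectors = tfidf(docs)
```

With this definition a word that occurs in every profile gets weight zero, so only words that distinguish users contribute to the feature vector.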

Step S3: random walks are performed on all nodes of the social network topological graph G to obtain the correlations between nodes in the social network, and the relevant neighbors of each node are sampled. This specifically comprises the following steps:

step S301: a random walk algorithm is executed on each node in the social network topological graph G; each time, one node is selected with equal probability from the neighbors of the current node as the next hop, and the sequence of traversed nodes is recorded; the correlations between the nodes are finally obtained. The walk length is 5, and the number of random walks per node is 50.
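The uniform random walk of step S301 can be sketched as below. The function name `random_walks`, the fixed seed, and the reading of "walk length 5" as five hops (six recorded nodes) are assumptions made for illustration.

```python
import random


def random_walks(graph, walk_length=5, walks_per_node=50, seed=0):
    """Run `walks_per_node` uniform random walks of `walk_length` hops from every node.

    `graph` maps each node to the set of its neighbors. Each walk is returned
    as the list of visited nodes, starting with the start node.
    """
    rng = random.Random(seed)
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            current = start
            for _ in range(walk_length):
                neighbors = sorted(graph[current])
                if not neighbors:
                    break  # isolated node: the walk cannot continue
                current = rng.choice(neighbors)  # every neighbor is equally likely
                walk.append(current)
            walks.append(walk)
    return walks


# Hypothetical three-node graph: node 1 is connected to nodes 2 and 3.
toy_graph = {1: {2, 3}, 2: {1}, 3: {1}}
walks = random_walks(toy_graph)
```

Nodes that co-occur in a walk are later treated as related pairs when the graph neural network is trained without labels.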

Step S302: the graph neural network used here mainly aggregates the features of neighbors within two hops. A fixed number of first-order neighbors (directly connected nodes) and second-order neighbors (neighbors of neighbors) are sampled for each node in the social network topological graph G; the sampling number of first-order neighbors is 25 and that of second-order neighbors is 15. If the actual number of neighbors of a node is less than the required number of samples, sampling with replacement is used; otherwise, sampling without replacement is used. For example, in FIG. 2 the sampling number of first-order neighbors is 3 and that of second-order neighbors is 4; the sampled first-order neighbors of node 1 are nodes 2, 3, and 4, and the sampled second-order neighbors are those marked in the figure.
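The fixed-size neighbor sampling of step S302 can be sketched as follows. The helper names are hypothetical, and the choice to draw k2 second-order samples for each sampled first-order neighbor follows the usual GraphSAGE convention, which is an assumption here because the text does not say whether k2 is per neighbor or in total. The toy graph is invented and is not the graph of FIG. 2.

```python
import random


def sample_neighbors(neighbors, k, rng):
    """Return exactly k sampled neighbors (or [] if the node has none)."""
    pool = sorted(neighbors)
    if not pool:
        return []
    if len(pool) < k:
        # Too few neighbors: sample with replacement so the output still has k entries.
        return [rng.choice(pool) for _ in range(k)]
    # Enough neighbors: sample without replacement.
    return rng.sample(pool, k)


def sample_two_hops(graph, node, k1=25, k2=15, seed=0):
    """Sample k1 first-order neighbors, then k2 neighbors of each sampled neighbor."""
    rng = random.Random(seed)
    first_order = sample_neighbors(graph[node], k1, rng)
    second_order = {
        u: sample_neighbors(graph[u], k2, rng) for u in sorted(set(first_order))
    }
    return first_order, second_order


# Hypothetical graph, sampled with the small sizes used in the FIG. 2 example (3 and 4).
toy_graph = {1: {2, 3, 4}, 2: {1, 5}, 3: {1, 6, 7}, 4: {1}, 5: {2}, 6: {3}, 7: {3}}
first_order, second_order = sample_two_hops(toy_graph, 1, k1=3, k2=4)
```

Fixing the sample sizes keeps every node's computation graph the same shape, which is what allows the aggregation to be batched.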

Step S4: taking the node attribute features after dimensionality reduction as input, training a graph neural network according to social network structure information obtained by random walk, and extracting identity features containing user attributes and structure information for each node in a topological graph through the graph neural network, wherein the method specifically comprises the following steps:

step S401: the dimension-reduced attribute features are input into the two-layer graph neural network GraphSAGE, and the k-th aggregation is performed on each target node in a mean aggregation manner. The k-th aggregation of a target node averages each dimension of the feature vectors of its sampled neighbors, concatenates the result with the (k-1)-th layer vector of the target node, and applies a nonlinear transformation whose activation function is ReLU. The aggregation of node identity features is shown in FIG. 3, where the features of all nodes in the figure are the results of the (k-1)-th aggregation. The k-th aggregation of node 1 first averages the features of its neighbors 2, 3, and 4:

$$\left[\frac{2+3+1}{3},\ \frac{4+2+3}{3},\ \frac{2+6+3}{3}\right]^T = [2,\ 3,\ 3.67]^T$$

The features of node 1 are then concatenated with this mean to obtain $[4, 2, 5, 2, 3, 3.67]^T$, which is multiplied by the trainable matrix $W^k$ and passed through the activation function to give the identity feature of node 1.
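The worked example of step S401 can be reproduced in a few lines. The neighbor vectors below are read off the three sums in the example, with the second component of neighbor 3 taken as 2 because that is the only value consistent with the stated mean of 3; the weight matrix `W` is purely hypothetical, since the trained matrix is not given.

```python
def mean_aggregate(neighbor_features):
    """Average the neighbor feature vectors dimension by dimension."""
    n = len(neighbor_features)
    return [sum(column) / n for column in zip(*neighbor_features)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Element-wise ReLU activation."""
    return [max(0.0, x) for x in vector]


h_node1 = [4, 2, 5]                                   # (k-1)-th layer features of node 1
h_neighbors = [[2, 4, 2], [3, 2, 6], [1, 3, 3]]       # assumed features of neighbors 2, 3, 4

neighbor_mean = mean_aggregate(h_neighbors)           # approximately [2, 3, 3.67]
concatenated = h_node1 + neighbor_mean                # 6-dimensional vector

# Hypothetical 2x6 weight matrix standing in for the trained W^k.
W = [
    [0.5, 0.0, 0.0, 0.5, 0.0, 0.0],
    [0.0, -1.0, 0.0, 0.0, 0.0, 0.0],
]
h_node1_next = relu(matvec(W, concatenated))          # k-th layer features of node 1
```

The second output component is negative before activation and is clipped to zero by ReLU, which shows the role of the nonlinear transformation.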

Step S402: a loss function is constructed in an unsupervised training manner:

$$J(Z_u) = -\log\left(\sigma\left(Z_u^T Z_v\right)\right) - Q \cdot \mathbb{E}_{v_n \sim P_n(v)} \log\left(\sigma\left(-Z_u^T Z_{v_n}\right)\right)$$

where $Z_u$ is the embedding vector generated by the model for node u, node v is a neighbor that node u reaches through a random walk, $Z_v$ is the embedding vector of node v, $v_n$ is a negative sample node drawn from $P_n$, σ is the sigmoid activation function, $P_n$ is the probability distribution for negative sampling, and Q is the number of negative samples. An Adam optimizer is used for training, and the graph neural network is fitted over multiple rounds of training.
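The unsupervised objective of step S402 can be sketched numerically as follows, assuming it is the standard GraphSAGE negative-sampling loss, which matches the symbols defined in the text ($Z_u$, $Z_v$, σ, $P_n$, Q). The function names are hypothetical, and the expectation over $P_n$ is estimated by Q drawn negative embeddings, so the factor Q times the mean becomes a plain sum.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unsupervised_loss(z_u, z_v, negative_embeddings):
    """Negative-sampling loss for one (u, v) pair co-occurring on a random walk.

    `negative_embeddings` holds the embeddings of Q nodes drawn from P_n.
    """
    positive_term = -math.log(sigmoid(dot(z_u, z_v)))
    negative_term = -sum(
        math.log(sigmoid(-dot(z_u, z_n))) for z_n in negative_embeddings
    )
    return positive_term + negative_term


# Hypothetical 2-dimensional embeddings.
z_u = [1.0, 0.5]
good = unsupervised_loss(z_u, [1.0, 0.5], [[-1.0, -0.5], [-0.5, -1.0]])
bad = unsupervised_loss(z_u, [-1.0, -0.5], [[1.0, 0.5], [0.5, 1.0]])
```

The loss is small when a node's embedding is close to those of its walk neighbors and far from the negative samples (`good`), and large in the opposite case (`bad`).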

Step S403: after training is finished, the user identity feature of each node is obtained; for a node newly added to the social network, the attribute information of the node and of its neighbors is taken as input, and the identity feature is computed by the graph neural network.

Step S5: calculating the similarity of user names and social roles across the social networks, and finding candidate user pairs according to empirical thresholds. This specifically comprises the following steps:

step S501: the user name similarity is calculated using the edit distance, i.e. the minimum number of edit operations required to change one string into another. The user name edit distance calculation process is shown in Table 1.

TABLE 1
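The edit distance of step S501 can be computed with the classic dynamic-programming recurrence. The sketch below assumes the Levenshtein variant, in which insertion, deletion, and substitution each cost one edit; the function name and the sample user names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn string `a` into string `b`."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


# Hypothetical user names of the same person on two networks.
distance = edit_distance("black-hawk", "blackhawk")
```

Keeping only two rows of the table gives memory proportional to the shorter dimension, which matters when many user-name pairs are compared.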

Step S502: the score of each node in the network is calculated by using the PageRank algorithm, the nodes are sorted by score, and the social status of each user is expressed as a percentage.

Step S503: the nodes whose user name edit distance is within 5 and whose social ranking is within 25% are selected as candidate user pairs.
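Steps S502 and S503 can be sketched as follows. Several points are assumptions made for illustration: the damping factor 0.85 and the fixed iteration count, the reading of "social ranking within 25%" as a gap of at most 25 percentage points between the two users' rank percentiles, and all function and variable names. User-name distances are passed in precomputed so that the sketch stays self-contained.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a graph given as node -> set of neighbors."""
    n = len(graph)
    rank = {u: 1.0 / n for u in graph}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in graph}
        for u, neighbors in graph.items():
            if neighbors:
                share = damping * rank[u] / len(neighbors)
                for v in neighbors:
                    new_rank[v] += share
            else:
                # A node without neighbors spreads its rank evenly over all nodes.
                for v in graph:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank


def rank_percentile(rank):
    """Map each node to its position in the ranking as a fraction (top node = 1/n)."""
    ordered = sorted(rank, key=rank.get, reverse=True)
    return {u: (i + 1) / len(ordered) for i, u in enumerate(ordered)}


def select_candidates(name_distance, pct_a, pct_b, max_distance=5, max_rank_gap=0.25):
    """Keep pairs whose name distance and rank-percentile gap are both small."""
    return [
        (a, b)
        for (a, b), d in sorted(name_distance.items())
        if d <= max_distance and abs(pct_a[a] - pct_b[b]) <= max_rank_gap
    ]


# Hypothetical star-shaped network: "hub" is connected to three other users.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
scores = pagerank(star)
percentiles = rank_percentile(scores)

# Hypothetical percentiles and precomputed user-name edit distances across two networks.
pct_a = {"blackhawk": 0.10, "georgia_fan": 0.80}
pct_b = {"black_hawk": 0.20, "g_fan": 0.30}
name_distance = {("blackhawk", "black_hawk"): 1, ("georgia_fan", "g_fan"): 6}
candidates = select_candidates(name_distance, pct_a, pct_b)
```

Filtering on both criteria first keeps the number of pairs passed to the alignment classifier far below the full cross product of the two user sets.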

Step S6: concatenating the obtained identity features of user pairs across the social networks as the input of a neural network, training a multilayer neural network model on users whose identities are known to be aligned, and aligning the identities of the users in the candidate user pairs. This specifically comprises the following steps:

step S601: a multilayer neural network model is constructed; the identity features of user pairs across the networks are concatenated as the input of the neural network model, and a two-dimensional vector is used as the output, where [1,0] represents that the two users across the networks are the same person and [0,1] represents that they are not; the activation function is softmax, and the loss function L is constructed:

where y represents the true result of the input user pair and ŷ represents the predicted result.

Step S602: user alignment is a binary classification problem with two classes of samples, positive samples (identity-aligned pairs) and negative samples (identity-unaligned pairs). User pairs whose identities are aligned are used as positive samples, and negative samples are constructed by random sampling, with a positive-to-negative ratio of 1:1. Stochastic gradient descent is used with a learning rate λ of 0.001, and the model is fitted over multiple rounds of training.

Step S603: the identity features of the candidate user pairs are input into the trained neural network, which judges whether each candidate pair is the same user, thereby aligning the two networks.
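The output encoding and sample construction of steps S601 and S602 can be sketched as below. The loss shown is cross-entropy, which is the usual companion of a softmax output but is an assumption here because the formula for L is not reproduced in the text; the function names and user identifiers are likewise hypothetical, and the multilayer network itself is omitted.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))


def build_training_pairs(aligned_pairs, users_a, users_b, seed=0):
    """Label known aligned pairs [1, 0] and add equally many random negatives [0, 1].

    Assumes enough non-aligned pairs exist to reach the 1:1 ratio.
    """
    rng = random.Random(seed)
    positives = set(aligned_pairs)
    negatives = set()
    while len(negatives) < len(positives):
        pair = (rng.choice(users_a), rng.choice(users_b))
        if pair not in positives:
            negatives.add(pair)
    samples = [(a, b, [1, 0]) for a, b in aligned_pairs]
    samples += [(a, b, [0, 1]) for a, b in sorted(negatives)]
    return samples


# Hypothetical users on two networks, with two known aligned identities.
users_a = ["a1", "a2", "a3"]
users_b = ["b1", "b2", "b3"]
aligned = [("a1", "b1"), ("a2", "b2")]
samples = build_training_pairs(aligned, users_a, users_b)

# Loss for one pair when the network is undecided (equal logits).
undecided_loss = cross_entropy([1, 0], softmax([0.0, 0.0]))
```

For each sample, the network input would be the concatenation of the two users' identity features, and the two softmax outputs are compared with the [1,0] or [0,1] label.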

To verify the effect of the graph-neural-network-based user alignment method proposed in this embodiment, the method is evaluated on the test set of this embodiment, with accuracy and recall as the evaluation metrics of the model.

The experimental results of this example are as follows:

The accuracy on the test set stabilized at 0.7796, and the recall stabilized at 0.7248.
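The two evaluation metrics can be computed as in the sketch below. It takes "accuracy" literally as the fraction of correctly classified pairs and "recall" as the fraction of truly aligned pairs that the model recovers; both readings, the function name, and the toy labels are assumptions, and the numbers are not the experimental data.

```python
def accuracy_and_recall(y_true, y_pred):
    """Compute accuracy and recall for binary labels (1 = same person, 0 = different)."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    true_positive = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_negative = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = correct / len(pairs)
    actual_positive = true_positive + false_negative
    recall = true_positive / actual_positive if actual_positive else 0.0
    return accuracy, recall


# Toy labels: two aligned pairs and two unaligned pairs, one mistake of each kind.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 1]
accuracy, recall = accuracy_and_recall(y_true, y_pred)
```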

The experimental results show that the graph-neural-network-based user alignment method can align user identities across networks effectively and accurately. The method can be used to align user identities among multiple social networks and has important applications in fields such as social network data analysis, completion of missing user-profile information, and product recommendation on social platforms. The algorithm has low computational complexity and high scalability, can align user identities in complex networks, and is applicable to many social networks.

Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. Various changes that are apparent to those skilled in the art fall within the protection of the invention as long as they remain within the spirit and scope of the invention as defined by the appended claims, and all inventions that make use of the inventive concept are protected.

Claims

1. A multi-source heterogeneous network user alignment method based on a graph neural network is characterized by comprising the following steps:

2. The multi-source heterogeneous network user alignment method based on the graph neural network is characterized in that users of the social network are represented as nodes, friend or follow relations among the users are represented as edges, and an undirected and unweighted topological graph is constructed; the undirected topological graph is represented as G = (V, E), where V is the set of nodes in the social network and E is the set of edges in the topological graph G.

3. The multi-source heterogeneous network user alignment method based on the graph neural network according to claim 1, wherein the specific method in the step 2 is as follows:

4. The multi-source heterogeneous network user alignment method based on the graph neural network according to claim 1, wherein the specific method in the step 3 is as follows:

5. The multi-source heterogeneous network user alignment method based on the graph neural network according to claim 3, characterized in that in the step 3.1, a random walk algorithm is executed on each node in the social network topological graph G; each time, one node is selected with equal probability from the neighbors of the current node as the next hop; the walk length is t and the number of random walks is γ; the correlations between the nodes are finally obtained;

in the step 3.2, because features of neighbors within two hops are aggregated, the sampling number of the first-order neighbors is k1, and the sampling number of the second-order neighbors is k2.

6. The multi-source heterogeneous network user alignment method based on the graph neural network as claimed in claim 5, wherein the preset walk length is t = 5, and the preset number of random walks for each node is γ = 50.

7. The multi-source heterogeneous network user alignment method based on the graph neural network as claimed in claim 5, wherein the preset first-order neighbor sample number is k1 = 25 and the preset second-order neighbor sample number is k2 = 15.

8. The multi-source heterogeneous network user alignment method based on the graph neural network according to claim 1, wherein the specific method in the step 4 is as follows:

step 4.2, constructing a loss function by using an unsupervised training mode:

$$J(Z_u) = -\log\left(\sigma\left(Z_u^T Z_v\right)\right) - Q \cdot \mathbb{E}_{v_n \sim P_n(v)} \log\left(\sigma\left(-Z_u^T Z_{v_n}\right)\right)$$

wherein $Z_u$ is the embedding vector generated by the model for node u, node v is a neighbor that node u reaches through a random walk, $Z_v$ is the embedding vector of node v, $v_n$ is a negative sample node drawn from $P_n$, σ is the sigmoid activation function, $P_n$ is the probability distribution for negative sampling, and Q is the number of negative samples;

9. The multi-source heterogeneous network user alignment method based on the graph neural network according to claim 1, wherein the specific method in the step 5 is as follows:

10. The multi-source heterogeneous network user alignment method based on the graph neural network according to claim 1, wherein the specific method in the step 6 is as follows:

step 6.1, constructing a multilayer neural network model, concatenating the identity features of user pairs across the social networks as the input of the neural network model, and using a two-dimensional vector as the output, wherein [1,0] represents that the two users across the social networks are the same person and [0,1] represents that the two users are not the same person; using softmax as the activation function, and constructing a loss function L:

where y represents the true result of the input user pair and ŷ represents the predicted result;