CN109635989B

CN109635989B - Social network link prediction method based on multi-source heterogeneous data fusion

Info

Publication number: CN109635989B
Application number: CN201810999492.8A
Authority: CN
Inventors: 周帆; 钟婷; 吴帮莹
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2018-08-30
Filing date: 2018-08-30
Publication date: 2022-03-29
Anticipated expiration: 2038-08-30
Also published as: CN109635989A

Abstract

The invention discloses a method for social network link prediction based on multi-source heterogeneous data fusion, which utilizes a social network based on geographic location information including two heterogeneous data sources, a user relationship topology map and a user check-in record, to link links. predict. The present invention proposes a hybrid framework, which fully captures the relationship between the two heterogeneous data sources, the user relationship topology map and the user check-in record, in the social network based on geographic location information through the model AL, and overcomes the use of geographic location information-based social networking. When a single data source in the network performs link prediction, the problem of inaccurate prediction results effectively improves the effect of link prediction. At the same time, locality-sensitive hashing is applied to improve the computational speed of deep learning training and reduce storage overhead.

Description

Social network link prediction method based on multi-source heterogeneous data fusion

Technical Field

The invention belongs to the field of neural Networks in machine learning, and relates to a deep learning-Based method, in particular to a method for fusing two heterogeneous data, namely a user relationship topological graph and a user check-in record in a Social network (LBSN) Based on geographical position information by utilizing deep learning to realize Social network link prediction, and improving the calculation speed of deep learning training and reducing the storage cost by using Local Sensitive Hashing (LSH).

Background

Link Prediction (LP), Link Prediction for short, aims to find out a missing edge or an edge which will appear in the future in a user relationship topological graph formed by friend relationships. With the rapid growth of Social Network Services (SNS) and other Network applications, Network data is ubiquitous. Network data such as friend relationships on APPs such as Facebook and QQ are obtained, a user relationship topological graph can be constructed, and the user relationship topological graph can be used for social network link prediction. Meanwhile, with the development of positioning technology, the GPS positioning function of the mobile equipment can be used for collecting the position information of the user, and the position information can be combined with the positioning time to form a user check-in record. Many studies have shown that user check-in records also contribute to social network link prediction.

The link prediction plays an important role in an information recommendation system, is mainly used in social network analysis, can obtain friends with higher confidence degree through the link prediction, recommends the friends possibly known by a user, can remarkably improve the social experience and loyalty of the user, and brings great economic benefits to enterprises. Besides predicting user association in a user relationship topological graph, the link prediction method and the link prediction idea can be used for predicting the type of a label-free node in a network with known partial node types, and the method and the idea have great value for network recombination and structural function optimization.

In the conventional link prediction method, the similarity between two user nodes is usually measured by using a Jaccard, euclidean distance or cosine value, so as to determine whether the link exists. None of these methods is flexible enough. If a new data set is replaced, or data is added or deleted from the original data set, all data needs to be recalculated, and a large amount of computing and storage resources are consumed. The deep learning can flexibly process mass data. The link prediction model built based on the deep learning method can optimize the parameters of the model by inputting massive training data so as to obtain the trained model for prediction work.

Disclosure of Invention

The LBSN data set used by the invention comprises two data sources with different structures, namely a user relation topological graph and a user check-in record. The user relationship topological graph is formed by the relationship among users, the relationship among the users is called as a link (namely a point pair), and each link is formed by the relationship of two user nodes. The user check-in record is composed of a check-in user node, a check-in Point longitude, a check-in Point latitude, a check-in time and a Point-of-Interest (POI).

The invention aims to solve the problem of inaccurate prediction result when a single data source in LBSN is used for link prediction. The basic idea of the invention is to provide a hybrid framework, which fuses two heterogeneous data, namely a user relationship topological graph and a user check-in record in LBSN (location based service), so as to realize link prediction and enhance the prediction effect of the existing link prediction method. And meanwhile, the LSH is adopted to improve the performance of calculation and storage.

Based on the above invention thought, the invention provides a social network link prediction method based on multi-source heterogeneous data fusion, which comprises the following steps:

s1, Data _ process (g) → Tra, Tes: and extracting a training set Tra and a test set Tes from the user relation topological graph G ═ V, E. Wherein V represents the set of user nodes in the topology graph and E represents the set of edges in the topology graph. If two users u in G_iAnd u_jIf there is a social relationship, there is an edge between them, denoted as e_ij＝(u_i,u_j)；

S2，

Learning and acquiring a social network user vector V from a positive sample G' of Tra by adopting a network representation learning method, and recording the social network user vector V as

Wherein d is

Dimension (d);

S3，

constructing a user-position check-in frequency matrix according to the user check-in record S ═ U, L

Wherein U and L respectively represent a user set and a check-in point set in S, N is the number of users in U, and M is the number of check-in points in L. And then obtaining a user access preference vector in a low-dimensional vector space by using Poisson matrix decomposition, and recording the user access preference vector as

Wherein D is

Dimension (d);

S4，

to capture the association of these two types of data sources in the LBSN, in a manner similar to anchor link (anchor link), an improved deep learning model, called AL, was designed.

As a sample, a sample of,

as the label corresponding to the sample, the two vectors are input into AL together for multi-round training. Generating a new user access preference vector fused with topology information in G using the final trained AL

S5，

Will be provided with

And u_i ^'vAnd performing fusion again, and inputting the fused signal into a Convolutional Neural Network (CNN) for training. And finally, inputting the Tes into the trained CNN for link prediction to obtain a prediction result.

In the above method for predicting a social network link based on multi-source heterogeneous data fusion, the step S1 is to obtain Tra and Tes. Link prediction can be viewed as a binary problem, with links present in G being viewed as positive samples and links not present being viewed as negative samples. The positive samples in Tra are the user relationship topological graph G' epsilon G missing part of the links, and the missing links are used as the positive samples of Tes. The method specifically comprises the following steps:

s11, data cleaning is carried out, so that users in G and S in LBSN are consistent;

s12, select some links from G as Tes positive samples. Meanwhile, the G' belonging to G after the Tes positive sample is removed from G is ensured to be communicated; taking G' as a positive sample of Tra;

and S13, randomly selecting some nonexistent links from G as negative samples, and distributing the links into Tra and Tes according to a predefined proportion.

In the above method for predicting social network links based on multi-source heterogeneous data fusion, the step S3 is to obtain

The method specifically comprises the following steps:

s31, construction of H using S. Wherein the row of H represents the user, the column represents the POI, and the value of H is filled by the number of times that the user accesses the corresponding POI;

s32, performing Poisson matrix decomposition on H to obtain matrix reflecting user access preference

And POI feature matrix

The POI feature matrix can reflect the condition that a certain POI is visited by a user. U shape_sAs a line of

In the above method for predicting the social network link based on multi-source heterogeneous data fusion, the step S4 is to capture the association between G and S in the lbs n, so as to implement fusion. In order to capture such a correlation, the training of the model AL specifically comprises the following sub-steps:

s41, utilizing

Representing the user nodes in the Tra, and calculating the cosine mean cos of the user point pairs in the Tra_ori；

S42, capturing the association between G and S by using the one-to-one correspondence of users in V and U. Mixing the sample

And corresponding label

Dividing the data into a plurality of batches (batch) and circularly inputting the batches into a Multi Layer Performance (MLP) for training;

and S43, optimizing the parameters in the model AL through multiple rounds of training. After AL is trained, will

Input AL, output u_i ^'v。

The implementation method of the model AL described above involves two calculation functions in step S42. The first calculation function is a mapping function for capturing the one-to-one correspondence relationship between users in V and U, and is recorded as

The mapping function corresponds to a loss function of

Where x represents a sample, y represents a true value, a represents an output value of the model, and n represents the number of samples. A random gradient descent algorithm is called to optimize a global weight parameter W and a global deviation parameter b, and the optimization processes are respectively recorded as

Where σ represents the activation function and z is the input to the neuron, expressed as

The second calculation function is to ensure that u is generated_i ^'vNo offset, i.e. use of u_i ^'vThe mean value of the cosine of the calculation of the user point pair in the Tra is not less than cos_ori. Therefore, a cosine mean constraint limit is introduced, noted

Wherein

And

respectively represent users u_mAnd user u_nIs given, and e is present in G_mn. N (U) represents the number of users in U. The cosine mean constraint limits the corresponding loss function to

The tuning processes of the global weight parameter W and the global bias parameter b are respectively recorded as

In the above method for predicting a social network link based on multi-source heterogeneous data fusion, in step S5, G and S are fused again to realize link prediction with low storage consumption and high calculation speed. The method specifically comprises the following steps:

s51, mixing

And u_i ^'vSpliced into a vector

S52, applying LSH

Projecting into a binary directionQuantity m_i∈{0,1}^mUp, user u_iUsing m_iRepresents;

s53, for any edge e in G_ij＝(u_i,u_j) Obtaining m by the same method_jAs user u_jIs represented by (a);

s54, mixing m_iAnd m_jStitching to obtain edge e_ijIs represented by a binary vector of m_ij(∈{0,1}^2m)＝[m_i；m_j]；

S55, converting the vector m with the length of 2m_ijThe elements in (1) are sequentially filled into a square matrix with the size of n × n in the order of line priority, and the process is called reshaping. Then inputting the square matrix into a Convolutional Neural Network (CNN) for training;

and S56, inputting Tes into the trained CNN for link prediction.

Compared with the prior art, the invention has the following beneficial effects:

1. the method for predicting the social network link based on the multi-source heterogeneous data fusion generates a new user access preference vector fusing the user relationship topological graph by using the model AL similar to the anchor link, and can fully capture the association between the social network user vector and the user access preference vector. The AL can align the social network user vector and the user access preference vector of the same user, and can ensure that a new user access preference vector generated after the user vectors in the two data sources are fused does not shift by introducing a constraint mechanism of cosine mean.

2. The invention discloses a social network link prediction method based on multi-source heterogeneous data fusion, which is characterized in that a social network user vector and a user access preference vector which are respectively obtained by two data sources in an LBSN (location based service) N (location based service) are spliced, the spliced vector is projected onto a binary vector by applying the LSH (least significant Shift H), the binary vector is taken as a final representation vector of a user node in the LBSN, and the final vector of the user node is used for splicing and representing the point-to-point relation according to the point-to-point relation contained in a link in a user relation topological graph. And finally, reshaping the spliced vector into a square matrix, and inputting the square matrix into the CNN to realize link prediction. The application of the LSH can improve the computation speed of deep learning for training and reduce the storage overhead.

Drawings

FIG. 1 is an anchor-like link model AL for capturing associations between social network user vectors and user access preference vectors based on a multi-tier perceptron (MLP).

FIG. 2 is an overall model architecture of a social network link prediction method based on multi-source heterogeneous data fusion. Splicing the new user access preference vector fused with the user relationship topological graph generated in the graph 1 with the social network user vector, projecting the spliced vector to a binary vector by applying an LSH (least squares) algorithm, taking the binary vector as a final representation vector of the user nodes in the LBSN (location based service) N, and splicing the final vector of the user nodes to represent the point-to-point relationship according to the point-to-point relationship contained in the link in the user relationship topological graph. And finally reshaping the spliced vector into a square matrix, and inputting the square matrix into the CNN.

FIG. 3 is a comparison of the performance of the social network link prediction method based on multi-source heterogeneous data fusion when LSH is not used and when LSH is used. Wherein (a) is memory consumption comparison, (b) is CPU consumption comparison, (c) is GPU consumption comparison, and (d) is time consumption comparison. (d) Is the vector dimension input to CNN.

Interpretation of terms

LBSN is an abbreviation for Location-based Social Network, representing a "Location-based Social Network". The LBSN not only contains the contact between people in the traditional social network, but also records the time of the user sign-in, the geographic position and other information.

POI is an abbreviation for Point-of-Interest, representing a "Point of Interest". In lbs n, a POI is a place where a user checks in.

LSH is an abbreviation for local Sensitive Hashing, meaning "Locality Sensitive Hashing". The method is a rapid nearest neighbor search algorithm aiming at massive high-dimensional data.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Examples

The method for predicting the social network link based on the multi-source heterogeneous data fusion can be used for data sets comprising two data sources, namely a user relationship topological graph and a user check-in record. With the real-world LBSN data set shown in Table 1, such as Foursquare (available from Foursquare)http://snap.stanford.eduAcquisition) was performed as an example.

Table 1: relevant information of social link prediction training set for multi-source heterogeneous data fusion

Dataset	#check_ins	#POIs	#edges	＃users
					Foursquare@NYC	22,563	1,992	5,810	588
Foursquare@TKY	38,742	2,212	9,624	1,055
					Gowalla@DC	13,594	4,795	5,826	880
Gowalla@CHI	10,314	3,269	2,542	627
					Brightkite	75,522	4,038	33,008	1,502

As shown in fig. 1, step S1 is first substituted with the user relationship topology G in Foursquare: data _ process (G) → Tra, Tes, a training set Tra and a test set Tes are acquired. The positive sample G' in Tra is substituted into step S2:

wherein the social network user vector can be obtained using a common network learning representation method, such as node2vec

Next, step S3 is substituted first using the user check-in record S in Foursquare:

obtaining a user access preference vector

The outputs of steps S2 and S3

And

input into model AL, using step S4:

obtaining u_i ^'v。

Table 2: effect of social link prediction on three real data sets

FIG. 2 is an overall model architecture of a social network link prediction method based on multi-source heterogeneous data fusion.

As shown in FIG. 2, the outputs of steps S2 and S4

And u_i ^'vInput into the prediction model CNN, using step S5:

and acquiring a final prediction result. The link prediction effect of the hybrid model is shown in table 2.

# check _ ins indicates the number of user check-in records;

# POIs indicates the number of different POIs in the user check-in record;

# edges indicates the number of links in the user relationship topology;

# users represents the number of users in the user relationship topology graph (or user check-in record);

foursquare @ NYC represents data in the data set Foursquare with the area of New York City;

foursquare @ TKY represents data in the data set Foursquare with the region of Tokyo;

gowalla @ DC represents data in the data set Gowalla with regions of Washington;

gowalla @ CHI represents data in the data set Gowalla with the region Chicago;

vec2 link-is a method of social network link prediction based on multi-source heterogeneous data fusion without using LSH;

vec2link + is a social network link prediction method based on multi-source heterogeneous data fusion and using LSH, and compared with vec2link-, the method embodies the advantages that after LSH is used, storage occupation is low, and calculation speed is increased;

avage, Hadamard, Weighted-L1, Weighted-L2 are comparative methods of vec2link +, which only use the information of the user relationship topology map in LBSN, and the implementation scheme can be referred to the literature [ Grover, Aditya, and J Leskovec ] "node2vec: Scalable features learning for networks." Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and timing.

Jaccard is a vec2link + comparison method for comparing similarity and difference between finite sample sets;

walk2friend is a comparison method of vec2link +, which uses only data recorded by user check-in lbs n, and the implementation can be found in references [ backs, Michael, et al ] "Walk2friends: involved Social Links from Mobility profiles." Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications security.acm,2017 ].

From the test results in table 2, it can be seen that the prediction effect of the method for predicting the social network link based on the multi-source heterogeneous data fusion is comprehensively superior to that of the method for predicting the link only using a single data source.

Therefore, the method can effectively overcome the problem of inaccurate prediction result when the single data source in the LBSN is used for link prediction, and realizes the improvement of the link prediction effect. The invention adopts a deep learning method to fuse two heterogeneous data sources of a user relationship topological graph and a user check-in record in the LBSN to realize link prediction. And meanwhile, the LSH is adopted, and a discrete binary vector represents a user node, so that the calculation speed of the model is accelerated, and the storage overhead is saved. The invention achieves the purposes of high calculation speed, less storage consumption and better link prediction effect than single-source data prediction effect.

It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims

1. a method for predicting social network links based on multi-source heterogeneous data fusion, is characterized in that comprising the following steps:

S1,

: From the user relationship topology map

The training set Tra and test set T es are extracted from ; where V represents the set of user nodes in the topology graph, and E represents the set of edges in the topology graph; if two users u _i and u _j in G have a social relationship, then There is an edge between them, denoted as e _ij =( u _i , u _j );

S2,

: Using the network representation learning method, learn and obtain the social network user vector of V from the positive sample G ′ of Tra , denoted as

, where d is

dimension;

S3,

: According to the user check-in record S = ( U , L ), construct a user-location check-in frequency matrix

; where U and L represent the user set and check-in point set in S respectively, N is the number of users in U , M is the number of check-in points in L ; then use Poisson matrix decomposition to obtain user access in low-dimensional vector space preference vector, denoted as

, where D is

dimension;

S4,

: In order to capture the association between the two types of data sources G and S in LBSN, an improved deep learning model called AL is designed by using the anchor link method;

As a sample,

As the labels corresponding to the samples, the two vectors are input into the AL together for multiple rounds of training; the final trained AL is used to generate a new user access preference vector that incorporates the topology information in G

;

S5,

:Will

and

Perform re-integration and input it to a convolutional neural network Convolutional Neural Network, or CNN for short, for training; finally, input T es into the trained CNN for link prediction, and obtain the prediction result .

2. the method for social network link prediction based on multi-source heterogeneous data fusion according to claim 1, is characterized in that described step S1 comprises following substep:

S11, perform data cleaning, so that the users in G and S in the LBSN are consistent;

S12, select some links from G as T es positive samples; at the same time ensure that the T es positive samples are removed from G

is connected; take G ′ as a positive sample of Tra ;

S13, randomly select some non-existing links from G as negative samples, and allocate them to Tra and T es according to a predefined ratio .

3. the method for social network link prediction based on multi-source heterogeneous data fusion according to claim 1, is characterized in that described step S3 comprises following substep:

S31, utilizes S to construct H ; Wherein, the row of H represents the user, and the column represents POI, and the value of H is filled by the number of times the user visits the corresponding POI;

S32, perform Poisson matrix decomposition on H to obtain a matrix reflecting the user's visit preference

and POI feature matrix

; The POI feature matrix can reflect the situation that a POI is accessed by users;

the line as

.

4. the method for social network link prediction based on multi-source heterogeneous data fusion according to claim 1, is characterized in that described step S4 comprises following substep:

S41, using

Indicates the user nodes in Tra , and calculates the cosine mean cos _ori of the pairs of user points in Tra;

S42, use the one-to-one correspondence between users in V and U to capture the association between G and S ;

and corresponding labels

Divide it into multiple batches and input them into the Multilayer Perceptron, or MLP for short, for training;

S43, the parameters in the model AL are optimized through multiple rounds of training; after the AL is trained, the

input AL, output

.

5. the method for social network link prediction based on multi-source heterogeneous data fusion according to claim 1, is characterized in that described step S5 comprises following substep:

S51, will

and

concatenated into a vector

;

S52, applying LSH will

Project to a binary vector

above _, user ui is _represented by mi ;

S53, for any edge e _ij =( u _i , u _j ) in G , adopt the same method to obtain m _j as the representation of user u _j ;

S54, concatenate m _i and m _j to obtain the binary vector representation of edge e _ij

;

S55, fill the elements in the vector m _ij with a length of 2 m into a square matrix of size n × n in row-first order, this process is called reshaping; then input this square matrix into a convolution Neural network Convolutional NeuralNetwork, referred to as CNN, for training;

S56, input Te into the trained CNN for link prediction.