Disclosure of Invention
The LBSN data set used by the invention comprises two data sources with different structures, namely a user relation topological graph and a user check-in record. The user relationship topological graph is formed by the relationship among users, the relationship among the users is called as a link (namely a point pair), and each link is formed by the relationship of two user nodes. The user check-in record is composed of a check-in user node, a check-in Point longitude, a check-in Point latitude, a check-in time and a Point-of-Interest (POI).
The invention aims to solve the problem of inaccurate prediction result when a single data source in LBSN is used for link prediction. The basic idea of the invention is to provide a hybrid framework, which fuses two heterogeneous data, namely a user relationship topological graph and a user check-in record in LBSN (location based service), so as to realize link prediction and enhance the prediction effect of the existing link prediction method. And meanwhile, the LSH is adopted to improve the performance of calculation and storage.
Based on the above invention thought, the invention provides a social network link prediction method based on multi-source heterogeneous data fusion, which comprises the following steps:
s1, Data _ process (g) → Tra, Tes: and extracting a training set Tra and a test set Tes from the user relation topological graph G ═ V, E. Wherein V represents the set of user nodes in the topology graph and E represents the set of edges in the topology graph. If two users u in GiAnd ujIf there is a social relationship, there is an edge between them, denoted as eij=(ui,uj);
S2,
Learning and acquiring a social network user vector V from a positive sample G' of Tra by adopting a network representation learning method, and recording the social network user vector V as
Wherein d is
Dimension (d);
S3,
constructing a user-position check-in frequency matrix according to the user check-in record S ═ U, L
Wherein U and L respectively represent a user set and a check-in point set in S, N is the number of users in U, and M is the number of check-in points in L. And then obtaining a user access preference vector in a low-dimensional vector space by using Poisson matrix decomposition, and recording the user access preference vector as
Wherein D is
Dimension (d);
S4,
to capture the association of these two types of data sources in the LBSN, in a manner similar to anchor link (anchor link), an improved deep learning model, called AL, was designed.
As a sample, a sample of,
as the label corresponding to the sample, the two vectors are input into AL together for multi-round training. Generating a new user access preference vector fused with topology information in G using the final trained AL
S5,
Will be provided with
And u
i 'vAnd performing fusion again, and inputting the fused signal into a Convolutional Neural Network (CNN) for training. And finally, inputting the Tes into the trained CNN for link prediction to obtain a prediction result.
In the above method for predicting a social network link based on multi-source heterogeneous data fusion, the step S1 is to obtain Tra and Tes. Link prediction can be viewed as a binary problem, with links present in G being viewed as positive samples and links not present being viewed as negative samples. The positive samples in Tra are the user relationship topological graph G' epsilon G missing part of the links, and the missing links are used as the positive samples of Tes. The method specifically comprises the following steps:
s11, data cleaning is carried out, so that users in G and S in LBSN are consistent;
s12, select some links from G as Tes positive samples. Meanwhile, the G' belonging to G after the Tes positive sample is removed from G is ensured to be communicated; taking G' as a positive sample of Tra;
and S13, randomly selecting some nonexistent links from G as negative samples, and distributing the links into Tra and Tes according to a predefined proportion.
In the above method for predicting social network links based on multi-source heterogeneous data fusion, the step S3 is to obtain
The method specifically comprises the following steps:
s31, construction of H using S. Wherein the row of H represents the user, the column represents the POI, and the value of H is filled by the number of times that the user accesses the corresponding POI;
s32, performing Poisson matrix decomposition on H to obtain matrix reflecting user access preference
And POI feature matrix
The POI feature matrix can reflect the condition that a certain POI is visited by a user. U shape
sAs a line of
In the above method for predicting the social network link based on multi-source heterogeneous data fusion, the step S4 is to capture the association between G and S in the lbs n, so as to implement fusion. In order to capture such a correlation, the training of the model AL specifically comprises the following sub-steps:
s41, utilizing
Representing the user nodes in the Tra, and calculating the cosine mean cos of the user point pairs in the Tra
ori;
S42, capturing the association between G and S by using the one-to-one correspondence of users in V and U. Mixing the sample
And corresponding label
Dividing the data into a plurality of batches (batch) and circularly inputting the batches into a Multi Layer Performance (MLP) for training;
and S43, optimizing the parameters in the model AL through multiple rounds of training. After AL is trained, will
Input AL, output u
i 'v。
The implementation method of the model AL described above involves two calculation functions in step S42. The first calculation function is a mapping function for capturing the one-to-one correspondence relationship between users in V and U, and is recorded as
The mapping function corresponds to a loss function of
Where x represents a sample, y represents a true value, a represents an output value of the model, and n represents the number of samples. A random gradient descent algorithm is called to optimize a global weight parameter W and a global deviation parameter b, and the optimization processes are respectively recorded as
Where σ represents the activation function and z is the input to the neuron, expressed as
The second calculation function is to ensure that u is generated
i 'vNo offset, i.e. use of u
i 'vThe mean value of the cosine of the calculation of the user point pair in the Tra is not less than cos
ori. Therefore, a cosine mean constraint limit is introduced, noted
Wherein
And
respectively represent users u
mAnd user u
nIs given, and e is present in G
mn. N (U) represents the number of users in U. The cosine mean constraint limits the corresponding loss function to
The tuning processes of the global weight parameter W and the global bias parameter b are respectively recorded as
In the above method for predicting a social network link based on multi-source heterogeneous data fusion, in step S5, G and S are fused again to realize link prediction with low storage consumption and high calculation speed. The method specifically comprises the following steps:
s51, mixing
And u
i 'vSpliced into a vector
S52, applying LSH
Projecting into a binary directionQuantity m
i∈{0,1}
mUp, user u
iUsing m
iRepresents;
s53, for any edge e in Gij=(ui,uj) Obtaining m by the same methodjAs user ujIs represented by (a);
s54, mixing miAnd mjStitching to obtain edge eijIs represented by a binary vector of mij(∈{0,1}2m)=[mi;mj];
S55, converting the vector m with the length of 2mijThe elements in (1) are sequentially filled into a square matrix with the size of n × n in the order of line priority, and the process is called reshaping. Then inputting the square matrix into a Convolutional Neural Network (CNN) for training;
and S56, inputting Tes into the trained CNN for link prediction.
Compared with the prior art, the invention has the following beneficial effects:
1. the method for predicting the social network link based on the multi-source heterogeneous data fusion generates a new user access preference vector fusing the user relationship topological graph by using the model AL similar to the anchor link, and can fully capture the association between the social network user vector and the user access preference vector. The AL can align the social network user vector and the user access preference vector of the same user, and can ensure that a new user access preference vector generated after the user vectors in the two data sources are fused does not shift by introducing a constraint mechanism of cosine mean.
2. The invention discloses a social network link prediction method based on multi-source heterogeneous data fusion, which is characterized in that a social network user vector and a user access preference vector which are respectively obtained by two data sources in an LBSN (location based service) N (location based service) are spliced, the spliced vector is projected onto a binary vector by applying the LSH (least significant Shift H), the binary vector is taken as a final representation vector of a user node in the LBSN, and the final vector of the user node is used for splicing and representing the point-to-point relation according to the point-to-point relation contained in a link in a user relation topological graph. And finally, reshaping the spliced vector into a square matrix, and inputting the square matrix into the CNN to realize link prediction. The application of the LSH can improve the computation speed of deep learning for training and reduce the storage overhead.
Drawings
FIG. 1 is an anchor-like link model AL for capturing associations between social network user vectors and user access preference vectors based on a multi-tier perceptron (MLP).
FIG. 2 is an overall model architecture of a social network link prediction method based on multi-source heterogeneous data fusion. Splicing the new user access preference vector fused with the user relationship topological graph generated in the graph 1 with the social network user vector, projecting the spliced vector to a binary vector by applying an LSH (least squares) algorithm, taking the binary vector as a final representation vector of the user nodes in the LBSN (location based service) N, and splicing the final vector of the user nodes to represent the point-to-point relationship according to the point-to-point relationship contained in the link in the user relationship topological graph. And finally reshaping the spliced vector into a square matrix, and inputting the square matrix into the CNN.
FIG. 3 is a comparison of the performance of the social network link prediction method based on multi-source heterogeneous data fusion when LSH is not used and when LSH is used. Wherein (a) is memory consumption comparison, (b) is CPU consumption comparison, (c) is GPU consumption comparison, and (d) is time consumption comparison. (d) Is the vector dimension input to CNN.
Interpretation of terms
LBSN is an abbreviation for Location-based Social Network, representing a "Location-based Social Network". The LBSN not only contains the contact between people in the traditional social network, but also records the time of the user sign-in, the geographic position and other information.
POI is an abbreviation for Point-of-Interest, representing a "Point of Interest". In lbs n, a POI is a place where a user checks in.
LSH is an abbreviation for local Sensitive Hashing, meaning "Locality Sensitive Hashing". The method is a rapid nearest neighbor search algorithm aiming at massive high-dimensional data.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Examples
The method for predicting the social network link based on the multi-source heterogeneous data fusion can be used for data sets comprising two data sources, namely a user relationship topological graph and a user check-in record. With the real-world LBSN data set shown in Table 1, such as Foursquare (available from Foursquare)http://snap.stanford.eduAcquisition) was performed as an example.
Table 1: relevant information of social link prediction training set for multi-source heterogeneous data fusion
Dataset
|
#check_ins
|
#POIs
|
#edges
|
#users
|
Foursquare@NYC
|
22,563
|
1,992
|
5,810
|
588
|
Foursquare@TKY
|
38,742
|
2,212
|
9,624
|
1,055
|
Gowalla@DC
|
13,594
|
4,795
|
5,826
|
880
|
Gowalla@CHI
|
10,314
|
3,269
|
2,542
|
627
|
Brightkite
|
75,522
|
4,038
|
33,008
|
1,502 |
FIG. 1 is an anchor-like link model AL for capturing associations between social network user vectors and user access preference vectors based on a multi-tier perceptron (MLP).
As shown in fig. 1, step S1 is first substituted with the user relationship topology G in Foursquare: data _ process (G) → Tra, Tes, a training set Tra and a test set Tes are acquired. The positive sample G' in Tra is substituted into step S2:
wherein the social network user vector can be obtained using a common network learning representation method, such as node2vec
Next, step S3 is substituted first using the user check-in record S in Foursquare:
obtaining a user access preference vector
The outputs of steps S2 and S3
And
input into model AL, using step S4:
obtaining u
i 'v。
Table 2: effect of social link prediction on three real data sets
FIG. 2 is an overall model architecture of a social network link prediction method based on multi-source heterogeneous data fusion.
As shown in FIG. 2, the outputs of steps S2 and S4
And u
i 'vInput into the prediction model CNN, using step S5:
and acquiring a final prediction result. The link prediction effect of the hybrid model is shown in table 2.
# check _ ins indicates the number of user check-in records;
# POIs indicates the number of different POIs in the user check-in record;
# edges indicates the number of links in the user relationship topology;
# users represents the number of users in the user relationship topology graph (or user check-in record);
foursquare @ NYC represents data in the data set Foursquare with the area of New York City;
foursquare @ TKY represents data in the data set Foursquare with the region of Tokyo;
gowalla @ DC represents data in the data set Gowalla with regions of Washington;
gowalla @ CHI represents data in the data set Gowalla with the region Chicago;
vec2 link-is a method of social network link prediction based on multi-source heterogeneous data fusion without using LSH;
vec2link + is a social network link prediction method based on multi-source heterogeneous data fusion and using LSH, and compared with vec2link-, the method embodies the advantages that after LSH is used, storage occupation is low, and calculation speed is increased;
avage, Hadamard, Weighted-L1, Weighted-L2 are comparative methods of vec2link +, which only use the information of the user relationship topology map in LBSN, and the implementation scheme can be referred to the literature [ Grover, Aditya, and J Leskovec ] "node2vec: Scalable features learning for networks." Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and timing.
Jaccard is a vec2link + comparison method for comparing similarity and difference between finite sample sets;
walk2friend is a comparison method of vec2link +, which uses only data recorded by user check-in lbs n, and the implementation can be found in references [ backs, Michael, et al ] "Walk2friends: involved Social Links from Mobility profiles." Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications security.acm,2017 ].
From the test results in table 2, it can be seen that the prediction effect of the method for predicting the social network link based on the multi-source heterogeneous data fusion is comprehensively superior to that of the method for predicting the link only using a single data source.
Therefore, the method can effectively overcome the problem of inaccurate prediction result when the single data source in the LBSN is used for link prediction, and realizes the improvement of the link prediction effect. The invention adopts a deep learning method to fuse two heterogeneous data sources of a user relationship topological graph and a user check-in record in the LBSN to realize link prediction. And meanwhile, the LSH is adopted, and a discrete binary vector represents a user node, so that the calculation speed of the model is accelerated, and the storage overhead is saved. The invention achieves the purposes of high calculation speed, less storage consumption and better link prediction effect than single-source data prediction effect.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.