Open Access Published by De Gruyter January 10, 2015

Homophily-Based Link Prediction in The Facebook Online Social Network: A Rough Sets Approach

Islam Elkabani and Roa A. Aboo Khachfeh

From the journal Journal of Intelligent Systems

https://doi.org/10.1515/jisys-2014-0031

Abstract

Online social networks are highly dynamic and sparse. One of the main problems in analyzing these networks is the problem of predicting the existence of links between users on these networks: the link prediction problem. Many studies have been conducted to predict links using a variety of techniques like the decision tree and the logistic regression approaches. In this work, we will illustrate the use of the rough set theory in predicting links over the Facebook social network based on homophilic features. Other supervised learning algorithms are also employed in our experiments and compared with the rough set classifier, such as naive Bayes, J48 decision tree, support vector machine, logistic regression, and multilayer perceptron neural network. Moreover, we studied the influence of the “common groups” and “common page likes” homophilic features on predicting friendship between users of Facebook, and also studied the effect of using the Jaccard coefficient in measuring the similarity between users’ homophilic attributes compared with using the overlap coefficient. We conducted our experiments on two different datasets obtained from the Facebook online social network, where users in each dataset live within the same geographical region. The results showed that the rough set classifier significantly outperformed the other classifiers in all experiments. The results also demonstrated that the common groups and the common page likes features have a significant influence on predicting the friendship between users of Facebook. Finally, the results revealed that using the overlap coefficient homophilic features provided better results than that of the Jaccard coefficient features.

Keywords: Link prediction; homophily; online social networks; rough set theory

1 Introduction

An online social network (e.g., Facebook, LiveJournal, and Google+) is a social structure that models the interaction between people in a group or community [2]. These online social networks allow members to maintain user profiles with basic information (geographic location, gender, etc.), interests, and friends. This user profile represents the basis for grouping users and recommending friendship links between them.

An online social network can be represented as a graph where nodes are the set of actors (together with their information) and the edges represent some type of associations between these actors, such as friendship or interaction links. The rapid growth of these social networks and the fact that new nodes are added to the graph over time defines a basic problem underlying social network evolution: the link prediction problem. It is the problem of predicting the likelihood of a future association (e.g., friendship, interaction) between two nodes, knowing that there is no association between the nodes in the current state of the graph [2].

Link prediction is a subfield of analyzing social network in which the target is to deduce some unobservable links based on the existent observations and relationships. It also has many applications in different domains, like in e-commerce (building recommendation systems), bioinformatics (protein-protein interactions), and many security-related applications (identifying hidden groups of criminals) [10].

Link prediction methods can generally be grouped into two approaches: those that just use the link structure of the network and those that use both the attributes and nodes of the network. In the former approach, topological features are used to measure the connectivity of nodes in the network. However, the latter approach incorporates additional similarity features that measure the correspondence among the attributes of the nodes based on the sociological principle of “homophily” [12]. Homophily is the tendency of individuals with similar interests, culture, or language to associate and bond with each other [17].

Many algorithms were proposed for predicting links in social networks [10]. Some of these algorithms were based on Bayesian probabilistic models and probabilistic relational models. Some other algorithms were based on computing a similarity score between a pair of nodes so that a supervised learning method, such as decision trees or neural networks, can be employed. In this work, we study predicting friendship links on the Facebook online social network based on the users’ homophilic features. We experiment using the rough set-based algorithm and five other supervised learning algorithms. Their performance in the link prediction problem is evaluated, and a comparative analysis among them is performed. We also study the effect of using the Jaccard coefficient in measuring the similarity between users’ homophilic attributes compared with using the overlap coefficient. Finally, we study the effect of using two additional homophilic features – common groups and common page likes – on predicting friendship links in the Facebook online social network.

The remainder of the article is structured as follows: Section 2 describes related works. In Section 3, the rough set-based classification is introduced. Section 4 describes the data collection process from the Facebook online social network. In Section 5, the set of homophilic features used in our experiment are outlined. Section 6 introduces our new approach and describes the experiments. In Section 7, the results of the experiments are presented and discussed. Finally, Section 8 concludes the article and provides a brief about our future work.

2 Related Works

A considerable amount of work has been recently conducted to investigate the problem of link prediction in social networks [10]. Some approaches are based on Bayesian probabilistic models and probabilistic relational models. Others are posed as a binary classification model where a similarity score is computed between a pair of nodes, based on a set of features: graph topological features (network structure) or edge/vertex attributes.

Liben-Nowell and Kleinberg [16] formalized the link problem as follows: given a snapshot of a social network, can we infer new interactions among its members? They developed an approach for link prediction based on measuring and analyzing the proximity of nodes in a network (e.g., common neighbors, SimRank).

Hasan and Zaki [10] extended this work in two ways. First, they showed that using external data outside the scope of graph topology can significantly improve the prediction result. Second, they used various similarity metrics as features in a supervised learning setup.

Steurer and Trattner [24] studied the extent to which interactions between users in online social networks can be predicted by looking at features (topological and homophilic) from social network and position data. They used the binomial logistic regression algorithm for link prediction.

Backstrom and Leskovec developed in [4] an algorithm based on supervised random walks that naturally combines the information from the network structure with node and edge level attributes.

In [8], Golder and Yardi found, by experimenting on the microblogging service Twitter, that two structural characteristics – transitivity and mutuality – are significant predictors of the desire to form new ties. They investigated the link prediction problem from the natural topological perspective.

Gilbert and Karahalios proposed in [7] a regression model to predict friendship links on Facebook based on a set of homophilic features consisting of users’ demographic and interactions but not including their interests. A data set consisting of 35 users of Facebook were used in their study.

Patil et al. [20] considered the problem of predicting friendships between actors in a social network from only their sociodemographic attributes. Two different kinds of social networks were studied: a popular online friendship network and an offline network of mutual partner choices in speed dating. The decision tree was used as the classifier. It achieved an accuracy exceeding 85% on the online social network and exceeding 90% on the speed-dating network.

In [1], Aiello et al. formalized the link prediction problem as follows: “Given a subset of users U_t ⊆ U, we want to predict the presence or absence of a social link for every pair (u, v)∈{U_t × U_t|u ≠ v}.”. They studied the correlation between social and semantic features in three online social networks – Flickr, Last.fm, and aNobii – based on four different types of features: groups, tags, library, and tagged items. The J48 decision tree was used as the binary classifier in their work.

Most of the previous studies applied various classification algorithms such as naive Bayes, binomial logistic regression, decision tree, support vector machine (SVM), and multilayer perceptron (MLP) to predict links between users in online social networks based on topological or homophilic features. In our work, we investigate the link prediction problem from the network homophilic features perspective in the Facebook online social network. The rough set-based classification algorithm [14] is used for link prediction, and its performance is compared with that of other various classification algorithms.

3 Rough Set-Based Classification

The rough set theory (RST) was developed by Zdizslaw Pawlak in the early 1980s [21]. It has been used in the classification of imprecise, uncertain, or incomplete information [15]. RST has been successfully applied in many real-life problems in medicine, banking, toxicity predictions, and financial markets [13, 15]. One of the important features of RST is its ability to identify the importance of different attributes and to reduce them using one of its reduction algorithms. Finding minimal subsets (reducts) of attributes that are efficient for rule making is a central part of its process, which consequently improves the efficiency of the predictor [22].

In RST, the structure of data is represented in the form of a decision system (DS) [3]: DS = (U, A∪{d}), where U is a non-empty finite set of observations objects called the universe, whereas A is a non-empty finite set of attributes. Every attribute a∈A is a total function of the form a: U → Va, where Va is the set of allowable values for the attribute a. The attributes belonging to A are called conditional attributes, whereas d is called a decision attribute.

A reduct (R) of A refers to the minimal selection of attributes that can be used to represent all classes of the decision system, thus preserving the original classification defined by A. A reduct has two main properties:

C(R) = C(A), i.e., R produces the same classification (C) of objects as the collection A of all attributes.
For any attribute a∈R, C(R – {a}) ≠ C(R), i.e., a reduct is a minimal subset with respect to property 1.

A method of computing reducts is given by Skowron and Rauszer [23] and described as follows. First, a discernibility matrix is constructed from the data set. This matrix shows the differences between each pair of objects with respect to attributes. A discernibility matrix is defined as an n× n matrix, where n is the number of objects in the dataset. The general (i,j)-th term c_ij of this matrix is the set of attributes for which the i-th (x_i) and j-th(x_j) object of the dataset differ:

Cij = {a∈A|x∈U and a(xi) ≠ a(xj)}, for i,j = 1,…,n.

Second, a Boolean discernibility function f is constructed. This function describes constraints that must hold to preserve the discernibility between all pairs of discernible objects in the information system, and requires keeping at least one attribute from each non-empty element of the discernibility matrix corresponding to any pair of discernible objects. The discernibility function is parallel to the discernibility matrix and represented as follows:

f (a¯1,…a¯m) = V{Λc¯ij|1 ≤ j ≤ i ≤ n, cij ≠ ϕ},

where a̅₁…a̅_m are Boolean variables corresponding to the m attributes in A, and c̅_ij = {a̅|a∈c_ij}.

Moreover, in case of classifying data with respect to the decision attribute d, decision-relative discernibility reducts are computed and a decision is made. Thus, a decision-relative discernibility matrix is constructed similar to the discernibility matrix stated above. This matrix shows the differences between each pair of objects, with respect to attributes and decision classes. In this case, the two objects x_i and x_j of the dataset take different values and are also in different decision classes:

c′ij = {{a∈A|x∈U and a(xi) ≠ a(xj)} if d(xi) ≠ d(xj)ϕ if d(xi) = d(xj).

From this decision-relative discernibility matrix, the decision-relative discernibility function is constructed in the same way as the discernibility function stated above:

f′(a¯1,…,a¯m) = V{Λc′¯ij|1 ≤ j ≤ i ≤ n, c′ij ≠ ϕ}.

As a result, the set of prime implicants of this function determines the set of all reducts of A.

One of the most efficient rough set reduction algorithms is the Johnson algorithm [18]. It is a typical greedy algorithm that tends to find a single prime implicant of minimal length. It provides a reduction method that is fast and efficient with the minimum overhead.

Algorithm 1 represents the Johnson algorithm.

Algorithm 1

The algorithm starts by initializing the current reduct candidate R to the empty set. Then, each conditional attribute appearing in the discernibility function is evaluated according to the heuristic measure representing the count of the number of appearances an attribute makes within clauses. Attributes that appear more frequently are more significant. The attribute with the highest heuristic value is added to R, and all clauses in the discernibility function containing this attribute are removed. The algorithm terminates when all clauses have been removed, and the reduct R is obtained [11].

The Johnson algorithm is implemented in the ROSETTA software [18] used in our rough set-based link prediction algorithm. In this work, we used rough sets for classification to exploit two of its main advantages over other traditional classifiers: (i) selecting an optimal subset of attributes quickly and effectively, and (ii) dealing effectively with the missing attribute values problem.

4 Data Collection

We conducted our experiment on two different datasets obtained from the online social network Facebook, where its structure can be represented as an unweighted, undirected graph.

Two common methods are used by researchers to collect data from online social networks. First, the API (application programming interface) calls if they are supported by the online social network operator. This method allows third parties of an online social network to access some of their specified data precisely. The second method is to perform data scraping straight from the online social network website by requesting the specific URLs and fetching the web page as if we are using a web browser, and then parsing the returned HTML page and extracting the needed information.

In Facebook, the API calls are more restrictive than HTML scraping. For our work, we are only interested in public data that are available for any logged-in user. Thus, we have implemented our Facebook crawler, by HTML scraping, using the breadth-first search (BFS) algorithm [5]. This algorithm is a classic graph traversal algorithm that starts from a seed node and then iteratively explores the neighboring nodes. In each iteration, the next but not yet visited node is selected.

Algorithm 2 shows the general process of crawling an online social network based on the BFS method.

Algorithm 2

Our crawler has been implemented in Java and SQL Server 2008. We run it on a machine with 2-GB RAM and a data storage space up to 5 GB. It can be described as follows:

The Facebook server is contacted through an HTTP request providing credentials required for the authentication through the cookies that Facebook maintains for each request and response.
Once logged in, HTTP requests are sent to open the seed profile to be processed to retrieve the needed data. For example, to retrieve the personal information of a node, the corresponding URI used is https://m.facebook.com/nodeID.
The list of friends for this node is also retrieved through the following URI: https://m.facebook.com/ profile.php?v=friends&id=nodeID.
The collected HTML pages are parsed using regular expressions, and the relevant information is extracted.
The friends’ IDs of the visited node are inserted into a FIFO queue to be visited cyclically.

The information obtained about each user profile visited during crawling are as follows: ID (which is a 32-bit number that uniquely identifies a user), name, gender, birthday, activities, athletes, education, games, interests, languages, movies, music, sport, teams, TV shows, work, books, religion, political view, city, hometown, pages liked, groups, and the list of friends. We downloaded the profile data of 5038 users for the first dataset and 1430 users for the second dataset. For each dataset, all of the Facebook users have a common feature – they live in the same geographical region. This is because socially connected pairs of users are more likely co-residents of the same geographical location [20]. Both data sets are generated in a way that yielded two balanced data sets to ensure the accuracy of the classification process.

5 Homophilic Feature Sets

As already mentioned, the aim of this work is to predict friendships between users in the Facebook online social networks based on their homophilic features. We can consider the social network as an undirected graph G_F (V_F, E_F), with V_F representing the users and e= (u, v)∈E_F is the friendship relation existing between the users u and v, where this relationship is based on a mutual agreement that both users confirm.

Users of Facebook can enhance their social network profile by adding information such as interests, activities, language, and favorite movies. Additionally, users may join common-interest groups, organized by workplace, school or college, or other characteristics; they can also like other pages on the network. We define a set of homophilic features based on the attributes collected from the users’ profiles after discarding those attributes that have missing values for most of the users. The attributes selected for use in prediction are activities, athletes, education, games, interests, languages, movies, music, teams, TV shows, work, books, groups, and page likes. The set of homophilic features is calculated as the attributes’ similarity between users u, v.

The Jaccard coefficient is the first homophilic feature used to measure the similarity, and it is defined as the size of the intersection divided by the size of the union of the sample sets [6].

Jaccard coefficient for interests: This homophilic feature is a combination of total interests and common interests between two users:
JC_Interests(u, v) = |I(u)∩I(v)||I(u)∪I(v)|,
where I(u) and I(v) are the set of interests for the users u and v, respectively.
The Jaccard coefficient is also applied to the rest of the selected user profile attributes. However, we intuitively figured out that a better homophilic feature for measuring the similarity, especially of multivalued attributes, can be found by using the overlap coefficient defined as the size of the intersection between two sets divided by the minimum size of the two sets.
Overlap coefficient for interests:
OC_Interests(u, v) = |I(u)∩I(v)|Min (|I(u)|,|I(v)|),
The overlap coefficient is also applied to the rest of the selected user profile attributes.

6 Homophily-Based Link Prediction

Figure 1 describes the steps followed in our approach to predict friendship links in the Facebook online social network based on the users’ homophilic features.

Figure 1.

Homophily Based Link Prediction Process.

The first step was the data collection by crawling the online social network Facebook for users’ profiles as described in Section 4. After collecting the data, the following data preprocessing steps were applied:

The attributes having missing values for the most of users’ profiles were discarded.
The isolated users, whose friends’ list is missing owing to privacy settings, were removed from our collected profiles.
The users’ profiles having missing values for most of their attributes also were removed from our collected profiles.

Table 1 shows the remaining attributes used in our work and the percentage of availability for each attribute in the collected profiles after the preprocessing step.

Table 1.

Percentage of Available Data for Each Attribute.

Attribute	% of Available Data
Attribute	First Dataset	Second Dataset
Activities	58	54
Athletes	57	58
Education	64	59
Games	42	40
Interests	50	55
Languages	21	25
Movies	64	67
Music	78	79
Teams	50	58
TV Shows	68	73
Work	56	61
Books	52	49
Groups	61	86
Page Likes	90	92

The next step was the pairing of profiles; 10,054 user pairs were generated from the first dataset collected and 3000 user pairs from the second one. For each pair of profiles, two data sets were generated: the first dataset is the set of homophilic features generated based on the Jaccard coefficient similarity, and the second dataset is the set of homophilic features generated based on the overlap coefficient similarity. In both cases, the score obtained has a value between 0 and 1. Thus, we chose a parameter (threshold) as a cutoff point to decide whether the pair of profiles matches on a given attribute or not. We varied the threshold in steps of 0.1 (ranging from 0.1 to 1.0), with a threshold of 0.1 indicating a loose matching and a threshold of 1.0 indicating a tight matching.

Each dataset consists of a set of matching along with the ID of the first user and that of the second user. Each single matching is a t bit vector, with each bit denoting whether the pair of users matches on a particular attribute A_t. A single bit is also used to indicate whether a link exists between a pair of users in the network. A value of 0 (FALSE) indicates that there is no link or friendship relation between the pair of users, and a value of 1 (TRUE) indicates that the users are friends. Two experiments were conducted on the two Facebook datasets described earlier to study the extent to which friendship between users in online social networks can be predicted on the basis of homophilic features.

In the first experiment, we predicted links by employing five different classification algorithms adopted in most of the previous related works. In the second experiment, the rough set classification algorithm was employed for predicting links between users. Moreover, two cases were taken into consideration for both experiments. In the first case, we excluded from the attributes that we have listed earlier the groups and page likes. In the second case, these two attributes were included, and the set of homophilic features were augmented with two more features to measure the common groups and the common page liked by pair of users. These two attributes were rarely studied in the literature for predicting links between users based on homophilic features. In our experiments, we decided to study their influence on predicting friendship between users on the Facebook online social network.

For both experiments, we considered the social network as an undirected graph for the computation of the edge features (i.e., the existence of a link between two users). In the first dataset, of the 10,054 user pairs used in our experiments, there were 5106 pair of edges (u, v) in the undirected graph (i.e., u and v are currently friends on the social network) and 4948 pairs of users without edges between them (i.e., they are not currently friends on the social network). As for the second dataset, of the 3000 user pairs used in our experiments, there were 1590 pairs of edges (u, v) in the undirected graph and 1410 pairs of users without edges between them. This will yield two balanced datasets ready for training and testing.

In the classification process, 70% of user pairs in our dataset were used for training and 30% were used for testing. Furthermore, we used a 10-fold cross-validation approach to justify our findings. We trained and tested our classifier at each of the thresholds mentioned before (0.1, 0.2,…, 1.0).

6.1 Predicting Using Traditional Classifiers

Five different classification algorithms used in previous related works were employed: naive Bayes, SVM, J48 decision tree, simple logistic regression, and multilayer perceptron (MLPClassifier). The datasets were available as ARFF files. WEKA, a well-known machine learning library [9], was used.

6.2 Predicting Using the Rough Set Approach

In this experiment, the dataset was available as an Excel sheet and loaded in ROSETTA toolkit [18] as an ODBC. The Johnson reduced algorithm was employed to reduce the number of measured attributes and generate the reduct and rules.

7 Results

In this work, we used the F-measure (F₁) for an equal weighting of precision and recall as an evaluation metric for the classification performance [19]. Tables 2–5 show the results (precision, recall, and F₁) obtained for the different classifiers used at threshold 0.8 from the first dataset (10,054 user pairs).

Table 2.

Homophilic Features Calculated Using the Jaccard Coefficient: Without Groups and Page Likes Attributes.

Algorithm	Test Mode
	10-fold cross-validation			Testing and Training
	Precision	Recall	F₁	Precision	Recall	F₁
Naive Bayes	0.71	0.702	0.7	0.707	0.702	0.701
J48 decision tree	0.655	0.654	0.654	0.662	0.661	0.661
Logistic regression	0.715	0.668	0.647	0.721	0.669	0.649
SVM	0.714	0.665	0.642	0.72	0.668	0.646
MLPClassifier	0.72	0.67	0.648	0.721	0.673	0.655
Rough set	0.876	0.834	0.833	0.877	0.841	0.837
Average F₁ measure for the six classifiers	0.687			0.691

Table 3.

Homophilic Features Calculated Using the Overlap Coefficient: Without Groups and Page Likes Attributes.

Algorithm	Test Mode
	10-fold cross-validation			Testing and Training
	Precision	Recall	F₁	Precision	Recall	F₁
Naive Bayes	0.675	0.675	0.675	0.678	0.677	0.677
J48 decision tree	0.671	0.671	0.671	0.634	0.63	0.627
Logistic regression	0.818	0.816	0.815	0.82	0.817	0.816
SVM	0.812	0.807	0.807	0.814	0.809	0.808
MLPClassifier	0.824	0.824	0.824	0.821	0.821	0.821
Rough set	0.945	0.939	0.938	0.915	0.924	0.925
Average F₁ measure for the six classifiers	0.788			0.779

Table 4.

Homophilic Features Calculated Using the Jaccard Coefficient: With Groups and Page Likes Attributes.

Algorithm	Test Mode
	10-fold cross-validation			Testing and Training
	Precision	Recall	F₁	Precision	Recall	F₁
Naive Bayes	0.682	0.669	0.664	0.707	0.702	0.701
J48 decision tree	0.659	0.658	0.658	0.662	0.661	0.661
Logistic regression	0.716	0.669	0.648	0.721	0.669	0.649
SVM	0.716	0.665	0.642	0.72	0.668	0.646
MLPClassifier	0.718	0.67	0.649	0.721	0.673	0.655
Rough set	0.946	0.941	0.941	0.937	0.928	0.927
Average F₁ measure for the six classifiers	0.701			0.706

Table 5.

Homophilic Features Calculated Using the Overlap Coefficient: With Groups and Page Likes Attributes.

Algorithm	Test Mode
	10-fold cross-validation			Testing and Training
	Precision	Recall	F₁	Precision	Recall	F₁
Naive Bayes	0.667	0.666	0.666	0.665	0.665	0.665
J48 decision tree	0.676	0.676	0.676	0.693	0.693	0.693
Logistic regression	0.828	0.826	0.825	0.829	0.826	0.825
SVM	0.819	0.815	0.814	0.825	0.821	0.821
MLPClassifier	0.83	0.83	0.83	0.832	0.831	0.83
Rough set	0.948	0.944	0.943	0.942	0.943	0.94
Average F₁ measure for the six classifiers	0.792			0.796

Tables 6–9 show the results (precision, recall, and F₁) obtained for the different classifiers used at threshold 0.8 from the second dataset (3000 user pairs).

Table 6.

Homophilic Features Calculated Using the Jaccard Coefficient: Without Groups and Page Likes Attributes.

Algorithm	Test Mode
	10-fold cross-validation			Testing and Training
	Precision	Recall	F₁	Precision	Recall	F₁
Naive Bayes	0.624	0.624	0.576	0.588	0.601	0.576
J48 decision tree	0.697	0.683	0.655	0.69	0.678	0.651
Logistic regression	0.706	0.646	0.576	0.702	0.664	0.616
SVM	0.713	0.655	0.59	0.705	0.669	0.624
MLPClassifier	0.707	0.649	0.583	0.703	0.668	0.623
Rough set	0.824	0.823	0.822	0.773	0.745	0.726
Average F₁ measure for the six classifiers	0.634			0.636

Table 7.

Homophilic Features Calculated Using the Overlap Coefficient: Without Groups and Page Likes Attributes.

Algorithm	Test Mode
	10-fold cross-validation			Testing and Training
	Precision	Recall	F₁	Precision	Recall	F₁
Naive Bayes	0.713	0.649	0.581	0.702	0.671	0.63
J48 decision tree	0.712	0.646	0.574	0.704	0.668	0.622
Logistic regression	0.715	0.655	0.591	0.702	0.666	0.619
SVM	0.699	0.685	0.657	0.691	0.679	0.652
MLPClassifier	0.709	0.656	0.594	0.704	0.668	0.622
Rough set	0.828	0.817	0.812	0.869	0.861	0.858
Average F₁ measure for the six classifiers	0.635			0.667

Table 8.

Homophilic Features Calculated Using the Jaccard Coefficient: With Groups and Page Likes Attributes.

Algorithm	Test Mode
	10-fold cross-validation			Testing and Training
	Precision	Recall	F₁	Precision	Recall	F₁
Naive Bayes	0.666	0.652	0.611	0.666	0.651	0.611
J48 decision tree	0.716	0.651	0.582	0.706	0.668	0.621
Logistic regression	0.709	0.669	0.62	0.708	0.663	0.612
SVM	0.71	0.668	0.619	0.705	0.662	0.611
MLPClassifier	0.709	0.668	0.619	0.706	0.661	0.608
Rough set	0.88	0.846	0.874	0.859	0.854	0.852
Average F₁ measure for the six classifiers	0.654			0.652

Table 9.

Homophilic Features Calculated Using the Overlap Coefficient: With Groups and Page Likes Attributes.

Algorithm	Test Mode
	10-fold cross-validation			Testing and Training
	Precision	Recall	F₁	Precision	Recall	F₁
Naive Bayes	0.7	0.691	0.669	0.706	0.696	0.674
J48 decision tree	0.714	0.666	0.612	0.705	0.659	0.604
Logistic regression	0.723	0.701	0.674	0.723	0.702	0.677
SVM	0.71	0.649	0.582	0.704	0.669	0.624
MLPClassifier	0.727	0.703	0.676	0.723	0.704	0.681
Rough set	0.909	0.907	0.906	0.892	0.887	0.886
Average F₁ measure for the six classifiers	0.687			0.691

In all experiments, two testing modes were adopted, which are the 10-fold cross-validation mode and the testing-training mode. Tables 2 and 6 show the results obtained by using the Jaccard coefficient homophilic features and applying the six different classifiers on the first and second dataset, respectively. As for the first dataset, in both testing modes, the least F₁ values were obtained from applying the SVM classifier (between 0.642 and 0.646) and the highest F₁ values were obtained from applying the rough set classifier (between 0.833 and 0.837). For the second dataset, in both testing modes, the least F₁ values were obtained from applying the naive Bayes classifier (0.576) and the highest F₁ values were obtained from applying the rough set classifier (between 0.726 and 0.822). Tables 3 and 7 show the results obtained by using the overlap coefficient homophilic features and applying the six different classifiers on the first and second datasets, respectively. As for the first dataset, in both testing modes, the least F₁ values were obtained from applying the J48 decision tree classifier (between 0.627 and 0.671), whereas the highest F₁ values were obtained from applying the rough set classifier (between 0.925 and 0.938). For the second dataset, the least F₁ values were obtained from applying the J48 decision tree classifier (between 0.574 and 0.622), whereas the highest F₁ values were obtained from applying the rough set classifier (between 0.812 and 0.858). Tables 4 and 8 show the effect of including the groups and page likes attributes as the Jaccard coefficient homophilic features in predicting friendship on the first and second dataset, respectively. For both datasets, the F₁ values increased for most of the classifiers except for the naive Bayes classifier in case of the first dataset and the J48 decision tree in case of the second dataset. Finally, Tables 5 and 9 show the effect of adding the groups and page likes attributes as the overlap coefficient homophilic features in predicting friendship on the first and second dataset, respectively. Again, in this experiment, for both datasets, the F₁ values increased for most of the classifiers except for the naive Bayes classifier in case of the first dataset and the SVM classifier in case of the second dataset.

On the basis of these results, we can conclude the following major findings:

The rough set classifier outperformed the other classifiers in all experiments.
The results of the four different experiments conducted on the first dataset (presented in Tables 2–5) show that the average F₁ measure obtained for the rough set classifier in the four experiments is 0.914 for the 10-fold cross-validation mode and 0.907 for the testing-training mode. These averages exceed all the average F₁ measures of the other five classifiers. The highest average F₁ measure among the five classifiers is that obtained from the MLP classifier, which is 0.738 for the 10-fold cross-validation mode and 0.740 for the testing-training mode.
As for the results of the four experiments conducted on the second dataset (shown in Tables 6–9), the average F₁ measure for the rough set classifier in the four experiments is 0.853 for the 10-fold cross-validation mode and 0.831 for the testing-training mode. Again, these averages exceed the average F₁ measure of the five other classifiers. The highest average F₁ measure among these classifiers is that obtained from the MLP classifier, which is 0.618 for the 10-fold cross-validation mode and 0.634 for the testing-training mode.
Moreover, by comparing the values of the F₁ measure resulting from the experiment conducted on the small dataset (3000 pairs of user profiles) and the larger dataset (10,000 pairs of user profiles), it can be seen that all classifiers perform better when the size of the dataset increases. This indicates the consistency of the results obtained from the different classifiers, especially the rough set classifier.
Thus, on the basis of the aforementioned results, it is obvious that the rough set classifier significantly outperforms the other classifiers in homophily-based link prediction in the Facebook online social network. Although the homophily-based link prediction in online social networks is always associated with many difficulties owing to the missing attributes or incomplete attributes of users’ profiles, the rough sets classification proved a distinguished performance in homophily-based link prediction in the Facebook online social network. This is due to the two main features of rough sets: first, the Johnson reduction algorithm, which provides an effective method for identifying the minimal and optimal subset of attributes. Second, its ability of handling missing valued attributes effectively.
The common groups and the common page likes features have a significant influence on predicting the friendship between users.
The results of the two experiments conducted on the first dataset presented in Tables 2 and 4 show that the value of the average F₁ measure for all classifiers obtained using the Jaccard coefficient homophilic features without the common groups and page likes is 0.687 for the 10-fold cross-validation mode and 0.691 for the testing-training mode. These results increased after including the common groups and page likes to 0.701 for the 10-fold cross-validation mode and 0.706 for the testing-training mode. Also, the results of the two experiments conducted on the second dataset, and presented in Tables 6 and 8, show that the value of the average F₁ measure for all classifiers obtained using the Jaccard coefficient homophilic features without the common groups and page likes is 0.634 for the 10-fold cross-validation mode and 0.636 for the testing-training mode. These results increased after including the common groups and page likes to 0.654 for the 10-fold cross-validation mode and 0.652 for the testing-training mode. The same results were obtained after using the overlap coefficient homophilic features instead of the Jaccard coefficient for both datasets. The average F₁ measure also increased after including the common groups and page likes in the case of using the overlap coefficient homophilic features. These results demonstrate that the common groups and the common page likes features have a significant influence on predicting the friendship between users. In other words, users sharing the same groups and page likes are more likely to be friends on the Facebook online social network.
Using the overlap coefficient homophilic features provided better results than that of the Jaccard coefficient features.
The results of the experiments conducted on the first dataset and presented in Tables 4 and 5 show that the value of the average F₁ measure using the overlap coefficient homophilic features is 0.792 for the 10-fold cross-validation mode and 0.796 for the testing-training mode. These averages exceed the value of the average F₁ measure using the Jaccard coefficient homophilic features, which is 0.701 for the 10-fold cross-validation mode and 0.706 for the testing-training mode. Also, the results of the experiments conducted on the second dataset and presented in Tables 8 and 9 show that the value of the average F₁ measure using the overlap coefficient homophilic features is 0.687 for the 10-fold cross-validation mode and 0.691 for the testing-training mode. These averages exceed the value of the average F₁ measure using the Jaccard coefficient homophilic features, which is 0.654 for the 10-fold cross-validation mode and 0.652 for the testing-training mode. These results show that using the overlap coefficient homophilic features provides better performance than using the Jaccard coefficient features. This is because most of the user profile attributes used in this study were multivalued attributes.

8 Conclusion and Future Work

Link prediction is an important task in analyzing and understanding social networks. Some approaches for link prediction are based on topological features, and others integrate these features with the homophilic features. In this work, we investigated the link prediction problem from a homophilic perspective based on the homophily principle, which states that a contact between similar people occurs at a higher rate than among dissimilar people. For this purpose, datasets from the Facebook online social network were collected. Various classifiers were applied for prediction. The results of the rough set algorithm outperformed the other traditional classifying algorithms. Moreover, including the common groups shared among pairs of users and the common pages they like improved the results of the link prediction, which means that they have a significant influence on predicting friendship between users on online social networks.

As for near-future work, we will incorporate some topological features in link prediction, such as the Jaccard coefficient for common neighbors and the Adamic-Adar measure for link prediction using the different classifiers stated in this work and the rough sets classifier. We will compare the influence of each category of features (homophilic and topological) separately in link prediction, and then we will investigate what will be the results of combining both categories together when predicting links in online social network.

Corresponding authors: Islam Elkabani, Faculty of Science, Mathematics and Computer Science Department, Beirut Arab University, P.O. Box 11-5020 Riad El Solh 11072809, Beirut, Lebanon; and Faculty of Science, Mathematics and Computer Science Department, Alexandria University, Alexandria, 21526, Egypt, e-mail: islam.kabani@bau.edu.lb; and Roa A. Aboo Khachfeh, Faculty of Science, Mathematics and Computer Science Department, Beirut Arab University, P.O. Box 11-5020 Riad El Solh 11072809, Beirut, Lebanon, e-mail: roa_k@bau.edu.lb

Bibliography

[1] L. M. Aiello, A. Barrat, R. Schifanella, C. Cattuto, B. Markines and F. Menczer, Friendship prediction and homophily in social media, ACM Trans. Web6 (2012), 1–33.10.1145/2180861.2180866Search in Google Scholar

[2] M. Al Hasan, V. Chaoji, S. Salem and M. Zaki, Link prediction using supervised learning, in: Proceedings of SDM Workshop of Link Analysis, Counterterrorism and Security, 2006.Search in Google Scholar

[3] Q. Al-Radaideh, The impact of classification evaluation methods on rough sets based classifiers, in: Proceedings of the 2008 International Arab Conference on Information Technology (ACIT2008), University of Sfax, Tunisia, December 2008.Search in Google Scholar

[4] L. Backstrom and J. Leskovec, Supervised random walks: Predicting and recommending links in social networks, in: ACM WSDM, 2011.10.1145/1935826.1935914Search in Google Scholar

[5] S. Catanese, P. De Meo, E. Ferrara, G. Fiumara and A. Provetti, Crawling Facebook for social network analysis purposes, in: Proc. International Conference on Web Intelligence, Mining and Semantics, p. 52, ACM, Sogndal, Norway, 2011.10.1145/1988688.1988749Search in Google Scholar

[6] S. Cesare and Y. Xiang, Software Similarity and Classification, Springer Briefs in Computer Science, 2012.10.1007/978-1-4471-2909-7Search in Google Scholar

[7] E. Gilbert and K. Karahalios, Predicting tie strength with social media, in: Proc. of CHI, 2009.10.1145/1518701.1518736Search in Google Scholar

[8] S. A. Golder and S. Yardi, Structural predictors of tie formation in Twitter: transitivity and mutuality, in: Proc. SocialCom, pp. 88–95, 2010.10.1109/SocialCom.2010.22Search in Google Scholar

[9] S. A. Hall M., E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I. Witten, The WEKA data mining software: an update, in: ACM SIGKDD Explorations Newsletter, pp. 10–18, 2009.10.1145/1656274.1656278Search in Google Scholar

[10] M. Hasan and M. Zaki, A survey of link prediction in social networks, in: C. Aggarwal, ed., Social Network Data Analytics, 1st Ed., pp. 243–276, Springer, New York, 2011.10.1007/978-1-4419-8462-3_9Search in Google Scholar

[11] R. Jensen, Rough set-based feature selection: a review, in: Rough Computing: Theories, Technologies and Applications, pp. 70–107, 2008.10.4018/978-1-59904-552-8.ch003Search in Google Scholar

[12] I. Kahanda and J. Neville, Using transactional information to predict link strength in online social networks, in: Proceedings of the 3rd International AAAI Conference on Weblogs and Social Media (ICWSM’09), pp. 74–81, 2009.Search in Google Scholar

[13] A. Khan, G. Saha, S. Dasgupta and S. K. Datta, Prediction of dominant genes responsible for lung adenocarcinoma using rough set theory, in: E-Health and Bioengineering Conference (EHB), pp. 24–26, November 2011.Search in Google Scholar

[14] L. P. Khoo, S. B. Tor and L. Y. Zhai, A rough-set-based approach for classification and rule induction, Int. J. Adv. Manuf. Technol.4 (1999), 438–444.10.1007/s001700050088Search in Google Scholar

[15] S. J. Lee, J. J. Ahn and K. J. Oh, Using rough set to support investment strategies of real-time trading in futures market, Appl. Intell.32 (2010), 364–377.10.1007/s10489-008-0150-ySearch in Google Scholar

[16] D. Liben-Nowell and J. Kleinberg, The link prediction problem for social networks, in: CIKM ’03: Proceedings of the Twelfth International Conference on Information and Knowledge Management, New York, pp. 556–559, 2004.10.1145/956863.956972Search in Google Scholar

[17] M. McPherson, L. Smith-Lovin and J. M. Cook, Birds of a feather: homophily in social networks, Annu. Rev. Sociol.27 (2001), 415–444.10.1146/annurev.soc.27.1.415Search in Google Scholar

[18] A. Øhrn, ROSETTA Technical Reference Manual, Knowledge Systems Group, Department of Computer and Information Science, NTNU, Trondheim, Norway, November 1999.Search in Google Scholar

[19] D. L. Olson and D. Delen, Advanced Data Mining Techniques, Springer Publishing Company Incorporated, 2008.Search in Google Scholar

[20] A. Patil, J. Gao and A. van de Rijt, Homophily based link prediction in friendship networks, in: INSNA Sunbelt’10: The International Sunbelt Social Networks Conference XXX, Trento, Italy, June 29–July 4, 2010.Search in Google Scholar

[21] Z. Pawlak, Rough sets, Int. J. Comput. Inf. Sci.11 (1982), 341–346.10.1007/BF01001956Search in Google Scholar

[22] M. Z. Rashad, A rough-neuro model for classifying opponent behavior in real-time strategy games, Int. J. Comput. Sci. Inf. Technol. (IJCSIT)4 (2012), 185–196.10.5121/ijcsit.2012.4515Search in Google Scholar

[23] A. Skowron and C. Rauszer. The discernibility matrices and functions in information systems, Theor. Decis. Lib.11 (1992), 331–362.10.1007/978-94-015-7975-9_21Search in Google Scholar

[24] M. Steurer and C. Trattner, Predicting interactions in online social networks: an experiment in second life, in: MSM’13, Paris, France, May 1, 2013.10.1145/2463656.2463661Search in Google Scholar

Received: 2014-2-24

Published Online: 2015-1-10

Published in Print: 2015-12-1

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.