Abstract
Efficient spreading of important information through social media can be highly beneficial, while quick spreading of false content is alarming. Finding the users who are the most influential at information spreading can help develop efficient strategies. However, with the increasing growth of gigantic social networks, existing methods either lack accuracy or have high latency, sometimes being infeasible within limited memory. In this study, we find that rich user-specific information can guide us toward designing more effective methods. We propose UACD, a novel method for identifying the most influential spreaders on the Twitter social network by combining both user-specific and topological information. We provide a distributed implementation of our proposed algorithm on the Amazon EC2 and compare our ranking result with the state-of-the-art methods. Results suggest that UACD is scalable and can process a very large network while being on average \(\mathbf {12.5}\%\) more accurate and \(\mathbf {175}{\times }\) faster.
Similar content being viewed by others
References
Agichtein E, Castillo C, Donato D, Gionis A, Mishne G (2008) Finding high-quality content in social media. In: Proceedings of the 2008 international conference on web search and data mining. ACM, pp 183–194
Ahajjam S, Badir H (2018) Identification of influential spreaders in complex networks using hybridrank algorithm. Sci Rep 8(1):11932
Akoglu H (2018) User’s guide to correlation coefficients. Turk J Emerg Med 18(3):91–93
Alvarez-Hamelin JI, Dall’Asta L, Barrat A, Vespignani A (2005) k-core decomposition: a tool for the visualization of large scale networks. arxiv:cs/0504107
Alvarez-Hamelin JI, Dall’Asta L, Barrat A, Vespignani A (2006) Large scale networks fingerprinting and visualization using the k-core decomposition. In: Advances in neural information processing systems, pp 41–50
Amazon EC (2006) Amazon. See https://aws.amazon.com/ec2/. Accessed 15 June 2018
Anderson RM, May RM, Anderson B (1992) Infectious diseases of humans: dynamics and control, vol 28. Wiley Online Library
Bakshy E, Rosenn I, Marlow C, Adamic L (2012) The role of social networks in information diffusion. In: Proceedings of the 21st international conference on World Wide Web. ACM, pp 519–528
Batagelj V, Zaveršnik M (2011) Fast algorithms for determining (generalized) core groups in social networks. Adv Data Anal Classif 5(2):129–145
Bayer JB, Ellison NB, Schoenebeck SY, Falk EB (2016) Sharing the small moments: ephemeral social interaction on snapchat. Inf Commun Soc 19(7):956–977
Benevenuto F, Magno G, Rodrigues T, Almeida V (2010) Detecting spammers on twitter. In: Collaboration, electronic messaging, anti-abuse and spam conference (CEAS), vol 6, pp 12
Bhatia V, Rani R (2017) A parallel fuzzy clustering algorithm for large graphs using Pregel. Expert Syst Appl 78:135–144
Borgatti SP (1995) Centrality and aids. Connections 18(1):112–114
Brandes U (2001) A faster algorithm for betweenness centrality. J Math Sociol 25(2):163–177
Burgess JE (2011) Youtube. Oxford Bibliographies Online
Carmi S, Havlin S, Kirkpatrick S, Shavitt Y, Shir E (2007) A model of internet topology using k-shell decomposition. Proc Natl Acad Sci 104(27):11150–11154
Castillo C, Mendoza M, Poblete B (2011) Information credibility on twitter. In: Proceedings of the 20th international conference on World wide web, pp 675–684
Cataldi M, Di Caro L, Schifanella C (2010) Emerging topic detection on twitter based on temporal and social terms evaluation. In: Proceedings of the tenth international workshop on multimedia data mining. ACM, p 4
Chan HK, Wang X, Lacka E, Zhang M (2016) A mixed-method approach to extracting the value of social media data. Prod Oper Manag 25(3):568–583
Chen W, Lakshmanan LVS, Castillo C (2013) Information and influence propagation in social networks. Synth Lect Data Manag 5(4):1–177
Chen X, Vorvoreanu M, Madhavan K (2014) Mining social media data for understanding students’ learning experiences. IEEE Trans Learn Technol 7(3):246–259
Cohen R, Havlin S, Avraham D (2003) Efficient immunization strategies for computer networks and populations. Phys Rev Lett 91:247901
Data Science Bootcamp. Understand Jaccard index, Jaccard similarity in minutes. Online; Accessed 17 Oct 2020
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Disney A (2020) Social network analysis 101: centrality measures explained. Online; Accessed 17 Oct 2020
Doerr B, Fouz M, Friedrich T (2012) Why rumors spread so quickly in social networks. Commun ACM 55(6):70–75
Dorogovtsev SN, Goltsev AV, Mendes JFF (2006) K-core organization of complex networks. Phys Rev Lett 96(4):040601
Driss OB, Mellouli S, Trabelsi Z (2019) From citizens to government policy-makers: social media data analysis. Gov Inf Q 36(3):560–570
Farahat A, LoFaro T, Miller JC, Rae G, Ward LA (2006) Authority rankings from hits, pagerank, and salsa: existence, uniqueness, and effect of initialization. SIAM J Sci Comput 27(4):1181–1201
Freeman LC (1977) A set of measures of centrality based on betweenness. Sociometry 40:35–41
Fu Y-H, Huang C-Y, Sun C-T (2015a) Identifying super-spreader nodes in complex networks. Math Probl Eng 2015: 0-0
Fu K, Nune R, Tao JX (2015b) Social media data analysis for traffic incident detection and management. Technical report
Garton L, Haythornthwaite C, Wellman B (1997) Studying online social networks. J Comput Mediat Commun 3(1):JCMC313
Go A, Bhayani R, Huang L (2009) Twitter sentiment classification using distant supervision. CS224N project report, Stanford, vol 1, no 12
Guille A, Hacid H, Favre C, Zighed DA (2013) Information diffusion in online social networks: a survey. ACM SIGMOD Rec 42(2):17–28
Hagberg A, Swart P, Schult D (2019) NetworkX: a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. Online; Accessed 07 Dec 2019
Han M, Daudjee K, Ammar K, Özsu MT, Wang X, Jin T (2014) An experimental comparison of Pregel-like graph processing systems. Proc VLDB Endow 7(12):1047–1058
Heesterbeek JAP (2000) Mathematical epidemiology of infectious diseases: model building, analysis and interpretation, vol 5. Wiley, New York
Hodas NO, Lerman K (2014) The simple rules of social contagion. Sci Rep 4:4343
Hopcroft J, Lou T, Tang J (2011) Who will follow you back?: reciprocal relationship prediction. In: Proceedings of the 20th ACM international conference on information and knowledge management. ACM, pp 1137–1146
Hu Y, Manikonda L, Kambhampati S (2014) What we Instagram: a first analysis of Instagram photo content and user types. In: Proceedings of the international AAAI conference on web and social media, vol 8
Iyer KV et al (2009) All-pairs shortest-paths problem for unweighted graphs in o (n2 log n) time. Int J Comput Inf Eng 3(2):320–326
Järvelin K, Kekäläinen J (2002) Cumulated gain-based evaluation of IR techniques. ACM Trans Inf Syst (TOIS) 20(4):422–446
Keeling MJ, Rohani P (2008) Modeling infectious diseases in humans and animals. Princeton University Press, Princeton
Khaouid W, Barsky M, Srinivasan V, Thomo A (2015) K-core decomposition of large networks on a single pc. Proc VLDB Endow 9(1):13–23
Kitsak M, Gallos LK, Havlin S, Liljeros F, Muchnik L, Stanley HE, Makse HA (2010) Identification of influential spreaders in complex networks. Nat Phys 6(11):888–893
Klemm K, Serrano MÁ, Eguíluz VM, Miguel MS (2012) A measure of individual role in collective dynamics. Sci Rep 2:292
Kwak H, Lee C, Park H, Moon S (2010) What is Twitter, a social network or a news media? In: WWW ’10: proceedings of the 19th international conference on world wide web. ACM, New York, pp 591–600
Leskovec J, Adamic LA, Huberman BA (2007) The dynamics of viral marketing. ACM Trans Web 1(1):5-es
Li Q, Zhou T, Lü L, Chen D (2014) Identifying influential spreaders by weighted leaderrank. Physica A 404:47–55
Liu Y, Wu Y-FB (2018) Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks. In: Thirty-second AAAI conference on artificial intelligence
Liu Y, Tang M, Zhou T, Do Y (2015) Improving the accuracy of the k-shell method by removing redundant links: from a perspective of spreading dynamics. Sci Rep 5:13172
Liu J-G, Lin J-H, Guo Q, Zhou T (2016) Locating influential nodes via dynamics-sensitive centrality. Sci Rep 6:21380
Lou T, Tang J, Hopcroft J, Fang Z, Ding X (2013) Learning to predict reciprocity and triadic closure in social networks. ACM Trans Knowl Discov from Data (TKDD) 7(2):5
Lü L, Zhang Y-C, Yeung CH, Zhou T (2011) Leaders in social networks, the delicious case. PLoS ONE 6(6):e21202
Lü L, Medo M, Yeung CH, Zhang Y-C, Zhang Z-K, Zhou T (2012) Recommender systems. Phys Rep 519(1):1–49
Mahajan V (2010) Innovation diffusion. In: Wiley international encyclopedia of marketing (Part 1. Marketing Strategy). Wiley Online Library
Makice K (2009) Twitter API: up and running: learn how to build applications with the Twitter API. O’Reilly Media, Inc., New York
Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146
Martella C (2012) Apache giraph: distributed graph processing in the cloud
Martella C, Shaposhnik R, Logothetis D, Harenberg S (2015) Practical graph analytics with Apache Giraph. Springer, Berlin
Martin N (2020) How social media has changed how we consume news. Online; Accessed 15 June 2020
Massie ML, Chun BN, Culler DE (2004) The ganglia distributed monitoring system: design, implementation, and experience. Parallel Comput 30(7):817–840
Miller JC, Ting T (2020) Eon (epidemics on networks): a fast, flexible python package for simulation, analytic approximation, and analysis of epidemics on networks. arXiv:2001.02436
Montresor A, De Pellegrini F, Miorandi D (2013) Distributed k-core decomposition. IEEE Trans Parallel Distrib Syst 24(2):288–300
Nadkarni A, Hofmann SG (2012) Why do people use Facebook? Personal Individ Differ 52(3):243–249
Newman MEJ (2005) A measure of betweenness centrality based on random walks. Soc Netw 27(1):39–54
Newman MEJ (2010) Networks: an introduction. Oxford University Press, Oxford
Newman MEJ, Watts DJ, Strogatz SH (2002) Random graph models of social networks. Proc Natl Acad Sci 99(suppl 1):2566–2572
Noether GE (1981) Why Kendall tau? Teach Stat 3(2):41–43
Okamoto K, Chen W, Li X-Y (2008) Ranking of closeness centrality for large-scale social networks. In: International workshop on frontiers in algorithmics. Springer, pp 186–195
Opsahl T, Agneessens F, Skvoretz J (2010) Node centrality in weighted networks: generalizing degree and shortest paths. Soc Netw 32(3):245–251
Page L, Brin S, Motwani R, Winograd T (1999) The pagerank citation ranking: bringing order to the web. Technical report 1999-66, Stanford InfoLab, November 1999. Previous number = SIDL-WP-1999-0120
Pal A, Counts S (2011) Identifying topical authorities in microblogs. In: Proceedings of the fourth ACM international conference on web search and data mining. ACM, pp 45–54
Pastor-Satorras R, Vespignani A (2001) Epidemic spreading in scale-free networks. Phys Rev Lett 86(14):3200
Pastor-Satorras R, Vespignani A (2002) Immunization of complex networks. Phys Rev E 65(3):036104
Romero DM, Galuba W, Asur S, Huberman BA (2011) Influence and passivity in social media. In: Joint European conference on machine learning and knowledge discovery in databases. Springer, pp 18–33
Sabidussi G (1966) The centrality index of a graph. Psychometrika 31(4):581–603
Seidman SB (1983) Network structure and minimum degree. Soc Netw 5(3):269–287
Shang S, Hwang K (1995) Distributed hardwired barrier synchronization for scalable multiprocessor clusters. IEEE Trans Parallel Distrib Syst 6(6):591–605
Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: 2010 IEEE 26th symposium on mass storage systems and technologies (MSST). IEEE, pp 1–10
Smith D, Moore L, et al (2004) The SIR model for spread of disease: the differential equation model. Loci. (originally Convergence.) https://www.maa.org/press/periodicals/loci/joma/the-sir-model-for-spread-of-disease-the-differential-equation-model
Thomo A, Liu F (2017) Computation of k-core decomposition on giraph. arXiv:1705.03603
Twitter Developer Team. Twitter API. Online; Accessed 1 Mar 2021
Wang Z, Zhao Y, Xi J, Changjiang D (2016) Fast ranking influential nodes in complex networks using a k-shell iteration factor. Physica A 461:171–181
Weng J, Lim E-P, Jiang J, He Q (2010) Twitterrank: finding topic-sensitive influential twitterers. In: Proceedings of the third ACM international conference on web search and data mining. ACM, pp 261–270
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A Technical Background
Appendix A Technical Background
In recent years, the number of users of social network sites have been increased by a large scale. People now can publish their statuses and get connected with other users which we can refer to as social relationship (Garton et al. 1997). As we have already discussed in the earlier section, we can formally represent an online social network using a graph, where each user is represented by a node of the graph and the social relationships among them are represented by edges (directed or undirected) (Newman et al. 2002). In this graph, the nodes play an important role to disseminate information. This knowledge of node spreadability is very significant while it comes to develop an efficient method of accelerating the spreading in the case of information diffusion (Mahajan 2010) as well as decelerate it in the case of diseases (Heesterbeek 2000; Keeling and Rohani 2008). Therefore, in recent years, the microscopic study of spreadability for each node has caught attention of the researchers. Moreover, it can be very useful in case of finding out the initial spreaders of a contagious disease (Anderson et al. 1992) or any information (Fu et al. 2015a). In this section, we define some terminologies related to our work and then we provide a brief description of the existing techniques of finding the most influential spreaders in a network. Finally, we discuss the SIR model (Smith and Moore 2004) which we use to evaluate the merit of our proposed technique.
1.1 Centrality measurement
In graph theory, centrality is a term to describe the importance of an individual vertex within a graph or a network. Centrality measures (Disney 2020) were initially developed for social network analysis since it helps answering the question, “Which vertices are the most influential in a graph?” The popular centrality measures for such analysis can be divided into two types: local centrality measures and global centrality measures, briefly described below.
1.1.1 Local centrality measures
To compute the centrality of a node, local measures generally use the information of a node and its neighborhood only. The number of neighbors (degree of a node) plays the main role in such local measures and they are more suitable for networks modeled as undirected graphs. The two most popular local centrality measures are described below:
(i) Degree centrality Among all the centrality measures, Degree centrality (Disney 2020) is the simplest one. It assumes the nodes with maximum number of neighbors to be the most influential in the network.
If, G(V, E) is a graph with the set of vertices, V and the set of edges, E, then the degree of a vertex v can be denoted by \(d_v\), and is defined as the total number of edges incident to it (i.e., the number of neighbors of v). Now, if the number of vertices, |V| is n, then the degree centrality of node v is given as follows:
Here, \((n-1)\) is used to normalize the value of degree centrality within 0 and 1. The most important reason for using degree centrality for finding the most influential spreaders is its simplicity and low computational complexity. However, this measure typically fails to identify the most influential spreaders accurately. Despite this, there are several cases where degree centrality can provide surprisingly good performance. For example, if the spreading rate is very small, degree centrality is reported to perform better in finding the spreadability of nodes than other well-known centrality measures (Klemm et al. 2012; Liu et al. 2016).
(ii) K-core decomposition In case of degree centrality, the number of neighbors is solely responsible in determining the influence of any node in the network. Later Kitsak et al. (2010) determined that the location of the nodes in the network can impact more significantly in their spreadability. They identified that nodes located at the center of the network possess higher probability to be the most significant spreaders in a network than the nodes located at the perimeter of the network. To sum up, they suggested that the core number of a node should be considered as a more suitable measure in order to identifying the most influential spreaders of the network, and this core number can be determined by the k-core decomposition (Dorogovtsev et al. 2006; Alvarez-Hamelin et al. 2005) of the network.
At the very beginning of the k-core decomposition method, all the nodes with degree 1 are removed. This process is recursively continued until no node with degree 1 is remaining in the network. These removed nodes are grouped together to form 1-core. After that, in a similar way, all the nodes with residual degree 2 are removed recursively until no node with degree 2 is remaining in the network. All these removed node are then grouped together to form 2-core. This process is continued until all the nodes are assigned to some core group. In this conventional k-core decomposition method, nodes having the largest core number are considered to be located at the center of the network and they are assumed to have the most spreadability. Figure 7 presents a simple network and the core number for the nodes.
The drawback of k-core decomposition method is that it has a tendency to assign the same core number (k value) to multiple nodes in case of large networks. Therefore, we may end up having a huge number of nodes as the most influential spreaders, which may not be a desired outcome in many cases. However, the simplicity and lower computational complexity make this measure very useful. We suggest that incorporating user-specific information while computing the core decomposition helps to find the influential spreaders more accurately. Based on this, we propose User Attributed Core Decomposition (UACD) method and show that our proposed method significantly improves the accuracy without incurring noticeable computational overhead.
1.1.2 Global centrality measures
Global measures consider the whole network topology while computing the centrality of the nodes. Some of the most popular global centrality measures are briefly described below:
(i) Closeness centrality Closeness centrality (Disney 2020) is a measurement which identifies the closeness of a node from all the other nodes in the network. If the network can be represented by a connected graph, then the normalized version of closeness centrality of any node u of the graph is calculated as the average of all the shortest paths between u and all other nodes of the network.
Let \(l_{ij}\) be the length of shortest path between any two nodes (\(v_i\), \(v_j\)), and n be the number of nodes in the network, then the average shortest distance of node \(v_i\) from all other nodes, i.e., closeness centrality can be defined as (Sabidussi 1966),
The closeness centrality of node \(v_i\) can be thought as the inverse of the average shortest distance, \(L_i\) and defined as follows:
However, this equation will not be suitable to use for a disconnected graph where some nodes may be unreachable from a node \(v_i\). For graphs with multiple connected components, Wasserman and Faust (Opsahl et al. 2010) revise the definition of closeness centrality. The closeness centrality of node \(v_i\) is now measured as the ratio of “the fraction of the nodes in the network which are reachable from \(v_i\), ” to the “average shortest distance of \(v_i\) from the reachable nodes.” Let \(n_i\) be the number of reachable nodes in the network from node \(v_i\), then the modified formula of measuring closeness centrality is given as follows:
(ii) Betweenness centrality Betweenness centrality (Disney 2020) determines how many times a node falls along the shortest path between two different nodes (i.e., acts as a bridge between those two nodes). Freeman (1977) introduces this measure for quantifying the control of a user on the communication between other users in a social network.
Let \(v_s\), \(v_f\), and \(v_i\) are three different nodes in a network represented by G(V, E). Now we define \(n_{sf}^i = 1\) if node \(v_i\) lies on the shortest path between \(v_s\) and \(v_f\), and 0 otherwise. The betweenness centrality of node \(v_i\) is defined as:
However, there can be more than one shortest path between \(v_s\) and \(v_f\) and that may count a node for centrality measure more than once. For this reason, if total number of shortest paths between \(v_s\) and \(v_f\) is \(g_{sf}\), the equation for finding betweenness centrality of node \(v_i\) is updated as follows:
(iii) Eigenvector centrality Eigenvector centrality (Disney 2020) computes the centrality of a node in a network based on the centrality of its neighbors. The weights of the neighbors are assigned in such a way that a high scoring neighbor contributes more to the centrality of the node. Assume that, a graph G(V, E) is represented by an adjacency matrix \(A = \{a_{ij}\}\), where \(\{a_{ij}\} = 1\) if nodes \(v_i\) and \(v_j\) are neighbors and 0 otherwise.
If \(\lambda\) is the eigenvalue of graph G, the eigenvector centrality for node \(v_i\) is the \(i^{th}\) element of the vector \(\mathbf {x}\) defined by the equation:
There can be multiple eigenvalues, i.e., multiple values of \(\lambda\) and for which multiple solutions can be found. However, for the largest eigenvalue of the adjacency matrix A, according to the Perron–Frobenius theorem, there exists a unique solution of x which contains all positive entries (Newman 2010).
1.2 Susceptible-infected-recovered (SIR) model
The SIR model (Smith and Moore 2004) is one of the widely used models in epidemiology. In an SIR model, a node in the network can be at one of the three states: Susceptible, Infected, and Recovered. For better understanding, we explain these states using “people” instead of “nodes.”
-
S (Susceptible): The group of people who have not been infected with the disease yet. Additionally, they are not immune to the disease, and therefore, they are under the threat of being infected in the future.
-
I (Infected): The group of people who have already been infected with the disease. Moreover, they can transmit the disease to the susceptible neighbors with a probability of \(\beta\).
-
R (Recovered): The group of people who have either recovered from the disease or dead. The recovered people are immune to the disease and no longer can transmit the disease to the susceptible neighbors.
The SIR model is capable of finding the spreadability of a node accurately by considering the node as the only infected node at the beginning, running the simulation multiple times, and considering the average. If the network contains n nodes, one needs to simulate the entire network n times (with multiple runs) to find the spreadability of all the nodes. Although this is the most accurate method of finding the most influential spreaders, it is not realizable for a large network due to its high computational cost.
Rights and permissions
About this article
Cite this article
Adnan, T.M.T., Islam, M.S., Papon, T.I. et al. UACD: A Local Approach for Identifying the Most Influential Spreaders in Twitter in a Distributed Environment. Soc. Netw. Anal. Min. 12, 37 (2022). https://doi.org/10.1007/s13278-022-00862-3
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s13278-022-00862-3