Abstract
Preferential attachment is an appealing edge generating mechanism for modeling social networks. It provides both an intuitive description of network growth and an explanation for the observed power laws in degree distributions. However, there are often difficulties fitting parametric network models to data due to either model error or data corruption. In this paper, we consider semi-parametric estimation based on an extreme value approach that begins by estimating tail indices of the power laws of in- and out-degree for the nodes of the network using nodes with large in- and out-degree. This method uses tail behavior of both the marginal and joint degree distributions. We compare the extreme value method with the existing parametric approaches and demonstrate how it can provide more robust estimates of parameters associated with the network when the data are corrupted or when the model is misspecified.
Similar content being viewed by others
References
Bhamidi, S.: Universal techniques to analyze preferential attachment trees: global and local analysis. available: http://www.unc.edu/bhamidi/preferent.pdf. Preprint (2007)
Bhamidi, S., Steele, J.M., Zaman, T.: Twitter event networks and the superstar model. Ann. Appl. Probab. 10(5), 2462–2502 (2015)
Bollobás, B., Borgs, C., Chayes, J., Riordan, O.: Directed scale-free graphs. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (Baltimore, 2003), pp 132–139. ACM, New York (2003)
Chandler, R.E., Bate, S.: Inference for clustered data using the independence log- likelihood. Biometrika 95, 167–183 (2007)
Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical data. SIAM Rev. 51(4), 661–703 (2009)
Coles, S.G.: An introduction to statistical modeling of extreme values. Springer Series in Statistics, p xiv 210. Springer, London (2001)
Das, B., Mitra, A., Resnick, S.: Living on the multi-dimensional edge: seeking hidden risks using regular variation. Adv. Appl. Probab. 45(1), 139–163 (2013)
de Haan, L., Ferreira, A.: Extreme value theory: an introduction. Springer, New York (2006)
Drees, H., Janßen, A., Resnick, S.I., Wang, T.: On a minimum distance procedure for threshold selection in tail analysis. ArXiv e-prints. Submitted (2018)
Durrett, R.T.: Random graph dynamics. Cambridge series in statistical and probabilistic mathematics. Cambridge University Press, Cambridge (2010)
Easley, D., Kleinberg, J.: Networks, crowds, and markets. Cambridge University Press, Cambridge (2010)
Gao, F., van der Vaart, A.: On the asymptotic normality of estimating the affine preferential attachment network models with random initial degrees. Stochastic Process Appl. 127(11), 3754–3775 (2017)
Gillespie, C.S.: Fitting heavy tailed distributions: the poweRlaw package. J. Stat. Softw. 64(2), 1–16 (2015)
Hill, B.M.: A simple general approach to inference about the tail of a distribution. Ann. Statist. 3, 1163–1174 (1975)
Hult, H., Lindskog, F.: Regular variation for measures on metric spaces. Publ. Inst Math. (Beograd) (N.S.) 80(94), 121–140 (2006)
Kolaczyk, E.D., Csárdi, G.: Statistical analysis of network data with R. Use R!. Springer, New York (2014)
Krapivsky, P., Rodgers, G., Redner, S.: Degree distributions of growing networks. Phys. Rev. Lett 86, 5401–5404 (2001)
Krapivsky, P.L., Redner, S.: Organization of growing random networks. Physical Review E 63(6), 066123:1–14 (2001)
Kunegis, J.: Konect: the Koblenz network collection. In: Proceedings of the 22nd International Conference on World Wide Web, pp 1343–1350. ACM (2013)
Lindskog, F., Resnick, S.I., Roy, J.: Regularly varying measures on metric spaces: hidden regular variation and hidden jumps. Probab. Surv. 11, 270–314 (2014)
Resnick, S.I.: Heavy-tail phenomena: probabilistic and statistical modeling. Springer series in operations research and financial engineering. Springer, New York (2007). ISBN: 0-387-24272-4
Resnick, S.I., Samorodnitsky, G.: Tauberian theory for multivariate regularly varying distributions with application to preferential attachment networks. Extremes 18(3), 349–367 (2015)
Samorodnitsky, G., Resnick, S., Towsley, D., Davis, R., Willis, A., Wan, P.: Nonstandard regular variation of in-degree and out-degree in the preferential attachment model. J. Appl. Probab. 53(1), 146–161 (2016)
van der Hofstad, R.: Random graphs and complex networks, vol. 1. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge (2017)
Varin, C., Reid, N., Firth, D.: An overview of composite likelihood methods Statist. Sinica 21, 5–42 (2011)
Wan, P., Wang, T., Davis, R.A., Resnick, S.I.: Fitting the linear preferential attachment model. Electron. J Statist. 11(2), 3738–3780 (2017)
Wang, T., Resnick, S.I.: Multivariate regular variation of discrete mass functions with applications to preferential attachment networks. Methodol. Comput. Appl. Probab. 20(3), 1029–104 (2018)
Wang, T., Resnick, S.I.: Consistency of Hill estimators in a linear preferential attachment model extremes. https://doi.org/10.1007/s10687-018-0335-7 (2018)
Wang, T., Resnick, S.I.: Degree growth rates and index estimation in a directed preferential attachment model. ArXiv e-prints. Submitted (2018)
Acknowledgments
Research of the four authors was partially supported by Army MURI grant W911NF-12-1-0385 to Cornell University.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Parameter estimation for linear PA model
Parameter estimation for the linear PA model was studied in Wan et al. (2017). If the complete history of the network evolution is available (i.e., timestamps of edge creation are known), then MLE estimates exist and are computable. On the other hand, if only a snapshot of the network is given at a single point in time (i.e., timestamp information for the creation of the edges is unavailable), an approximate MLE was proposed. This procedure combined elements of method of moments with an approximation to the likelihood. In the following we provide a brief summary of these two estimation methods. Asymptotic properties of these estimators can be found in Wan et al. (2017).
1.1 A.1 MLE
Given the full evolution of the network G(n), assuming the graph began with n0 initial edges, the MLE estimator of 𝜃 = (α,β,γ,δin,δout),
is obtained by setting
and solving for \((\hat {\delta }_{\text {in}}^{MLE}, \hat {\delta }_{\text {out}}^{MLE})\) from
where
By Wan et al. (2017, Theorem 3.3), \(\widehat {\boldsymbol {\theta }}_{n}^{MLE}\) is strongly consistent, asymptotically normal and efficient.
1.2 A.2 Snapshot
The estimation method for 𝜃 from the snapshot G(n) is summarized in the following 7-step procedure:
- 1.
Estimate β by \(\hat {\beta }^{SN}=1-N(n)/n\).
- 2.
Obtain \(\hat {\delta }_{\text {in}}^{0}\) by solving
$$ \sum\limits_{i=1}^{\infty} \frac{N^{\text{in}}_{>i}(n)}{n}\frac{i}{i+\hat{\delta}_{\text{in}}^{0}}(1+\hat{\delta}_{\text{in}}^{0}(1-\hat{\beta}^{SN})) =\frac{\frac{N^{\text{in}}_{0}(n)}{n} + \hat{\beta}^{SN} }{1-\frac{N^{\text{in}}_{0}(n)}{n} \frac{\hat{\delta}_{\text{in}}^{0}}{1+(1-\hat{\beta}^{SN})\hat{\delta}_{\text{in}}^{0}}}, $$where \(N^{\text {in}}_{0}(n)\) denotes the number of nodes with in-degree 0 in G(n).
- 3.
Estimate α by
$$ \hat\alpha^{0} = \frac{\frac{N^{\text{in}}_{0}(n)}{n} +\hat{\beta}^{SN}}{1-\frac{N^{\text{in}}_{0}(n)}{n} \frac{\hat{\delta}_{\text{in}}^{0}}{1+(1-\hat{\beta}^{SN})\hat{\delta}_{\text{in}}^{0}}} - \hat{\beta}^{SN}. $$ - 4.
Obtain \(\hat {\delta }_{\text {out}}^{0}\) by solving
$$ \sum\limits_{j=1}^{\infty} \frac{N^{\text{out}}_{>j}(n)}{n}\frac{j}{j+\hat{\delta}_{\text{out}}^{0}}(1+\hat{\delta}_{\text{out}}^{0}(1-\hat{\beta}^{SN})) = \frac{\frac{N^{\text{out}}_{0}(n)}{n} + \hat{\beta}^{SN} }{1-\frac{N^{\text{out}}_{0}(n)}{n} \frac{\hat{\delta}_{\text{out}}^{0}}{1+(1-\hat{\beta}^{SN})\hat{\delta}_{\text{out}}^{0}}}, $$where \(N^{\text {out}}_{0}(n)\) denotes the number of nodes with out-degree 0 in G(n).
- 5.
Estimate γ by
$$ \hat\gamma^{0} = \frac{\frac{N^{\text{out}}_{0}(n)}{n} + \hat{\beta}^{SN}}{1-\frac{N^{\text{out}}_{0}(n)}{n} \frac{\hat{\delta}_{\text{out}}^{0}}{1+(1-\hat{\beta}^{SN})\hat{\delta}_{\text{out}}^{0}}} - \hat{\beta}^{SN}. $$ - 6.
Re-normalize the probabilities
$$ (\hat\alpha^{SN},\hat{\beta}^{SN},\hat\gamma^{SN}) \leftarrow \left( \frac{\hat\alpha^{0}(1-\hat{\beta}^{SN})}{\hat\alpha^{0}+\hat\gamma^{0}},\hat{\beta}^{SN},\frac{\hat\gamma^{0}(1-\hat{\beta}^{SN})}{\hat\alpha^{0}+\hat\gamma^{0}}\right). $$ - 7.
Solve for \(\hat {\delta }_{\text {in}}^{SN}\) from
$$ \sum\limits_{i=0}^{\infty} \frac{N^{\text{in}}_{>i}(n)/n}{i+\hat{\delta}_{\text{in}}^{SN}}-\frac{1-\hat\alpha^{SN}-\hat{\beta}^{SN}}{\hat{\delta}_{\text{in}}^{SN}} -\frac{(\hat\alpha^{SN}+\hat{\beta}^{SN})(1-\hat{\beta}^{SN})}{1+(1-\hat{\beta}^{SN})\hat{\delta}_{\text{in}}^{SN}} = 0. $$Similarly, solve for \(\hat {\delta }_{\text {out}}^{SN}\) from
$$ \sum\limits_{j=0}^{\infty} \frac{N^{\text{out}}_{>j}(n)/n}{j+\hat{\delta}_{\text{out}}^{SN}}-\frac{1-\hat\gamma^{SN}-\hat{\beta}^{SN}}{\hat{\delta}_{\text{out}}^{SN}} -\frac{(\hat\gamma^{SN}+\hat{\beta}^{SN})(1-\hat{\beta}^{SN})}{1+(1-\hat{\beta}^{SN})\hat{\delta}_{\text{out}}^{SN}} = 0. $$
Note that Step 6 ensures that
It is shown in Wan et al.(2017, Theorem 4.1) that \(\widehat {\boldsymbol \theta }^{SN}_{n} := (\hat \alpha ^{SN},\hat \beta ^{SN},\hat \gamma ^{SN}, \hat {\delta }_{\text {in}}^{SN}, \hat {\delta }_{\text {out}}^{SN})\overset {\text {a.s.}}{\longrightarrow } \boldsymbol \theta \). Its asymptotic normality and efficiency are analyzed through simulation studies in the same paper.
Appendix B: Proof of theorem 2.1
Proof
We first prove the out-degree part of Theorem 2.1. Note that
Meanwhile, by the definition of V0(n), we have
Applying the arguments in the proof of Theorem 3.1 of Bollobás et al. (2003), it follows that the out-degree distribution of a linear superstar model coincides with that of a standard linear preferential attachment network with parameters (α,β,γ,δin,δout). Moreover,
where {qj out} := {pj out} is the limiting out-degree distribution of PA(α,β,γ,δin,δout). In particular,
for \(C_{\text {out}}^{\prime }\) positive and
Next we consider the in-degree counts of non-superstar nodes. Observe also from the construction of the superstar model that
Applying the Chernoff bound to both Eqs. B.2 and B.3 gives
Taking expectation on both sides of Eq. B.1 then gives
By the rule of the superstar model, given G(n), \(N^{\text {in}}_{i}(n)\) will increase by 1 if either scenario (1b) or (2b) happens and a node with \(I_{n}^{(n)}(v) = i-1\) is chosen as the ending point of the edge. Also, it will decrease by 1 if either scenario (1b) or (2b) happens, but a node with \(I_{n}^{(n)}(v) = i\) is chosen as the ending point of the edge. Moreover, with probability α a new node with in-degree 0 will be added to the graph, and with probability γ a new node with in-degree 1 is created in the next step. Hence, \(\{N^{\text {in}}_{i}(n)\}_{n\ge 1}\) satisfies the following:
Now let \(q^{\text {in}}_{-1} = 0\), and define \(\{q^{\text {in}}_{i}\}_{i\ge 0}\) by
According to the approximation in Eq. B.4, we use the same proof technique as in Bollobás et al. (2003, Theorem 3.1) to obtain
Also, solving the recursion in Eq. B.5 yields
where
Therefore, applying Stirling’s approximation to Eqs. B.6–B.7 gives
for some positive constant \(C^{\prime }_{\text {in}}\). This completes the proof. □
Rights and permissions
About this article
Cite this article
Wan, P., Wang, T., Davis, R.A. et al. Are extreme value estimation methods useful for network data?. Extremes 23, 171–195 (2020). https://doi.org/10.1007/s10687-019-00359-x
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10687-019-00359-x