[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ Skip to main content
Log in

Are extreme value estimation methods useful for network data?

  • Published:
Extremes Aims and scope Submit manuscript

Abstract

Preferential attachment is an appealing edge generating mechanism for modeling social networks. It provides both an intuitive description of network growth and an explanation for the observed power laws in degree distributions. However, there are often difficulties fitting parametric network models to data due to either model error or data corruption. In this paper, we consider semi-parametric estimation based on an extreme value approach that begins by estimating tail indices of the power laws of in- and out-degree for the nodes of the network using nodes with large in- and out-degree. This method uses tail behavior of both the marginal and joint degree distributions. We compare the extreme value method with the existing parametric approaches and demonstrate how it can provide more robust estimates of parameters associated with the network when the data are corrupted or when the model is misspecified.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
£29.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Similar content being viewed by others

References

  • Bhamidi, S.: Universal techniques to analyze preferential attachment trees: global and local analysis. available: http://www.unc.edu/bhamidi/preferent.pdf. Preprint (2007)

  • Bhamidi, S., Steele, J.M., Zaman, T.: Twitter event networks and the superstar model. Ann. Appl. Probab. 10(5), 2462–2502 (2015)

    Article  MathSciNet  Google Scholar 

  • Bollobás, B., Borgs, C., Chayes, J., Riordan, O.: Directed scale-free graphs. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (Baltimore, 2003), pp 132–139. ACM, New York (2003)

  • Chandler, R.E., Bate, S.: Inference for clustered data using the independence log- likelihood. Biometrika 95, 167–183 (2007)

    Article  Google Scholar 

  • Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical data. SIAM Rev. 51(4), 661–703 (2009)

    Article  MathSciNet  Google Scholar 

  • Coles, S.G.: An introduction to statistical modeling of extreme values. Springer Series in Statistics, p xiv 210. Springer, London (2001)

    MATH  Google Scholar 

  • Das, B., Mitra, A., Resnick, S.: Living on the multi-dimensional edge: seeking hidden risks using regular variation. Adv. Appl. Probab. 45(1), 139–163 (2013)

    Article  MathSciNet  Google Scholar 

  • de Haan, L., Ferreira, A.: Extreme value theory: an introduction. Springer, New York (2006)

    Book  Google Scholar 

  • Drees, H., Janßen, A., Resnick, S.I., Wang, T.: On a minimum distance procedure for threshold selection in tail analysis. ArXiv e-prints. Submitted (2018)

  • Durrett, R.T.: Random graph dynamics. Cambridge series in statistical and probabilistic mathematics. Cambridge University Press, Cambridge (2010)

    Google Scholar 

  • Easley, D., Kleinberg, J.: Networks, crowds, and markets. Cambridge University Press, Cambridge (2010)

    Book  Google Scholar 

  • Gao, F., van der Vaart, A.: On the asymptotic normality of estimating the affine preferential attachment network models with random initial degrees. Stochastic Process Appl. 127(11), 3754–3775 (2017)

    Article  MathSciNet  Google Scholar 

  • Gillespie, C.S.: Fitting heavy tailed distributions: the poweRlaw package. J. Stat. Softw. 64(2), 1–16 (2015)

    Article  Google Scholar 

  • Hill, B.M.: A simple general approach to inference about the tail of a distribution. Ann. Statist. 3, 1163–1174 (1975)

    Article  MathSciNet  Google Scholar 

  • Hult, H., Lindskog, F.: Regular variation for measures on metric spaces. Publ. Inst Math. (Beograd) (N.S.) 80(94), 121–140 (2006)

    Article  MathSciNet  Google Scholar 

  • Kolaczyk, E.D., Csárdi, G.: Statistical analysis of network data with R. Use R!. Springer, New York (2014)

    Book  Google Scholar 

  • Krapivsky, P., Rodgers, G., Redner, S.: Degree distributions of growing networks. Phys. Rev. Lett 86, 5401–5404 (2001)

    Article  Google Scholar 

  • Krapivsky, P.L., Redner, S.: Organization of growing random networks. Physical Review E 63(6), 066123:1–14 (2001)

    Article  Google Scholar 

  • Kunegis, J.: Konect: the Koblenz network collection. In: Proceedings of the 22nd International Conference on World Wide Web, pp 1343–1350. ACM (2013)

  • Lindskog, F., Resnick, S.I., Roy, J.: Regularly varying measures on metric spaces: hidden regular variation and hidden jumps. Probab. Surv. 11, 270–314 (2014)

    Article  MathSciNet  Google Scholar 

  • Resnick, S.I.: Heavy-tail phenomena: probabilistic and statistical modeling. Springer series in operations research and financial engineering. Springer, New York (2007). ISBN: 0-387-24272-4

    Google Scholar 

  • Resnick, S.I., Samorodnitsky, G.: Tauberian theory for multivariate regularly varying distributions with application to preferential attachment networks. Extremes 18(3), 349–367 (2015)

    Article  MathSciNet  Google Scholar 

  • Samorodnitsky, G., Resnick, S., Towsley, D., Davis, R., Willis, A., Wan, P.: Nonstandard regular variation of in-degree and out-degree in the preferential attachment model. J. Appl. Probab. 53(1), 146–161 (2016)

    Article  MathSciNet  Google Scholar 

  • van der Hofstad, R.: Random graphs and complex networks, vol. 1. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge (2017)

    Book  Google Scholar 

  • Varin, C., Reid, N., Firth, D.: An overview of composite likelihood methods Statist. Sinica 21, 5–42 (2011)

    MathSciNet  MATH  Google Scholar 

  • Wan, P., Wang, T., Davis, R.A., Resnick, S.I.: Fitting the linear preferential attachment model. Electron. J Statist. 11(2), 3738–3780 (2017)

    Article  MathSciNet  Google Scholar 

  • Wang, T., Resnick, S.I.: Multivariate regular variation of discrete mass functions with applications to preferential attachment networks. Methodol. Comput. Appl. Probab. 20(3), 1029–104 (2018)

    Article  MathSciNet  Google Scholar 

  • Wang, T., Resnick, S.I.: Consistency of Hill estimators in a linear preferential attachment model extremes. https://doi.org/10.1007/s10687-018-0335-7 (2018)

  • Wang, T., Resnick, S.I.: Degree growth rates and index estimation in a directed preferential attachment model. ArXiv e-prints. Submitted (2018)

Download references

Acknowledgments

Research of the four authors was partially supported by Army MURI grant W911NF-12-1-0385 to Cornell University.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Tiandong Wang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Parameter estimation for linear PA model

Parameter estimation for the linear PA model was studied in Wan et al. (2017). If the complete history of the network evolution is available (i.e., timestamps of edge creation are known), then MLE estimates exist and are computable. On the other hand, if only a snapshot of the network is given at a single point in time (i.e., timestamp information for the creation of the edges is unavailable), an approximate MLE was proposed. This procedure combined elements of method of moments with an approximation to the likelihood. In the following we provide a brief summary of these two estimation methods. Asymptotic properties of these estimators can be found in Wan et al. (2017).

1.1 A.1 MLE

Given the full evolution of the network G(n), assuming the graph began with n0 initial edges, the MLE estimator of 𝜃 = (α,β,γ,δin,δout),

$$ \widehat{\boldsymbol{\theta}}_{n}^{MLE}:=(\hat{\alpha}^{MLE}, \hat{\beta}^{MLE}, \hat{\gamma}^{MLE}, \hat{\delta}_{\text{in}}^{MLE},\hat{\delta}_{\text{out}}^{MLE}), $$

is obtained by setting

$$ \begin{array}{@{}rcl@{}} \hat{\alpha}^{MLE} &=& \frac{1}{n-n_{0}} \sum\limits_{t=n_{0}+1}^{n} \textbf{1}_{\{J_{t}=1\}},\\ \hat{\beta}^{MLE} &=& \frac{1}{n-n_{0}} \sum\limits_{t=n_{0}+1}^{n} \textbf{1}_{\{J_{t}=2\}},\\ \hat{\gamma}^{MLE} &=& 1- \hat{\alpha}^{MLE} - \hat{\beta}^{MLE}, \end{array} $$

and solving for \((\hat {\delta }_{\text {in}}^{MLE}, \hat {\delta }_{\text {out}}^{MLE})\) from

$$ \begin{array}{@{}rcl@{}} \sum\limits_{i=0}^{\infty} \frac{N^{\text{in}}_{>i}(n) - N^{\text{in}}_{>i}(n_{0})}{i+\hat{\delta}_{\text{in}}^{MLE}} &= \frac{n-n_{0}}{\hat{\delta}_{\text{in}}^{MLE}}\hat{\gamma}^{MLE} +\sum\limits_{t=n_{0}+1}^{n} \frac{N(t-1)}{t-1+\hat{\delta}_{\text{in}}^{MLE} N(t-1)}\textbf{1}_{\{J_{t}\in\lbrace 1, 2\rbrace\}},\\ \sum\limits_{j=0}^{\infty} \frac{N^{\text{out}}_{>j}(n) - N^{\text{out}}_{>j}(n_{0})}{j+\hat{\delta}_{\text{out}}^{MLE}} &= \frac{n-n_{0}}{\hat{\delta}_{\text{out}}^{MLE}}\hat{\alpha}^{MLE} +\sum\limits_{t=n_{0}+1}^{n} \frac{N(t-1)}{t-1+\hat{\delta}_{\text{out}}^{MLE} N(t-1)}\textbf{1}_{\{J_{t}\in\lbrace 2, 3\rbrace\}}, \end{array} $$

where

$$ N^{\text{in}}_{>i}(n) := \sum\limits_{i^{\prime}>i} N^{\text{in}}_{i^{\prime}}(n),\qquad N^{\text{out}}_{>j}(n) := \sum\limits_{j^{\prime}>j} N^{\text{out}}_{j^{\prime}}(n). $$

By Wan et al. (2017, Theorem 3.3), \(\widehat {\boldsymbol {\theta }}_{n}^{MLE}\) is strongly consistent, asymptotically normal and efficient.

1.2 A.2 Snapshot

The estimation method for 𝜃 from the snapshot G(n) is summarized in the following 7-step procedure:

  1. 1.

    Estimate β by \(\hat {\beta }^{SN}=1-N(n)/n\).

  2. 2.

    Obtain \(\hat {\delta }_{\text {in}}^{0}\) by solving

    $$ \sum\limits_{i=1}^{\infty} \frac{N^{\text{in}}_{>i}(n)}{n}\frac{i}{i+\hat{\delta}_{\text{in}}^{0}}(1+\hat{\delta}_{\text{in}}^{0}(1-\hat{\beta}^{SN})) =\frac{\frac{N^{\text{in}}_{0}(n)}{n} + \hat{\beta}^{SN} }{1-\frac{N^{\text{in}}_{0}(n)}{n} \frac{\hat{\delta}_{\text{in}}^{0}}{1+(1-\hat{\beta}^{SN})\hat{\delta}_{\text{in}}^{0}}}, $$

    where \(N^{\text {in}}_{0}(n)\) denotes the number of nodes with in-degree 0 in G(n).

  3. 3.

    Estimate α by

    $$ \hat\alpha^{0} = \frac{\frac{N^{\text{in}}_{0}(n)}{n} +\hat{\beta}^{SN}}{1-\frac{N^{\text{in}}_{0}(n)}{n} \frac{\hat{\delta}_{\text{in}}^{0}}{1+(1-\hat{\beta}^{SN})\hat{\delta}_{\text{in}}^{0}}} - \hat{\beta}^{SN}. $$
  4. 4.

    Obtain \(\hat {\delta }_{\text {out}}^{0}\) by solving

    $$ \sum\limits_{j=1}^{\infty} \frac{N^{\text{out}}_{>j}(n)}{n}\frac{j}{j+\hat{\delta}_{\text{out}}^{0}}(1+\hat{\delta}_{\text{out}}^{0}(1-\hat{\beta}^{SN})) = \frac{\frac{N^{\text{out}}_{0}(n)}{n} + \hat{\beta}^{SN} }{1-\frac{N^{\text{out}}_{0}(n)}{n} \frac{\hat{\delta}_{\text{out}}^{0}}{1+(1-\hat{\beta}^{SN})\hat{\delta}_{\text{out}}^{0}}}, $$

    where \(N^{\text {out}}_{0}(n)\) denotes the number of nodes with out-degree 0 in G(n).

  5. 5.

    Estimate γ by

    $$ \hat\gamma^{0} = \frac{\frac{N^{\text{out}}_{0}(n)}{n} + \hat{\beta}^{SN}}{1-\frac{N^{\text{out}}_{0}(n)}{n} \frac{\hat{\delta}_{\text{out}}^{0}}{1+(1-\hat{\beta}^{SN})\hat{\delta}_{\text{out}}^{0}}} - \hat{\beta}^{SN}. $$
  6. 6.

    Re-normalize the probabilities

    $$ (\hat\alpha^{SN},\hat{\beta}^{SN},\hat\gamma^{SN}) \leftarrow \left( \frac{\hat\alpha^{0}(1-\hat{\beta}^{SN})}{\hat\alpha^{0}+\hat\gamma^{0}},\hat{\beta}^{SN},\frac{\hat\gamma^{0}(1-\hat{\beta}^{SN})}{\hat\alpha^{0}+\hat\gamma^{0}}\right). $$
  7. 7.

    Solve for \(\hat {\delta }_{\text {in}}^{SN}\) from

    $$ \sum\limits_{i=0}^{\infty} \frac{N^{\text{in}}_{>i}(n)/n}{i+\hat{\delta}_{\text{in}}^{SN}}-\frac{1-\hat\alpha^{SN}-\hat{\beta}^{SN}}{\hat{\delta}_{\text{in}}^{SN}} -\frac{(\hat\alpha^{SN}+\hat{\beta}^{SN})(1-\hat{\beta}^{SN})}{1+(1-\hat{\beta}^{SN})\hat{\delta}_{\text{in}}^{SN}} = 0. $$

    Similarly, solve for \(\hat {\delta }_{\text {out}}^{SN}\) from

    $$ \sum\limits_{j=0}^{\infty} \frac{N^{\text{out}}_{>j}(n)/n}{j+\hat{\delta}_{\text{out}}^{SN}}-\frac{1-\hat\gamma^{SN}-\hat{\beta}^{SN}}{\hat{\delta}_{\text{out}}^{SN}} -\frac{(\hat\gamma^{SN}+\hat{\beta}^{SN})(1-\hat{\beta}^{SN})}{1+(1-\hat{\beta}^{SN})\hat{\delta}_{\text{out}}^{SN}} = 0. $$

Note that Step 6 ensures that

$$ \hat\alpha^{SN}+\hat{\beta}^{SN}+\hat\gamma^{SN}=1. $$

It is shown in Wan et al.(2017, Theorem 4.1) that \(\widehat {\boldsymbol \theta }^{SN}_{n} := (\hat \alpha ^{SN},\hat \beta ^{SN},\hat \gamma ^{SN}, \hat {\delta }_{\text {in}}^{SN}, \hat {\delta }_{\text {out}}^{SN})\overset {\text {a.s.}}{\longrightarrow } \boldsymbol \theta \). Its asymptotic normality and efficiency are analyzed through simulation studies in the same paper.

Appendix B: Proof of theorem 2.1

Proof

We first prove the out-degree part of Theorem 2.1. Note that

$$ \begin{array}{@{}rcl@{}} \textbf{E}\left( N^{\text{out}}_{j}(n+1)| G(n)\right)& =& N^{\text{out}}_{j}(n)+\gamma\boldsymbol{1}_{\{j=0\}}+\alpha\boldsymbol{1}_{\{j=1\}}\\ &&+ (\beta+\gamma) \left( N^{\text{out}}_{j-1}(n)\frac{j-1+\delta_{\text{out}}}{n+\delta_{\text{out}}|V^{0}(n)|} - N^{\text{out}}_{j}(n)\frac{j+\delta_{\text{out}}}{n+\delta_{\text{out}}|V^{0}(n)|}\right).\\ \end{array} $$
(B.1)

Meanwhile, by the definition of V0(n), we have

$$ |V^0(n)|+1 = N(n) \sim \text{Binomial}(n, 1-\beta). $$
(B.2)

Applying the arguments in the proof of Theorem 3.1 of Bollobás et al. (2003), it follows that the out-degree distribution of a linear superstar model coincides with that of a standard linear preferential attachment network with parameters (α,β,γ,δin,δout). Moreover,

$$ \frac{N^{\text{out}}_{j}(n)}{n} \overset{\text{a.s.}}{\longrightarrow} q^{\text{out}}_{j}, \quad j>0,\quad n\to\infty, $$

where {qj out} := {pj out} is the limiting out-degree distribution of PA(α,β,γ,δin,δout). In particular,

$$ q^{\text{out}}_{j} \sim C_{\text{out}}^{\prime} j^{-(1+\iota_{\text{out}})}\qquad\text{ as}j\to\infty, $$

for \(C_{\text {out}}^{\prime }\) positive and

$$ \iota_{\text{out}}^{-1} = \frac{\beta+\gamma}{1+\delta_{\text{out}}(\alpha+\gamma)}. $$

Next we consider the in-degree counts of non-superstar nodes. Observe also from the construction of the superstar model that

$$ |E^0(n)| \sim \text{Binomial}(n, 1-(\alpha+\beta)p). $$
(B.3)

Applying the Chernoff bound to both Eqs. B.2 and B.3 gives

$$ \begin{array}{@{}rcl@{}} \left|V^{0}(n)\right| &=& (1-\beta)n + O(n^{1/2}\log n),\\ \left|E^{0}(n)\right| &=& (1-(\alpha+\beta)p)n + O(n^{1/2}\log n). \end{array} $$

Taking expectation on both sides of Eq. B.1 then gives

$$ \begin{array}{@{}rcl@{}} &&\textbf{E} \left( (\alpha+\beta)(1-p)N^{\text{in}}_{i}(n)\frac{i+\delta_{\text{in}}}{|E^{0}(n)|+\delta_{\text{in}} |V^{0}(n)|}\right)\\ &&\quad= (\alpha+\beta)(1 - p) \frac{i+\delta_{\text{in}}}{n(1 - (\alpha+\beta)p)+\delta_{\text{in}} \cdot n(1 - \beta)} \textbf{E}(N^{\text{in}}_{i}(n)) + O(n^{-1/2}\log n).\\ \end{array} $$
(B.4)

By the rule of the superstar model, given G(n), \(N^{\text {in}}_{i}(n)\) will increase by 1 if either scenario (1b) or (2b) happens and a node with \(I_{n}^{(n)}(v) = i-1\) is chosen as the ending point of the edge. Also, it will decrease by 1 if either scenario (1b) or (2b) happens, but a node with \(I_{n}^{(n)}(v) = i\) is chosen as the ending point of the edge. Moreover, with probability α a new node with in-degree 0 will be added to the graph, and with probability γ a new node with in-degree 1 is created in the next step. Hence, \(\{N^{\text {in}}_{i}(n)\}_{n\ge 1}\) satisfies the following:

$$ \begin{array}{@{}rcl@{}} \textbf{E}\left( N^{\text{in}}_{i}(n+1)|G(n)\right) &= & N^{\text{in}}_{i}(n) + \alpha\boldsymbol{1}_{\{i=0\}}+\gamma\boldsymbol{1}_{\{i=1\}}\\ &&+ (\alpha+\beta)(1-p)N^{\text{in}}_{i-1}(n)\frac{i-1+\delta_{\text{in}}}{|E^{0}(n)|+\delta_{\text{in}}|V^{0}(n)|}\\ &&- (\alpha+\beta)(1-p)N^{\text{in}}_{i}(n)\frac{i+\delta_{\text{in}}}{|E^{0}(n)|+\delta_{\text{in}} |V^{0}(n)|}. \end{array} $$

Now let \(q^{\text {in}}_{-1} = 0\), and define \(\{q^{\text {in}}_{i}\}_{i\ge 0}\) by

$$ \begin{array}{@{}rcl@{}} q^{\text{in}}_i &=& \frac{(\alpha+\beta)(1-p)}{1-(\alpha+\beta)p+\delta_{\text{in}}(\alpha+\gamma)} \left( (i-1+\delta_{\text{in}})q^{\text{in}}_{i-1} - (i+\delta_{\text{in}})q^{\text{in}}_i\right)\\ &&+ \alpha\boldsymbol{1}_{\{i=0\}}+\gamma\boldsymbol{1}_{\{i=1\}}. \end{array} $$
(B.5)

According to the approximation in Eq. B.4, we use the same proof technique as in Bollobás et al. (2003, Theorem 3.1) to obtain

$$ \frac{N^{\text{in}}_{i}(n)}{n}\overset{\text{a.s.}}{\longrightarrow} q^{\text{in}}_{i},\qquad\text{as}\quad n\to\infty. $$

Also, solving the recursion in Eq. B.5 yields

$$ \begin{array}{@{}rcl@{}} q^{\text{in}}_{0} &=& \frac{\alpha}{1+\iota_{\text{in}}^{-1} \delta_{\text{in}}}, \end{array} $$
(B.6)
$$ \begin{array}{@{}rcl@{}} q^{\text{in}}_{1} &=& (1+\delta_{\text{in}}+\iota_{\text{in}})^{-1}\left( \frac{\alpha\delta_{\text{in}}}{1+\iota_{\text{in}}^{-1}\delta_{\text{in}}}+\frac{\gamma}{\iota_{\text{in}}^{-1}}\right),\\ q^{\text{in}}_{i} &=& \frac{{\Gamma}(i+\delta_{\text{in}})}{{\Gamma}(i+\delta_{\text{in}}+\iota_{\text{in}}+1)} \frac{{\Gamma}(2+\delta_{\text{in}}+\iota_{\text{in}})}{{\Gamma}(1+\delta_{\text{in}})} q^{\text{in}}_{1}, \quad i\ge2, \end{array} $$
(B.7)

where

$$ \iota_{\text{in}}^{-1} := \frac{(\alpha+\beta)(1-p)}{1-(\alpha+\beta)p+\delta_{\text{in}}(\alpha+\gamma)}. $$

Therefore, applying Stirling’s approximation to Eqs. B.6B.7 gives

$$ q^{\text{in}}_{i} \sim C^{\prime}_{\text{in}} i^{-(1+\iota_{\text{in}})}, \qquad \text{as}n\to\infty, $$

for some positive constant \(C^{\prime }_{\text {in}}\). This completes the proof. □

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Wan, P., Wang, T., Davis, R.A. et al. Are extreme value estimation methods useful for network data?. Extremes 23, 171–195 (2020). https://doi.org/10.1007/s10687-019-00359-x

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10687-019-00359-x

Keywords

AMS 2000 Subject Classifications

Navigation