Fast network discovery on sequence data via time-aware hashing

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

Discovering and analyzing networks from non-network data is a task with applications in fields as diverse as neuroscience, genomics, climate science, economics, and more. In domains where networks are discovered on multiple time series, the most common approach is to compute measures of association or similarity between all pairs of time series. The nodes in the resultant network correspond to time series, which are linked by edges weighted according to the association scores of their endpoints. Finally, the fully connected network is thresholded such that only the edges with stronger weights remain and the desired sparsity level is achieved. While this approach is feasible for small datasets, its quadratic (or higher) time complexity does not scale as the individual time series length and the number of compared series increase. Thus, to circumvent the inefficient and wasteful intermediary step of building a fully connected graph before network sparsification, we propose a fast network discovery approach based on probabilistic hashing. Our methods emphasize consecutiveness, or the intuition that time series following similar fluctuations in longer time-consecutive intervals are more similar overall. Evaluation on real data shows that our method can build graphs nearly 15 times faster than baselines (when the baselines do not run out of memory), while achieving accuracy comparable to, or better than, baselines in task-based evaluation. Furthermore, our proposals are general, modular, and may be applied to a variety of sequence similarity search tasks.


References

  1. Akoglu L, Tong H, Koutra D (2015) Graph based anomaly detection and description: a survey. Data Min Knowl Discov 29(3):626–688

  2. Andoni A, Indyk P (2008) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. CACM 51(1):117–122

  3. Ashkenazy Y, Ivanov PC, Havlin S, Peng C-K, Goldberger AL, Stanley HE (2001) Magnitude and sign correlations in heartbeat fluctuations. Phys Rev Lett 86(9):1900–1903

  4. Balakrishnan N, Koutras M (2002) Runs and scans with applications. Wiley, Hoboken

  5. Bassett D, Bullmore E (2009) Human brain networks in health and disease. Curr Opin Neurol 22(4):340–347

  6. Bayardo RJ, Ma Y, Srikant R (2007) Scaling up all pairs similarity search. In: Proceedings of the 16th international conference on World Wide Web, pp 131–140

  7. Brugere I, Gallagher B, Berger-Wolf TY (2018) Network structure inference, a survey: motivations, methods, and applications. ACM Comput Surv 51(2):24

  8. Bullmore E, Sporns O (2009) Complex brain networks: graph theoretical analysis of structural and functional systems. Nat Rev Neurosci 10(3):186–198

  9. Center for Biomedical Research Excellence (2012) http://fcon_1000.projects.nitrc.org/indi/retro/cobre.html

  10. Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd international conference on data engineering, ICDE ’06

  11. Chen Y, Keogh E, Hu B, Begum N, Bagnall A, Mueen A, Batista G (2015) The UCR time series classification archive. www.cs.ucr.edu/~eamonn/time_series_data/. Accessed 1 Jan 2017

  12. Dai Z, He Y (2014) Disrupted structural and functional brain connectomes in mild cognitive impairment and Alzheimer’s disease. Neurosci Bull 30(2):217–232

  13. Davidson I, Gilpin S, Carmichael O, Walker P (2013) Network discovery via constrained tensor analysis of fMRI data. In: KDD, pp 194–202

  14. Dong W, Moses C, Li K (2011) Efficient k-nearest neighbor graph construction for generic similarity measures. In: Proceedings of the 20th international conference on World Wide Web, ACM, pp 577–586

  15. Friston KJ (2011) Functional and effective connectivity: a review. Brain Connect 1(1):13–36

  16. Hallac D, Park Y, Boyd S, Leskovec J (2017) Network inference via the time-varying graphical lasso. In: KDD

  17. Heimann M, Lee W, Pan S, Chen K, Koutra D (2018) HashAlign: hash-based alignment of multiple graphs. In: Advances in knowledge discovery and data mining, 22nd Pacific-Asia conference, PAKDD 2018, Melbourne, VIC, Australia, June 3–6, 2018, Proceedings, Part III, pp 726–739

  18. Iglesias F, Kastner W (2013) Analysis of similarity measures in times series clustering for the discovery of building energy patterns. Energies 6(2):579–597

  19. Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC, pp 604–613

  20. Jäkel F, Schölkopf B, Wichmann F (2008) Similarity, kernels, and the triangle inequality. J Math Psychol 52(5):297–303

  21. Kale DC, Gong D, Che Z, Liu Y, Medioni G, Wetzel R, Ross P (2014) An examination of multivariate time series hashing with applications to health care. In: ICDM, pp 260–269

  22. Keogh E, Pazzani M (1999) An indexing scheme for fast similarity search in large time series databases. In: SSDM, pp 56–67

  23. Kim YB, Hemberg E, O’Reilly U-M (2016) Stratified locality-sensitive hashing for accelerated physiological time series retrieval. In: EMBC

  24. Kim YB, O’Reilly U-M (2015) Large-scale physiological waveform retrieval via locality-sensitive hashing. In: EMBC, pp 5829–5833

  25. Koutra D, Faloutsos C (2017) Individual and collective graph mining: principles, algorithms, and applications. In: Synthesis lectures on data mining and knowledge discovery. Morgan and Claypool Publishers

  26. Koutra D, Shah N, Vogelstein JT, Gallagher B, Faloutsos C (2016) DeltaCon: principled massive-graph similarity function with attribution. TKDD 10(3):28:1–28:43

  27. Kuo C-T, Wang X, Walker P, Carmichael O, Ye J, Davidson I (2015) Unified and contrasting cuts in multiple graphs: application to medical imaging segmentation. In: KDD, pp 617–626

  28. Leskovec J, Rajaraman A, Ullman JD (2014) Mining of massive datasets. Cambridge University Press, Cambridge

  29. Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for streaming algorithms. In: SIGMOD, pp 2–11

  30. Liu Y, Safavi T, Dighe A, Koutra D (2018) Graph summarization methods and applications: a survey. ACM Comput Surv 51(3):62:1–62:34

  31. Luo C, Shrivastava A (2016) SSH (Sketch, Shingle, and Hash) for indexing massive-scale time series. In: NIPS time series workshop

  32. Martínez V, Berzal F, Cubero J-C (2016) A survey of link prediction in complex networks. ACM Comput Surv 49(4):69:1–69:33

  33. Müller M (2007) Information retrieval for music and motion. Springer, New York

  34. Onnela J-P, Kaski K, Kertész J (2004) Clustering and information in correlation based financial networks. Eur Phys J B 38:353–362

  35. Park H-J, Friston K (2013) Structural and functional brain networks: from connections to cognition. Science 342(6158):579–589

  36. Ratanamahatana C, Keogh E, Bagnall AJ, Lonardi S (2005) A novel bit level time series representation with implication of similarity search and clustering. In: PAKDD, pp 771–777

  37. Satterthwaite T, Elliott M, Ruparel K, Loughead J, Prabhakaran K, Calkins M, Hopson R, Jackson C, Keefe J, Riley M, Mentch F, Sleiman P, Verma R, Davatzikos C, Hakonarson H, Gur R, Gur R (2014) Neuroimaging of the Philadelphia neurodevelopmental cohort. Neuroimage 86:544–553

  38. Scharwächter E, Geier F, Faber L, Müller E (2018) Low redundancy estimation of correlation matrices for time series using triangular bounds. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 458–470

  39. Shah N, Koutra D, Jin L, Zou T, Gallagher B, Faloutsos C (2017) On summarizing large-scale dynamic graphs. IEEE Data Eng Bull 40(3):75–88

  40. Shuman DI, Narang SK, Frossard P, Ortega A, Vandergheynst P (2013) The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process Mag 30(3):83–98

  41. Tsitsulin A, Mottin D, Karras P, Bronstein AM, Müller E (2018) NetLSD: hearing the shape of a graph. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining, KDD 2018, London, UK, August 19–23, 2018, pp 2347–2356

  42. Yang S, Sun Q, Ji S, Wonka P, Davidson I, Ye J (2015) Structural graphical lasso for learning mouse brain connectivity. In: KDD, pp 1385–1394

  43. Yeh C-CM, Zhu Y, Ulanova L, Begum N, Ding Y, Dau HA, Silva DF, Mueen A, Keogh E (2016) Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In: 2016 IEEE 16th international conference on data mining (ICDM), pp 1317–1322

  44. Zhang Y-M, Huang K, Geng G, Liu C-L (2013) Fast kNN graph construction with locality sensitive hashing. In: ECML PKDD, pp 660–674


Acknowledgements

We thank the anonymous reviewers for their useful comments and suggestions. This material is based upon work supported by the National Science Foundation under Grant No. IIS 1743088, Trove.AI, Google, and the University of Michigan. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding parties. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

Author information

Corresponding author

Correspondence to Tara Safavi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Window distance measure

Given that most existing similarity measures on time series capture pointwise similarity, we introduce a new baseline approach emphasizing consecutiveness: the window distance. The window distance is a simple extension of the Hamming distance, which is computed pointwise and has been shown to be a metric; we extend it here to operate on sequence windows, or subsequences.

Appendix A.1: Defining a mapping

Given a binary sequence x\(\in \{0, 1\}^n\), we can segment x into contiguous overlapping subsequences of length \(k \in [1, n - k + 1]\) and define a correspondence between x and its “window mapping” \(w_k(\mathbf x {})\):

Definition 8

(Window mapped series of x) Given a binarized time series x\(\in \{0, 1\}^n\) and an integer \(k \in [1, n - k + 1]\), x’s window mapping is

$$\begin{aligned} \begin{aligned} w_k(\mathbf x {})&= [w_k(x_1), w_k(x_2), \ldots , w_k(x_{n - k + 1})] \\&= [(x_1,\ldots ,x_{k}), (x_2,\ldots ,x_{k + 1}), \ldots , (x_{n - k + 1},\ldots ,x_n)] \end{aligned} \end{aligned}$$
(3)

The number of individual subsequences of x in \(w_k(\mathbf x {})\) will be \(n - k + 1\), and each subsequence itself will be of length k, so the total length of \(w_k(\mathbf x {})\) is \(k(n - k + 1)\).

Example 4

Given \(\mathbf x {} = \mathbf{10110101}\), \(\mathbf y {} = \mathbf{10101101}\), and \(k = 3\), the respective window mappings are \(w_3(\mathbf x {}) = [101, 011, 110, 101, 010, 101]\) and \(w_3(\mathbf y {}) = [101, 010, 101, 011, 110, 101]\).
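To make the window mapping concrete, the following is a minimal Python sketch (ours, not the authors' code) that reproduces Example 4; bit sequences are represented as 0/1 strings.

```python
def window_mapping(x: str, k: int) -> list:
    """Return the n - k + 1 overlapping length-k windows of x (Definition 8)."""
    n = len(x)
    return [x[i:i + k] for i in range(n - k + 1)]

# Example 4: the window mapping w_3 of x = 10110101
print(window_mapping("10110101", 3))
# ['101', '011', '110', '101', '010', '101']
```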

Appendix A.2: Window distance

Given two binary sequences x, y\(\in \{0, 1\}^n\) and their corresponding window mappings \(w_k(\mathbf x {})\), \(w_k(\mathbf y {})\)\(\in \{0, 1\}^{k \times (n - k + 1)}\), we can simply compute the Hamming distance between the mappings. In other words, we count the number of length-k windows at index i that do not match exactly between the sequences.

Definition 9

(Window distance measure) Given two binarized time series x, y\(\in \{0, 1\}^n\) and an integer \(k \in [1, n - k + 1]\), the window distance between the series is the number of components between the respective window mappings \(w_k(\mathbf x {})\) and \(w_k(\mathbf y {})\) that do not match exactly:

$$\begin{aligned} d_W(\mathbf x {}, \mathbf y {}){} = \sum _{i=1}^{n - k + 1}\delta _{x[i:i+k-1] \ne y[i:i+k-1] } \end{aligned}$$
(4)

In essence, we are computing the Hamming distance between vectors with \(n - k + 1\) components, each component a sequence encoding time-consecutive windows of length k in the original sequences.

Example 5

Given \(\mathbf x {} = \mathbf{10110101}\), \(\mathbf y {} = \mathbf{10101101}\), and \(k = 3\), the distance \(d_W(\mathbf x {}, \mathbf y {})\) is 4, as there are four windows that do not agree exactly:

$$\begin{aligned} w_3(\mathbf x {})&= [101, \mathbf{011}, \mathbf{110}, \mathbf{101}, \mathbf{010}, 101] \\ w_3(\mathbf y {})&= [101, \mathbf{010}, \mathbf{101}, \mathbf{011}, \mathbf{110}, 101] \end{aligned}$$

Window similarity We can turn the window distance into a normalized similarity score between 0 and 1, which is useful for creating weighted similarity graphs, by subtracting the normalized observed distance between x and y from 1. The normalized distance is found by dividing by \(n - k + 1\). In the example above, the similarity between x and y is \(1 - \frac{4}{6} = \frac{1}{3}\).

Definition 10

(Window similarity measure) Given two binarized time series x, y\(\in \{0, 1\}^n\) and an integer \(k \in [1, n - k + 1]\), the window similarity between the series is

$$\begin{aligned} s_W(\mathbf x {}, \mathbf y {}){}&= 1 - \frac{d_W(\mathbf x {}, \mathbf y {}){}}{n - k + 1} \end{aligned}$$
(5)
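As a hedged illustration of Definitions 9 and 10 (a sketch we wrote against the formulas above, not the paper's implementation), both quantities can be computed directly from the windows; the example values \(d_W = 4\) and \(s_W = \frac{1}{3}\) are reproduced below.

```python
def window_distance(x: str, y: str, k: int) -> int:
    """Definition 9: count the indices i whose length-k windows differ."""
    assert len(x) == len(y)
    n = len(x)
    return sum(x[i:i + k] != y[i:i + k] for i in range(n - k + 1))

def window_similarity(x: str, y: str, k: int) -> float:
    """Definition 10: normalized similarity 1 - d_W / (n - k + 1)."""
    n = len(x)
    return 1 - window_distance(x, y, k) / (n - k + 1)

x, y = "10110101", "10101101"
print(window_distance(x, y, 3))    # 4, as in Example 5
print(window_similarity(x, y, 3))  # 1 - 4/6 = 0.333...
```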

Appendix A.3: Metric criteria

The window distance satisfies the criteria for a metric in the same way that the Hamming distance does.

  1. Identity: \(d_W(\mathbf x {}, \mathbf y {}){} = 0 \leftrightarrow \mathbf x {} = \mathbf y {}\). If \(\mathbf x {}\) and \(\mathbf y {}\) are the same, all of their windows agree. Likewise, if all of the windows are the same, \(\mathbf x {}\) and \(\mathbf y {}\) are the same.

  2. Non-negativity: \(d_W(\mathbf x {}, \mathbf y {}){} \ge 0\). The smallest number of windows that can disagree between two equal-length bit sequences \(\mathbf x {}\) and \(\mathbf y {}\) is 0.

  3. Symmetry: \(d_W(\mathbf x {}, \mathbf y {}){} = d_W(\mathbf y {}, \mathbf x {})\). The distance does not depend on which sequence is considered first.

  4. Triangle inequality: \(d_W(\mathbf x {}, \mathbf y {}){} \le d_W(\mathbf x {}, \mathbf z {}) + d_W(\mathbf z {}, \mathbf y {})\). This measure is a version of the Hamming distance, which satisfies the triangle inequality [28]. Essentially, if a is the number of windows that disagree between \(w_k(\mathbf x {})\) and \(w_k(\mathbf z {})\), and b is the number of windows that disagree between \(w_k(\mathbf z {})\) and \(w_k(\mathbf y {})\), then the number of windows that disagree between \(w_k(\mathbf x {})\) and \(w_k(\mathbf y {})\) cannot exceed \(a + b\).

Appendix A.4: Window-LSH

Our proposed window sampling LSH family (Sect. 5.2) using the window metric rather than ABC distance is \((d_1, d_2, 1 - \frac{d_1}{n - k + 1}, 1 - \frac{d_2}{n - k + 1})\)-sensitive. As was the case with Hamming distance, we normalize the distances \(d_1\) and \(d_2\) by dividing by the maximum distance, \(n - k + 1\), and then subtract from 1 to turn the distance into a probability.
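The family itself is defined in Sect. 5.2 and is not reproduced here, but the sensitivity claim can be illustrated with a plausible window-sampling sketch: each hash function fixes a random start index and returns the length-k window there, so two series collide exactly when that window agrees, i.e., with probability \(1 - \frac{d_W}{n - k + 1}\) over the choice of index. The helper `make_window_hash` below is our own illustrative name, not the paper's API.

```python
import random

def make_window_hash(n: int, k: int, rng: random.Random):
    """One hash function: sample a start index once, return that window."""
    i = rng.randrange(n - k + 1)
    return lambda x: x[i:i + k]

rng = random.Random(0)
hashes = [make_window_hash(8, 3, rng) for _ in range(10000)]
x, y = "10110101", "10101101"
print(sum(h(x) == h(y) for h in hashes) / 10000)  # ~ 1 - 4/6 = 0.333
```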

Appendix B: Metric proof of ABC

Here we show that ABC distance satisfies the metric properties and is thus eligible for LSH.

Appendix B.1: Properties of agreeing runs

We first study the relationship between p, the number of agreeing runs between x and y, and the maximum value of \(k_1 + \cdots + k_p\), the lengths of the p agreeing runs. The maximum sum of all \(k_i\) decreases linearly as p increases.

Lemma 1

(Maximum sum of lengths of p runs \(k_1, \ldots , k_p\)) Given x, y \(\in \{0, 1\}^n\) with p agreeing runs of lengths \(k_1, \ldots , k_p\), the maximum of \(\sum _{i=1}^p k_i\) is \(n - p + 1\), which decreases linearly as p increases.

Proof

We show that the maximum value of \(\sum _1^p k_i\) must decrease as p increases.

  1. If \(p = 1\), then \(k_1 \le n\). In other words, if there is a single matching run between x and y, its length can be anywhere from 1 bit to n bits.

  2. If \(p = 2\), then \(k_1 + k_2 \le n - 1\). Proof by contradiction: assume \(k_1 + k_2 = n\). Then there is a matching run of \(k_1\) bits between x and y, and the remaining \(n - k_1 = k_2\) bits of x and y are also a matching run, which means that the two matching runs are consecutive, making them one long run. This means \(p = 1\), which contradicts the initial assumption that \(p = 2\). In other words, \(k_1 + k_2 \le n - 1\) because there must be at least one bit separating the run of length \(k_1\) and the run of length \(k_2\).

  3. If \(p = 3\), then \(k_1 + k_2 + k_3 \le n - 2\). This follows from the rule above, since if \(k_1 + k_2 + k_3\) were \(n - 1\), one pair of runs would have to be merged, making \(p = 2\).

  4. Following the observations above, \(\sum _{i=1}^p k_i \le n - p + 1\), with equality only when \(\sum _{i=1}^p k_i\) is maximized. Thus, the maximum sum of all \(k_i\) decreases linearly as p increases.

\(\square \)
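Lemma 1 is easy to spot-check exhaustively for small n. The sketch below (our illustration, under the assumption that agreeing runs are the maximal blocks of positions where x and y carry the same bit) enumerates all pairs of length-8 sequences.

```python
import itertools

def agreeing_runs(x: str, y: str) -> list:
    """Lengths k_1, ..., k_p of the maximal agreeing runs between x and y."""
    runs, length = [], 0
    for a, b in zip(x, y):
        if a == b:
            length += 1
        else:
            if length:
                runs.append(length)
            length = 0
    if length:
        runs.append(length)
    return runs

n = 8
for bits in itertools.product("01", repeat=2 * n):
    x, y = "".join(bits[:n]), "".join(bits[n:])
    runs = agreeing_runs(x, y)
    p = len(runs)
    assert p == 0 or sum(runs) <= n - p + 1  # Lemma 1
print("Lemma 1 holds for all pairs of length", n)
```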

Lemma 2

(Relationship between p and \(\sum _1^p k_i\)) Given x, y \(\in \{0, 1\}^n\) with p agreeing runs of lengths \(k_i\), we have \(\sum _{i=1}^p k_i \le n - p + 1\); the number of bits needed to separate the runs grows with p.

Next, we investigate the values of \(k_1, \ldots , k_p\) themselves that achieve the maximum similarity \(S_{p}(\mathbf x {}, \mathbf y {}){}\) for a given p. First, we observe that to maximize similarity for a given p, the lengths of the p runs must be

$$\begin{aligned} \begin{aligned} k_1 = \cdots = k_{p - 1} = 1 \\ k_p = n - 2p + 2 \end{aligned} \end{aligned}$$
(6)

Intuitively, this observation makes sense: making \(p - 1\) agreeing runs as short as possible maximizes the length of the p-th run, which contributes the common ratio \((1 + \alpha )\) raised to the greatest possible exponent. Furthermore, note that for any value of p above, \(\sum _{i=1}^p k_i = n - p + 1\), because \((p - 1) + n - 2p + 2 = n - p + 1\). This fits our previous observations: since the similarity is a summation of positive terms, we want to maximize both the number of runs and the sum of their lengths.

Lemma 3

(Maximum ABC similarity) Given x, y\(\in \{0, 1\}^n\) with p agreeing runs, each of length \(k_i\), x and y have maximum ABC similarity when they agree on (without loss of generality, the first) \(p-1\) runs of length \(k_1 = \cdots = k_{p-1} = 1\) and one run of length \(k_p= n - 2p + 2\).

Proof

To confirm the intuition, we prove by induction that \(k_1 = \cdots = k_{p - 1} = 1\) and \(k_p = n - 2p + 2\) for maximum similarity given a p and x, y\(\in \{0, 1\}^n\):

  1. Base case: \(p = 1\). The similarity between x and y is maximized when x and y are exactly equal, so the length of the single run is n. Thus, \(k_p = k_1 = n = n - 2(1) + 2\), which matches \(k_p = n - 2p + 2\).

  2. Inductive step: assume that for some p, the similarity between x and y is maximized when \(k_1 = \cdots = k_{p - 1} = 1\) and \(k_p = n - 2p + 2\). We show that this implies that for \(p + 1\), the similarity is maximized when \(k_1 = \cdots = k_{p} = 1\) and \(k_{p + 1} = n - 2(p + 1) + 2\), assuming the same constraint \(p + 1 \le \frac{n}{2}\). Since by the inductive hypothesis the similarity over the first p runs is maximized when \(k_1 = \cdots = k_{p - 1} = 1\) and \(k_{p} = n - 2p + 2\), the best we can do when adding a new run is to take one bit out of the long run (the run of length \(k_p\)) to form the \((p + 1)\)-th run, and remove another bit from the long run so that the \((p + 1)\)-th run and the long run are not consecutive. We are left with \(p = (p + 1) - 1\) runs of length 1 and a long run of length \(n - 2p = n - 2(p + 1) + 2\). Thus the inductive hypothesis for p implies the hypothesis for \(p + 1\), so by induction the similarity between x and y is maximized when \(k_1 = \cdots = k_{p - 1} = 1\) and \(k_p = n - 2p + 2\).

\(\square \)
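The maximizing configuration can also be confirmed by brute force. The sketch below assumes, per the similarity formula underlying Eq. (7), that p runs of lengths \(k_1, \ldots , k_p\) contribute \(\sum _i ((1 + \alpha )^{k_i} - 1)/\alpha \), and that Lemma 1 bounds \(\sum _i k_i\) by \(n - p + 1\); the values \(n = 10\) and \(\alpha = 0.5\) are arbitrary choices.

```python
import itertools

def abc_similarity(runs, alpha):
    """Similarity contributed by agreeing runs of the given lengths."""
    return sum(((1 + alpha) ** k - 1) / alpha for k in runs)

def max_similarity_runs(n, p, alpha):
    """Exhaustive search over feasible run lengths (k_i >= 1, sum <= n - p + 1)."""
    best_s, best_ks = -1.0, None
    for ks in itertools.product(range(1, n - p + 2), repeat=p):
        s = abc_similarity(ks, alpha)
        if sum(ks) <= n - p + 1 and s > best_s:
            best_s, best_ks = s, ks
    return sorted(best_ks)

n, alpha = 10, 0.5
for p in range(1, n // 2 + 1):
    # Lemma 3: p - 1 runs of length 1 plus one run of length n - 2p + 2
    assert max_similarity_runs(n, p, alpha) == sorted([1] * (p - 1) + [n - 2 * p + 2])
print("Lemma 3 verified for n =", n)
```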

Definition 11

(Maximum ABC similarity given p) Based on Lemma 3, the maximum ABC similarity \(S_{p}(\mathbf x , \mathbf y )\) between two binary time series \(\mathbf x \) and \(\mathbf y \) with p agreeing runs is

$$\begin{aligned} \begin{aligned} S_{p}(\mathbf x , \mathbf y )&= (p - 1) + \frac{(1 + \alpha )^{n - 2p + 2} - 1}{\alpha } \end{aligned} \end{aligned}$$
(7)

Likewise, the minimum ABC distance between two binary sequences \(\mathbf x \) and \(\mathbf y \) given p is

$$\begin{aligned} \begin{aligned} \min _p d(\mathbf x , \mathbf y )&= \frac{(1 + \alpha )^{n} - (1 + \alpha )^{n - 2p + 2}}{\alpha } - p + 1 \end{aligned} \end{aligned}$$
(8)
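Since Eq. (8) is simply the gap between the global maximum similarity \(((1 + \alpha )^n - 1)/\alpha \) and the maximum similarity of Eq. (7), the two expressions can be cross-checked numerically; a minimal sketch with arbitrary \(\alpha = 0.5\) and \(n = 16\):

```python
alpha, n = 0.5, 16
s_max = ((1 + alpha) ** n - 1) / alpha  # similarity of two identical series

for p in range(1, n // 2 + 1):
    s_p = (p - 1) + ((1 + alpha) ** (n - 2 * p + 2) - 1) / alpha                  # Eq. (7)
    min_d = ((1 + alpha) ** n - (1 + alpha) ** (n - 2 * p + 2)) / alpha - p + 1   # Eq. (8)
    assert abs((s_max - s_p) - min_d) < 1e-6
print("Eqs. (7) and (8) are consistent for n =", n)
```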

Appendix B.2: Proving the triangle inequality for ABC distance

Our main result that enables scalable network discovery is that ABC distance is a metric satisfying the triangle inequality.

Proof

We begin with the base case where \(n = 1\): x, y, and z are single bits.

  1. x, y, and z are all the same: \(d(\mathbf x {}, \mathbf y {}){} = d(\mathbf x {}, \mathbf z {}){} = d(\mathbf z {}, \mathbf y {}){} = 0\), and \(0 \le 0\).

  2. x and y are the same, z is different: \(d(\mathbf x {}, \mathbf y {}){} = 0\), \(d(\mathbf x {}, \mathbf z {}){} = 1\), and \(d(\mathbf z {}, \mathbf y {}){} = 1\), so \(0 \le 2\).

  3. x and z are the same, y is different: \(d(\mathbf x {}, \mathbf y {}){} = 1\), \(d(\mathbf x {}, \mathbf z {}){} = 0\), and \(d(\mathbf z {}, \mathbf y {}){} = 1\), so \(1 \le 1\).

  4. y and z are the same, x is different: \(d(\mathbf x {}, \mathbf y {}){} = 1\), \(d(\mathbf x {}, \mathbf z {}){} = 1\), and \(d(\mathbf z {}, \mathbf y {}){} = 0\), so \(1 \le 1\).

Next, we move to the inductive step. Assume that \(d(\mathbf x {}, \mathbf y {}){} \le d(\mathbf x {}, \mathbf z {}){} + d(\mathbf z {}, \mathbf y {}){}\) holds for some \(n \ge 1\). We show that the inequality then holds for binarized series of length \(n + 1\), which are constructed by appending a single bit to the end of each of x, y, and z to create \(\mathbf x {}'\), \(\mathbf y {}'\), and \(\mathbf z {}'\), respectively.

Setup We begin with preliminaries. First, we denote the distance between x and y\(\in \{0, 1\}^n\) as \(d(\mathbf x {}, \mathbf y {}){}\):

$$\begin{aligned} d{}(\mathbf x {}, \mathbf y {}){}&= \frac{(1 + \alpha )^n - (1 + \alpha )^{k_1} - \cdots - (1 + \alpha )^{k_p} + p - 1}{\alpha } \end{aligned}$$

Next, we denote the distance between \(\mathbf x {}'{}\) and \(\mathbf y {}'{}\)\(\in \{0, 1\}^{n + 1}\) as \(d'(\mathbf x {}'{}, \mathbf y {}'{})\):

$$\begin{aligned} d'(\mathbf x {}'{}, \mathbf y {}'{})&= \frac{(1 + \alpha )^{n + 1} - (1 + \alpha )^{k_1} - \cdots - (1 + \alpha )^{k_{p'}} + p' - 1}{\alpha } \end{aligned}$$

Here, \(p'\) can either be p or \(p+1\): \(p' = p\) in the case that the \(n + 1\)-th bit either appends onto an existing run between x and y, or else disagrees between the two sequences, and \(p' = p + 1\) in the case that the \(n + 1\)-th bit creates a new run of length 1 between x and y.

We now examine how distance between x and y changes by adding a single bit to the end of x and y: in other words, moving from n to \(n + 1\). We denote this change in distance \(\Delta _{(i)}\) for \(i = 1, 2, \text {or } 3\).

  1. Case 1: the \((n + 1)\)-th bits of x and y agree, creating a new run of length one between the sequences. Here \(p' = p + 1\), so \(k_{p'} = 1\) and \((1 + \alpha )^{k_{p'}} = (1 + \alpha )\).

    $$\begin{aligned} \Delta _{(1)}&= d'(\mathbf x {}'{}, \mathbf y {}'{}) - d(\mathbf x {}, \mathbf y {}){} \\&= \frac{(1 + \alpha )^{n + 1} - (1 + \alpha )^{k_1} - \cdots - (1 + \alpha )^{k_{p}} - (1 + \alpha ) + (p + 1) - 1}{\alpha } \\&\quad -\frac{(1 + \alpha )^n - (1 + \alpha )^{k_1} - \cdots - (1 + \alpha )^{k_p} + p - 1}{\alpha } \\&= \frac{(1 + \alpha )^{n + 1} - (1 + \alpha )^{n} - (1 + \alpha ) + 1}{\alpha } \\&= \frac{(1 + \alpha - 1)(1 + \alpha )^{n} - \alpha }{\alpha } \\&= (1 + \alpha )^n - 1 \end{aligned}$$

    Intuitively, this result means that the maximum similarity \(S(\mathbf x {}, \mathbf y {}){}\) increases by \((1 + \alpha )^n\), and from this we subtract a new agreeing run of length 1. In other words, we subtract \((1 + \alpha )^0\) from the new maximum similarity, since the exponent of a new run always begins at 0. Thus, overall the distance changes by \((1 + \alpha )^n - (1 + \alpha )^0 = (1 + \alpha )^n - 1\).

  2. Case 2: the \((n + 1)\)-th bits of x and y agree, appending onto an existing run of length \(k_p\) between the sequences. Here \(p' = p\) and \(k_{p'} = k_{p} + 1\), so \((1 + \alpha )^{k_{p'}} = (1 + \alpha )^{k_{p} + 1}\).

    $$\begin{aligned} \Delta _{(2)}&= d'(\mathbf x {}'{}, \mathbf y {}'{}) - d(\mathbf x {}, \mathbf y {}){} \\&= \frac{(1 + \alpha )^{n + 1} - (1 + \alpha )^{k_1} - \cdots - (1 + \alpha )^{k_{p} + 1} + p - 1}{\alpha } \\&\quad - \frac{(1 + \alpha )^n - (1 + \alpha )^{k_1} - \cdots - (1 + \alpha )^{k_p} + p - 1}{\alpha } \\&= \frac{(1 + \alpha )^{n + 1} - (1 + \alpha )^{n} - (1 + \alpha )^{k_{p} + 1} + (1 + \alpha )^{k_p}}{\alpha } \\&= \frac{(1 + \alpha - 1)(1 + \alpha )^{n} - (1 + \alpha - 1)(1 + \alpha )^{k_p}}{\alpha } \\&= (1 + \alpha )^n - (1 + \alpha )^{k_p} \end{aligned}$$

  3. Case 3: the \((n + 1)\)-th bits of x and y disagree. Here \(p' = p\) and \(k_{p'} = k_p\).

    $$\begin{aligned} \Delta _{(3)}&= d'(\mathbf x {}'{}, \mathbf y {}'{}) - d(\mathbf x {}, \mathbf y {}){} \\&= \frac{(1 + \alpha )^{n + 1} - (1 + \alpha )^{k_1} - \cdots - (1 + \alpha )^{k_{p}} + p - 1}{\alpha } \\&\quad - \frac{(1 + \alpha )^n - (1 + \alpha )^{k_1} - \cdots - (1 + \alpha )^{k_p} + p - 1}{\alpha } \\&= \frac{(1 + \alpha )^{n + 1} - (1 + \alpha )^{n}}{\alpha } \\&= \frac{(1 + \alpha - 1)(1 + \alpha )^{n}}{\alpha } \\&= (1 + \alpha )^n \end{aligned}$$
Table 6 Enumeration of possible cases for the \(n + 1\)-th bit in \(\mathbf x {}'{}\), \(\mathbf y {}'{}\), and \(\mathbf z {}'{}\)

Enumeration of cases With the preliminaries established, we list all cases that can occur when we add a single bit to x, y, and z to obtain \(\mathbf x {}'{}\), \(\mathbf y {}'{}\), and \(\mathbf z {}'{}\).

Let n stand for “new run” (case (1) above with \(\Delta _{(1)}\)), a stand for “append to existing run” (case (2) with \(\Delta _{(2)}\)), and d stand for “disagree” (case (3) with \(\Delta _{(3)}\)). There are \(3 \times 3 \times 3\) length-3 permutations with repetition of n, a, and d for three series \(\mathbf x {}'{}\), \(\mathbf y {}'{}\), and \(\mathbf z {}'{}\), although not all are possible in practice: in fact, only 10 out of the 27 cases are feasible. In Table 6 we either explain why each case is impossible or else show that the triangle inequality holds.

We have shown by induction on n, the length of the binary subsequences compared, that the triangle inequality holds for the ABC distance measure because the amounts by which the distances change for any combination of new bits appended to x, y, and z satisfy the triangle inequality. Thus the ABC distance satisfies the metric properties. \(\square \)
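As an empirical complement to the proof, the sketch below (our illustration, not the authors' code) computes the ABC distance from the run decomposition, as in the Setup above, and spot-checks the triangle inequality on random triples.

```python
import random

def abc_distance(x: str, y: str, alpha: float) -> float:
    """d(x, y) = ((1+a)^n - sum_i (1+a)^{k_i} + p - 1) / a over agreeing runs."""
    runs, length = [], 0
    for a, b in zip(x, y):
        if a == b:
            length += 1
        else:
            if length:
                runs.append(length)
            length = 0
    if length:
        runs.append(length)
    s = sum(((1 + alpha) ** k - 1) / alpha for k in runs)
    return ((1 + alpha) ** len(x) - 1) / alpha - s

rng = random.Random(0)
for _ in range(10000):
    n = rng.randrange(1, 16)
    x, y, z = ("".join(rng.choice("01") for _ in range(n)) for _ in range(3))
    assert abc_distance(x, y, 0.5) <= abc_distance(x, z, 0.5) + abc_distance(z, y, 0.5) + 1e-9
print("No triangle-inequality violations found")
```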

About this article

Cite this article

Safavi, T., Sripada, C. & Koutra, D. Fast network discovery on sequence data via time-aware hashing. Knowl Inf Syst 61, 987–1017 (2019). https://doi.org/10.1007/s10115-018-1293-8
