Abstract
Discovering and analyzing networks from non-network data is a task with applications in fields as diverse as neuroscience, genomics, climate science, and economics. In domains where networks are discovered on multiple time series, the most common approach is to compute measures of association or similarity between all pairs of time series. The nodes in the resultant network correspond to time series, which are linked by edges weighted according to the association scores of their endpoints. Finally, the fully connected network is thresholded so that only the most strongly weighted edges remain and the desired sparsity level is achieved. While this approach is feasible for small datasets, its quadratic (or higher) time complexity prevents it from scaling as the length of the individual time series and the number of compared series increase. Thus, to circumvent the inefficient and wasteful intermediary step of building a fully connected graph before network sparsification, we propose a fast network discovery approach based on probabilistic hashing. Our methods emphasize consecutiveness, or the intuition that time series following similar fluctuations over longer time-consecutive intervals are more similar overall. Evaluation on real data shows that our method can build graphs nearly 15 times faster than baselines (when the baselines do not run out of memory), while achieving accuracy comparable to, or better than, baselines in task-based evaluation. Furthermore, our proposals are general, modular, and may be applied to a variety of sequence similarity search tasks.
References
Akoglu L, Tong H, Koutra D (2015) Graph based anomaly detection and description: a survey. Data Min Knowl Discov 29(3):626–688
Andoni A, Indyk P (2008) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. CACM 51(1):117–122
Ashkenazy Y, Ivanov PC, Havlin S, Peng C-K, Goldberger AL, Stanley HE (2001) Magnitude and sign correlations in heartbeat fluctuations. Phys Rev Lett 86(9):1900–1903
Balakrishnan N, Koutras M (2002) Runs and scans with applications. Wiley, Hoboken
Bassett D, Bullmore E (2009) Human brain networks in health and disease. Curr Opin Neurol 22(4):340–347
Bayardo RJ, Ma Y, Srikant R (2007) Scaling up all pairs similarity search. In: Proceedings of the 16th international conference on world wide web, pp 131–140
Brugere I, Gallagher B, Berger-Wolf TY (2018) Network structure inference, a survey: motivations, methods, and applications. ACM Comput Surv (CSUR) 51(2):24
Bullmore E, Sporns O (2009) Complex brain networks: graph theoretical analysis of structural and functional systems. Nat Rev Neurosci 10(3):186–198
Center for Biomedical Research Excellence (2012) http://fcon_1000.projects.nitrc.org/indi/retro/cobre.html
Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd international conference on data engineering. ICDE ’06
Chen Y, Keogh E, Hu B, Begum N, Bagnall A, Mueen A, Batista G (2015) The UCR time series classification archive. www.cs.ucr.edu/~eamonn/time_series_data/. Accessed 1 Jan 2017
Dai Z, He Y (2014) Disrupted structural and functional brain connectomes in mild cognitive impairment and Alzheimer’s disease. Neurosci Bull 30(2):217–232
Davidson I, Gilpin S, Carmichael O, Walker P (2013) Network discovery via constrained tensor analysis of fMRI data. In: KDD, pp 194–202
Dong W, Moses C, Li K (2011) Efficient k-nearest neighbor graph construction for generic similarity measures. In: Proceedings of the 20th international conference on World wide web, ACM, pp 577–586
Friston KJ (2011) Functional and effective connectivity: a review. Brain Connect 1(1):13–36
Hallac D, Park Y, Boyd S, Leskovec J (2017) Network inference via the time-varying graphical lasso. In: KDD
Heimann M, Lee W, Pan S, Chen K, Koutra D (2018) Hashalign: Hash-based alignment of multiple graphs. In: Advances in knowledge discovery and data mining—22nd Pacific-Asia conference, PAKDD 2018, Melbourne, VIC, Australia, June 3–6, 2018, Proceedings, Part III, pp 726–739
Iglesias F, Kastner W (2013) Analysis of similarity measures in times series clustering for the discovery of building energy patterns. Energies 6(2):579–597
Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC, pp 604–613
Jäkel F, Schölkopf B, Wichmann F (2008) Similarity, kernels, and the triangle inequality. J Math Psychol 52(5):297–303
Kale DC, Gong D, Che Z, Liu Y, Medioni G, Wetzel R, Ross P (2014) An examination of multivariate time series hashing with applications to health care. In: ICDM, pp 260–269
Keogh E, Pazzani M (1999) An indexing scheme for fast similarity search in large time series databases. In: SSDBM, pp 56–67
Kim YB, Hemberg E, O’Reilly U-M (2016) Stratified locality-sensitive hashing for accelerated physiological time series retrieval. In: EMBC
Kim YB, O’Reilly U-M (2015) Large-scale physiological waveform retrieval via locality-sensitive hashing. In: EMBC, pp 5829–5833
Koutra D, Faloutsos C (2017) Individual and collective graph mining: principles, algorithms, and applications. In: Synthesis lectures on data mining and knowledge discovery. Morgan and Claypool Publishers
Koutra D, Shah N, Vogelstein JT, Gallagher B, Faloutsos C (2016) Deltacon: principled massive-graph similarity function with attribution. TKDD 10(3):28:1–28:43
Kuo C-T, Wang X, Walker P, Carmichael O, Ye J, Davidson I (2015) Unified and contrasting cuts in multiple graphs: application to medical imaging segmentation. In: KDD, pp 617–626
Leskovec J, Rajaraman A, Ullman JD (2014) Mining of massive datasets. Cambridge University Press, Cambridge
Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for streaming algorithms. In: SIGMOD, pp 2–11
Liu Y, Safavi T, Dighe A, Koutra D (2018) Graph summarization methods and applications: a survey. ACM Comput Surv 51(3):62:1–62:34
Luo C, Shrivastava A (2016) SSH (Sketch, Shingle, and Hash) for indexing massive-scale time series. In: NIPS time series workshop
Martínez V, Berzal F, Cubero J-C (2016) A survey of link prediction in complex networks. ACM Comput Surv 49(4):69:1–69:33
Müller M (2007) Information retrieval for music and motion. Springer, New York
Onnela J-P, Kaski K, Kertész J (2004) Clustering and information in correlation based financial networks. Eur Phys J B 38:353–362
Park H-J, Friston K (2013) Structural and functional brain networks: from connections to cognition. Science 342(6158):579–589
Ratanamahatana C, Keogh E, Bagnall AJ, Lonardi S (2005) A novel bit level time series representation with implication of similarity search and clustering. In: PAKDD, pp 771–777
Satterthwaite T, Elliott M, Ruparel K, Loughead J, Prabhakaran K, Calkins M, Hopson R, Jackson C, Keefe J, Riley M, Mentch F, Sleiman P, Verma R, Davatzikos C, Hakonarson H, Gur R, Gur R (2014) Neuroimaging of the Philadelphia neurodevelopmental cohort. Neuroimage 86:544–553
Scharwächter E, Geier F, Faber L, Müller E (2018) Low redundancy estimation of correlation matrices for time series using triangular bounds. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 458–470
Shah N, Koutra D, Jin L, Zou T, Gallagher B, Faloutsos C (2017) On summarizing large-scale dynamic graphs. IEEE Data Eng Bull 40(3):75–88
Shuman DI, Narang SK, Frossard P, Ortega A, Vandergheynst P (2013) The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process Mag 30(3):83–98
Tsitsulin A, Mottin D, Karras P, Bronstein AM, Müller E (2018) Netlsd: hearing the shape of a graph. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining, KDD 2018, London, UK, August 19–23, 2018, pp 2347–2356
Yang S, Sun Q, Ji S, Wonka P, Davidson I, Ye J (2015) Structural graphical lasso for learning mouse brain connectivity. In: KDD, pp 1385–1394
Yeh C-CM, Zhu Y, Ulanova L, Begum N, Ding Y, Dau HA, Silva DF, Mueen A, Keogh E (2016) Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In: 2016 IEEE 16th international conference on data mining (ICDM), pp 1317–1322
Zhang Y-M, Huang K, Geng G, Liu C-L (2013) Fast kNN graph construction with locality sensitive hashing. In: ECML PKDD, pp 660–674
Acknowledgements
We thank the anonymous reviewers for their useful comments and suggestions. This material is based upon work supported by the National Science Foundation under Grant No. IIS 1743088, Trove.AI, Google, and the University of Michigan. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding parties. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Window distance measure
Given that most existing similarity measures on time series capture pointwise similarity, we introduce a new baseline approach emphasizing consecutiveness: the window distance. Window distance is a simple extension of Hamming distance, which is a metric and is computed pointwise; we extend it here to operate on sequence windows, or subsequences.
1.1 Appendix A.1: Defining a mapping
Given a binary sequence x\(\in \{0, 1\}^n\), we can segment x into contiguous overlapping subsequences of length \(k \in [1, n - k + 1]\) and define a correspondence between x and its “window mapping” \(w_k(\mathbf x {})\):
Definition 8
(Window mapped series of \(\mathbf{x}\)) Given a binarized time series \(\mathbf{x} \in \{0, 1\}^n\) and an integer \(k \in [1, n - k + 1]\), \(\mathbf{x}\)'s window mapping is
$$w_k(\mathbf{x}) = \left[\mathbf{x}_{1:k},\; \mathbf{x}_{2:k+1},\; \ldots,\; \mathbf{x}_{n-k+1:n}\right],$$
where \(\mathbf{x}_{i:i+k-1}\) denotes the length-k window of \(\mathbf{x}\) starting at index i.
The number of individual subsequences of x in \(w_k(\mathbf x {})\) will be \(n - k + 1\), and each subsequence itself will be of length k, so the total length of \(w_k(\mathbf x {})\) is \(k(n - k + 1)\).
Example 4
Given \(\mathbf x {} = \mathbf{10110101}\), \(\mathbf y {} = \mathbf{10101101}\), and \(k = 3\), the respective window mappings are \(w_3(\mathbf x {}) = [101, 011, 110, 101, 010, 101]\) and \(w_3(\mathbf y {}) = [101, 010, 101, 011, 110, 101]\).
1.2 Appendix A.2: Window distance
Given two binary sequences x, y\(\in \{0, 1\}^n\) and their corresponding window mappings \(w_k(\mathbf x {})\), \(w_k(\mathbf y {})\)\(\in \{0, 1\}^{k \times (n - k + 1)}\), we can simply compute the Hamming distance between the mappings. In other words, we count the number of length-k windows at index i that do not match exactly between the sequences.
Definition 9
(Window distance measure) Given two binarized time series \(\mathbf{x}, \mathbf{y} \in \{0, 1\}^n\) and an integer \(k \in [1, n - k + 1]\), the window distance between the series is the number of components of the respective window mappings \(w_k(\mathbf{x})\) and \(w_k(\mathbf{y})\) that do not match exactly:
$$d_W(\mathbf{x}, \mathbf{y}) = \left|\left\{\, i \in [1, n - k + 1] : w_k(\mathbf{x})_i \ne w_k(\mathbf{y})_i \,\right\}\right|.$$
In essence, we are computing the Hamming distance between vectors with \(n - k + 1\) components, each component a sequence encoding time-consecutive windows of length k in the original sequences.
Example 5
Given \(\mathbf{x} = \mathbf{10110101}\), \(\mathbf{y} = \mathbf{10101101}\), and \(k = 3\), the distance \(d_W(\mathbf{x}, \mathbf{y})\) is 4, as four windows do not agree exactly: comparing \(w_3(\mathbf{x}) = [101, 011, 110, 101, 010, 101]\) and \(w_3(\mathbf{y}) = [101, 010, 101, 011, 110, 101]\) componentwise, the windows at indices 2 through 5 disagree, while those at indices 1 and 6 agree.
Window similarity. We can turn the window distance into a normalized similarity score between 0 and 1, which is useful for creating weighted similarity graphs, by subtracting the normalized distance between \(\mathbf{x}\) and \(\mathbf{y}\) from 1; the distance is normalized by dividing by \(n - k + 1\), the number of windows. In the example above, the similarity between \(\mathbf{x}\) and \(\mathbf{y}\) is \(1 - \frac{4}{6} = \frac{1}{3}\).
Definition 10
(Window similarity measure) Given two binarized time series \(\mathbf{x}, \mathbf{y} \in \{0, 1\}^n\) and an integer \(k \in [1, n - k + 1]\), the window similarity between the series is
$$s_W(\mathbf{x}, \mathbf{y}) = 1 - \frac{d_W(\mathbf{x}, \mathbf{y})}{n - k + 1}.$$
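To make the definitions concrete, here is a minimal Python sketch (our own illustration; the function names are ours, not the paper's) that reproduces Examples 4 and 5:

```python
def window_mapping(x, k):
    """All contiguous length-k windows of the binary sequence x (Definition 8)."""
    return [tuple(x[i:i + k]) for i in range(len(x) - k + 1)]

def window_distance(x, y, k):
    """Number of window indices at which x and y differ (Definition 9)."""
    return sum(wx != wy for wx, wy in zip(window_mapping(x, k), window_mapping(y, k)))

def window_similarity(x, y, k):
    """Normalized window similarity (Definition 10)."""
    return 1 - window_distance(x, y, k) / (len(x) - k + 1)

x = [1, 0, 1, 1, 0, 1, 0, 1]  # 10110101
y = [1, 0, 1, 0, 1, 1, 0, 1]  # 10101101
assert window_distance(x, y, 3) == 4                   # Example 5
assert abs(window_similarity(x, y, 3) - 1/3) < 1e-12   # 1 - 4/6 = 1/3
```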
1.3 Appendix A.3: Metric criteria
The window distance satisfies the criteria for a metric in the same way that the Hamming distance does.
1. Identity: \(d_W(\mathbf{x}, \mathbf{y}) = 0 \leftrightarrow \mathbf{x} = \mathbf{y}\). If \(\mathbf{x}\) and \(\mathbf{y}\) are the same, all of their windows will agree; likewise, if all of the windows are the same, \(\mathbf{x}\) and \(\mathbf{y}\) will be the same.
2. Non-negativity: \(d_W(\mathbf{x}, \mathbf{y}) \ge 0\). The smallest number of windows that can disagree between two equal-length bit sequences \(\mathbf{x}\) and \(\mathbf{y}\) is 0.
3. Symmetry: \(d_W(\mathbf{x}, \mathbf{y}) = d_W(\mathbf{y}, \mathbf{x})\). The distance does not depend on which sequence is considered first.
4. Triangle inequality: \(d_W(\mathbf{x}, \mathbf{y}) \le d_W(\mathbf{x}, \mathbf{z}) + d_W(\mathbf{z}, \mathbf{y})\). This measure is a version of Hamming distance, which has been shown to satisfy the triangle inequality [28]. Essentially, if a is the number of components that disagree between \(w_k(\mathbf{x})\) and \(w_k(\mathbf{z})\), and b is the number of windows that disagree between \(w_k(\mathbf{z})\) and \(w_k(\mathbf{y})\), the number of windows that disagree between \(w_k(\mathbf{x})\) and \(w_k(\mathbf{y})\) cannot exceed \(a + b\).
1.4 Appendix A.4: Window-LSH
Our proposed window sampling LSH family (Sect. 5.2) using the window metric rather than ABC distance is \((d_1, d_2, 1 - \frac{d_1}{n - k + 1}, 1 - \frac{d_2}{n - k + 1})\)-sensitive. As was the case with Hamming distance, we normalize the distances \(d_1\) and \(d_2\) by dividing by the maximum distance, \(n - k + 1\), and then subtract from 1 to turn the distance into a probability.
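For illustration, window sampling can be sketched as bit-sampling LSH over the window mapping: each hash function picks a uniformly random window index and outputs the window at that index, so two series collide exactly when those windows match, i.e., with probability \(1 - \frac{d_W}{n - k + 1}\). The Python sketch below is our own illustration under that assumption, not the paper's exact construction from Sect. 5.2:

```python
import random

def sample_window_hash(n, k, seed=None):
    """One LSH function: output the length-k window at a random index.
    Two series collide iff their windows at that index match, which
    happens with probability 1 - d_W(x, y) / (n - k + 1)."""
    i = random.Random(seed).randrange(n - k + 1)  # random window index
    return lambda x: tuple(x[i:i + k])

h = sample_window_hash(n=8, k=3, seed=0)
x = [1, 0, 1, 1, 0, 1, 0, 1]
y = [1, 0, 1, 0, 1, 1, 0, 1]
print(h(x) == h(y))  # whether x and y collide under this hash function
```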
Appendix B: Metric proof of ABC
Here we show that ABC distance satisfies the metric properties and is thus eligible for LSH.
1.1 Appendix B.1: Properties of agreeing runs
We first study the relationship between p, the number of agreeing runs between x and y, and the maximum value of \(k_1 + \cdots + k_p\), the lengths of the p agreeing runs. The maximum sum of all \(k_i\) decreases linearly as p increases.
Lemma 1
(Maximum sum of lengths of p runs \(k_1, \ldots, k_p\)) Given \(\mathbf{x}, \mathbf{y} \in \{0, 1\}^n\) with p agreeing runs, each of length \(k_i\), the maximum sum of the lengths of the p runs decreases linearly in p: \(\sum_{i=1}^p k_i \le n - p + 1\), with equality achievable for every valid p.
Proof
We show that the maximum value of \(\sum _1^p k_i\) must decrease as p increases.
1. If \(p = 1\), then \(k_1 \le n\): if there is a single matching run between \(\mathbf{x}\) and \(\mathbf{y}\), its length can be anywhere from 1 bit to n bits.
2. If \(p = 2\), then \(k_1 + k_2 \le n - 1\). Proof by contradiction: assume \(k_1 + k_2 = n\). Then there is a matching run of \(k_1\) bits between \(\mathbf{x}\) and \(\mathbf{y}\), and the remaining \(n - k_1 = k_2\) bits of \(\mathbf{x}\) and \(\mathbf{y}\) are also a matching run, which means that the two matching runs are consecutive, making them one long run. This means \(p = 1\), contradicting the assumption that \(p = 2\). In other words, \(k_1 + k_2 \le n - 1\) because there must be at least one bit separating the run of length \(k_1\) and the run of length \(k_2\).
3. If \(p = 3\), then \(k_1 + k_2 + k_3 \le n - 2\). This follows from the rule above, since if \(k_1 + k_2 + k_3\) were \(n - 1\), one pair of runs would have to be merged, making \(p = 2\).
4. Following the observations above, \(\sum_{i=1}^p k_i \le n - p + 1\), with equality exactly when each of the \(p - 1\) gaps separating consecutive runs is a single disagreeing bit and the runs cover the rest of the series. Thus, the maximum sum of all \(k_i\) decreases linearly as p increases.
\(\square \)
Lemma 2
(Relationship between p and \(\sum_{i=1}^p k_i\)) Given \(\mathbf{x}, \mathbf{y} \in \{0, 1\}^n\) with p agreeing runs, each of length \(k_i\), we have \(\sum_{i=1}^p k_i \le n - p + 1\), since the number of bits needed to separate the runs increases with p.
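Lemmas 1 and 2 can be checked exhaustively for small n. The following is our own verification sketch (agreeing_runs is a helper name we introduce and reuse in the later sketches):

```python
from itertools import product

def agreeing_runs(x, y):
    """Lengths of the maximal runs of agreeing bits between x and y."""
    runs, length = [], 0
    for xi, yi in zip(x, y):
        if xi == yi:
            length += 1
        else:
            if length:
                runs.append(length)
            length = 0
    if length:
        runs.append(length)
    return runs

# exhaustive check of Lemma 2 for n = 8: sum of run lengths <= n - p + 1
n = 8
for x in product([0, 1], repeat=n):
    for y in product([0, 1], repeat=n):
        runs = agreeing_runs(x, y)
        if runs:
            assert sum(runs) <= n - len(runs) + 1
```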
Next, we investigate the values of \(k_1, \ldots, k_p\) themselves for the "maximum similarity" \(S_p(\mathbf{x}, \mathbf{y})\) given a value of p. First, we observe that to maximize similarity given p, the lengths of the p runs must be
$$k_1 = \cdots = k_{p-1} = 1, \qquad k_p = n - 2p + 2.$$
Intuitively, this observation makes sense because by making \(p - 1\) agreeing runs as short as possible, the length of the p-th run is maximized, adding on the common ratio \((1 + \alpha )\) raised to the greatest exponent possible. Furthermore, note how for any value of p in the above, \(\sum _{i=1}^p k_i = n - p + 1\) because \((p - 1) + n - 2p + 2 = n - p + 1\). This fits with our previous observations: since the similarity is a summation of positive terms, we want to maximize the number of runs and the sum of their lengths.
Lemma 3
(Maximum ABC similarity) Given x, y\(\in \{0, 1\}^n\) with p agreeing runs, each of length \(k_i\), x and y have maximum ABC similarity when they agree on (without loss of generality, the first) \(p-1\) runs of length \(k_1 = \cdots = k_{p-1} = 1\) and one run of length \(k_p= n - 2p + 2\).
Proof
To confirm the intuition, we prove by induction that \(k_1 = \cdots = k_{p - 1} = 1\) and \(k_p = n - 2p + 2\) for maximum similarity given a p and x, y\(\in \{0, 1\}^n\):
1. Base case (\(p = 1\)). The similarity between \(\mathbf{x}\) and \(\mathbf{y}\) is maximized when \(\mathbf{x}\) and \(\mathbf{y}\) are exactly equal and the length of the single run is n. Thus, \(k_p = k_1 = n = (1 - 1) + n - 2(1) + 2 = (p - 1) + n - 2p + 2\).
2. Inductive step. Assume that for some p, the similarity between \(\mathbf{x}\) and \(\mathbf{y}\) is maximized when \(k_1 = \cdots = k_{p - 1} = 1\) and \(k_p = n - 2p + 2\). We show that this implies that for \(p + 1\), the similarity is maximized when \(k_1 = \cdots = k_{p} = 1\) and \(k_{p + 1} = n - 2(p + 1) + 2\), assuming the same constraint \(p + 1 \le \frac{n}{2}\). Since by the inductive hypothesis the similarity with p runs is maximized when \(k_1 = \cdots = k_{p - 1} = 1\) and \(k_{p} = n - 2p + 2\), the best we can do, given that we must add a new run, is to take one bit out of the long run (the run of length \(k_p\)) to contribute to the \((p + 1)\)-th run, and remove another bit from the long run so that the \((p + 1)\)-th run and the long run are not consecutive. Thus we have \(p = (p + 1) - 1\) runs of length 1 and a long run of length \(n - 2p = n - 2(p + 1) + 2\). We have shown that the inductive hypothesis for p implies the hypothesis for \(p + 1\), so by induction the similarity between \(\mathbf{x}\) and \(\mathbf{y}\) is maximized with \(k_1 = \cdots = k_{p - 1} = 1\) and \(k_p = n - 2p + 2\).
\(\square \)
Definition 11
(Maximum ABC similarity given p) Based on Lemma 3, the maximum ABC similarity \(S_p(\mathbf{x}, \mathbf{y})\) between two binary time series \(\mathbf{x}\) and \(\mathbf{y}\) with p agreeing runs is
$$S_p(\mathbf{x}, \mathbf{y}) = \frac{(p - 1)\alpha + (1 + \alpha)^{n - 2p + 2} - 1}{\alpha}.$$
Likewise, the minimum ABC distance between two binary sequences \(\mathbf{x}\) and \(\mathbf{y}\) given p is
$$d_p(\mathbf{x}, \mathbf{y}) = \frac{(1 + \alpha)^n - (1 + \alpha)^{n - 2p + 2}}{\alpha} - (p - 1).$$
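These closed forms can be verified exhaustively for small n. The sketch below is our own check, reusing agreeing_runs from the earlier sketch and assuming the run-based ABC similarity implied by the distance formula in Appendix B.2 (abc_similarity is a name we introduce):

```python
from itertools import product

def abc_similarity(x, y, alpha):
    """Run-based ABC similarity: an agreeing run of length k contributes
    sum_{j=0}^{k-1} (1 + alpha)^j = ((1 + alpha)^k - 1) / alpha."""
    return sum(((1 + alpha)**k - 1) / alpha for k in agreeing_runs(x, y))

n, alpha = 8, 0.5
best = {}  # maximum observed similarity for each run count p
for x in product([0, 1], repeat=n):
    for y in product([0, 1], repeat=n):
        p = len(agreeing_runs(x, y))
        if p > 0:
            best[p] = max(best.get(p, 0.0), abc_similarity(x, y, alpha))

for p, s_max in best.items():
    closed_form = ((p - 1) * alpha + (1 + alpha)**(n - 2*p + 2) - 1) / alpha
    assert abs(s_max - closed_form) < 1e-9
```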
1.2 Appendix B.2: Proving the triangle inequality for ABC distance
Our main result that enables scalable network discovery is that ABC distance is a metric satisfying the triangle inequality.
Proof
We begin with the base case where \(n = 1\): x, y, and z are single bits.
1. x, y, and z are all the same: \(d(\mathbf{x}, \mathbf{y}) = d(\mathbf{x}, \mathbf{z}) = d(\mathbf{z}, \mathbf{y}) = 0\), and \(0 \le 0\).
2. x and y are the same, z is different: \(d(\mathbf{x}, \mathbf{y}) = 0\), \(d(\mathbf{x}, \mathbf{z}) = 1\), and \(d(\mathbf{z}, \mathbf{y}) = 1\); \(0 \le 2\).
3. x and z are the same, y is different: \(d(\mathbf{x}, \mathbf{y}) = 1\), \(d(\mathbf{x}, \mathbf{z}) = 0\), and \(d(\mathbf{z}, \mathbf{y}) = 1\); \(1 \le 1\).
4. y and z are the same, x is different: \(d(\mathbf{x}, \mathbf{y}) = 1\), \(d(\mathbf{x}, \mathbf{z}) = 1\), and \(d(\mathbf{z}, \mathbf{y}) = 0\); \(1 \le 1\).
Next, we move to the inductive step. Assume that \(d(\mathbf{x}, \mathbf{y}) \le d(\mathbf{x}, \mathbf{z}) + d(\mathbf{z}, \mathbf{y})\) for some \(n \ge 1\). We show that this implies that the inequality holds for binarized series of length \(n + 1\), which are constructed by adding a single bit to the end of each of \(\mathbf{x}\), \(\mathbf{y}\), and \(\mathbf{z}\) to create \(\mathbf{x}'\), \(\mathbf{y}'\), and \(\mathbf{z}'\), respectively.
Setup. We begin with preliminaries. First, for \(\mathbf{x}, \mathbf{y} \in \{0, 1\}^n\) with p agreeing runs of lengths \(k_1, \ldots, k_p\), we denote the distance between \(\mathbf{x}\) and \(\mathbf{y}\) as
$$d(\mathbf{x}, \mathbf{y}) = \frac{(1 + \alpha)^n - (1 + \alpha)^{k_1} - \cdots - (1 + \alpha)^{k_p} + p - 1}{\alpha}.$$
Next, we denote the distance between \(\mathbf{x}'\) and \(\mathbf{y}' \in \{0, 1\}^{n + 1}\) as
$$d'(\mathbf{x}', \mathbf{y}') = \frac{(1 + \alpha)^{n + 1} - (1 + \alpha)^{k_1} - \cdots - (1 + \alpha)^{k_{p'}} + p' - 1}{\alpha}.$$
Here, \(p'\) can either be p or \(p+1\): \(p' = p\) in the case that the \(n + 1\)-th bit either appends onto an existing run between x and y, or else disagrees between the two sequences, and \(p' = p + 1\) in the case that the \(n + 1\)-th bit creates a new run of length 1 between x and y.
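For reference, here is a minimal Python sketch of this distance (abc_distance is our own name and the default alpha is arbitrary; agreeing_runs is the helper from the Appendix B.1 sketch):

```python
def abc_distance(x, y, alpha=0.5):
    """ABC distance d(x, y) as written above: the maximum attainable
    similarity ((1+alpha)^n - 1)/alpha minus the run-based similarity."""
    runs = agreeing_runs(x, y)  # helper from the sketch in Appendix B.1
    p, n = len(runs), len(x)
    return ((1 + alpha)**n - sum((1 + alpha)**k for k in runs) + p - 1) / alpha
```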
We now examine how distance between x and y changes by adding a single bit to the end of x and y: in other words, moving from n to \(n + 1\). We denote this change in distance \(\Delta _{(i)}\) for \(i = 1, 2, \text {or } 3\).
(1) Case 1: the \((n + 1)\)-th bits of \(\mathbf{x}\) and \(\mathbf{y}\) agree, creating a new run of length one between the sequences. Here \(p' = p + 1\), so \(k_{p'} = 1\) and \((1 + \alpha)^{k_{p'}} = (1 + \alpha)\).
$$\begin{aligned} \Delta_{(1)}&= d'(\mathbf{x}', \mathbf{y}') - d(\mathbf{x}, \mathbf{y}) \\&= \frac{(1 + \alpha)^{n + 1} - (1 + \alpha)^{k_1} - \cdots - (1 + \alpha)^{k_{p}} - (1 + \alpha) + (p + 1) - 1}{\alpha} \\&\quad -\frac{(1 + \alpha)^n - (1 + \alpha)^{k_1} - \cdots - (1 + \alpha)^{k_p} + p - 1}{\alpha} \\&= \frac{(1 + \alpha)^{n + 1} - (1 + \alpha)^{n} - (1 + \alpha) + 1}{\alpha} \\&= \frac{(1 + \alpha - 1)(1 + \alpha)^{n} - \alpha}{\alpha} \\&= (1 + \alpha)^n - 1 \end{aligned}$$
Intuitively, the maximum similarity \(S(\mathbf{x}, \mathbf{y})\) increases by \((1 + \alpha)^n\), and from this we subtract a new agreeing run of length 1; in other words, we subtract \((1 + \alpha)^0\) from the new maximum similarity, since the exponent of a new run always begins at 0. Thus, overall the distance changes by \((1 + \alpha)^n - (1 + \alpha)^0 = (1 + \alpha)^n - 1\).
(2) Case 2: the \((n + 1)\)-th bits of \(\mathbf{x}\) and \(\mathbf{y}\) agree, appending onto an existing run of length \(k_p\) between the sequences. Here \(p' = p\) and \(k_{p'} = k_{p} + 1\), so \((1 + \alpha)^{k_{p'}} = (1 + \alpha)^{k_{p} + 1}\).
$$\begin{aligned} \Delta_{(2)}&= d'(\mathbf{x}', \mathbf{y}') - d(\mathbf{x}, \mathbf{y}) \\&= \frac{(1 + \alpha)^{n + 1} - (1 + \alpha)^{k_1} - \cdots - (1 + \alpha)^{k_{p} + 1} + p - 1}{\alpha} \\&\quad - \frac{(1 + \alpha)^n - (1 + \alpha)^{k_1} - \cdots - (1 + \alpha)^{k_p} + p - 1}{\alpha} \\&= \frac{(1 + \alpha)^{n + 1} - (1 + \alpha)^{n} - (1 + \alpha)^{k_{p} + 1} + (1 + \alpha)^{k_p}}{\alpha} \\&= \frac{(1 + \alpha - 1)(1 + \alpha)^{n} - (1 + \alpha - 1)(1 + \alpha)^{k_p}}{\alpha} \\&= (1 + \alpha)^n - (1 + \alpha)^{k_p} \end{aligned}$$
(3) Case 3: the \((n + 1)\)-th bits of \(\mathbf{x}\) and \(\mathbf{y}\) disagree. Here \(p' = p\) and \(k_{p'} = k_p\).
$$\begin{aligned} \Delta_{(3)}&= d'(\mathbf{x}', \mathbf{y}') - d(\mathbf{x}, \mathbf{y}) \\&= \frac{(1 + \alpha)^{n + 1} - (1 + \alpha)^{k_1} - \cdots - (1 + \alpha)^{k_{p}} + p - 1}{\alpha} \\&\quad - \frac{(1 + \alpha)^n - (1 + \alpha)^{k_1} - \cdots - (1 + \alpha)^{k_p} + p - 1}{\alpha} \\&= \frac{(1 + \alpha)^{n + 1} - (1 + \alpha)^{n}}{\alpha} \\&= \frac{(1 + \alpha - 1)(1 + \alpha)^{n}}{\alpha} \\&= (1 + \alpha)^n \end{aligned}$$
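The three deltas can be confirmed numerically with the abc_distance sketch above (our own check; the example sequences are arbitrary):

```python
a, n = 0.5, 3
x, y = [1, 0, 1], [1, 0, 0]  # last bits disagree; one agreeing run of length 2

# case 1: appended bits agree and start a new run -> delta = (1 + a)^n - 1
assert abs(abc_distance(x + [1], y + [1], a) - abc_distance(x, y, a)
           - ((1 + a)**n - 1)) < 1e-9

# case 2: appended bits agree and extend the final run (k_p = 2 here)
#         -> delta = (1 + a)^n - (1 + a)^{k_p}
u, v = [0, 1, 1], [1, 1, 1]
assert abs(abc_distance(u + [1], v + [1], a) - abc_distance(u, v, a)
           - ((1 + a)**n - (1 + a)**2)) < 1e-9

# case 3: appended bits disagree -> delta = (1 + a)^n
assert abs(abc_distance(x + [1], y + [0], a) - abc_distance(x, y, a)
           - (1 + a)**n) < 1e-9
```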
Enumeration of cases With the preliminaries established, we list all cases that can occur when we add a single bit to x, y, and z to obtain \(\mathbf x {}'{}\), \(\mathbf y {}'{}\), and \(\mathbf z {}'{}\).
Let n stand for “new run” (case (1) above with \(\Delta _{(1)}\)), a stand for “append to existing run” (case (2) with \(\Delta _{(2)}\)), and d stand for “disagree” (case (3) with \(\Delta _{(3)}\)). There are \(3 \times 3 \times 3\) length-3 permutations with repetition of n, a, and d for three series \(\mathbf x {}'{}\), \(\mathbf y {}'{}\), and \(\mathbf z {}'{}\), although not all are possible in practice: in fact, only 10 out of the 27 cases are feasible. In Table 6 we either explain why each case is impossible or else show that the triangle inequality holds.
We have shown by induction on n, the length of the binary subsequences compared, that the triangle inequality holds for the ABC distance measure because the amounts by which the distances change for any combination of new bits appended to x, y, and z satisfy the triangle inequality. Thus the ABC distance satisfies the metric properties. \(\square \)
Cite this article
Safavi, T., Sripada, C. & Koutra, D. Fast network discovery on sequence data via time-aware hashing. Knowl Inf Syst 61, 987–1017 (2019). https://doi.org/10.1007/s10115-018-1293-8