Abstract
The context in which a substring appears is an important notion to identify – for example – its semantic meaning. However, existing classes of repeats fail to take this into account directly. We present here xkcd-repeats, a new family of repeats characterized by the number of different symbols at the left and right of their occurrences. These repeats include as special extreme cases maximal and super-maximal repeats.
We give sufficient and necessary condition to bound their number linearly in the size of the sequence, and show an optimal algorithm that computes them in linear time – given a suffix array –, independent on the size of the alphabet, as well as two other algorithms that are faster in practice.
Additionally, we provide an independent and general framework that allows to compute these (and other) repeats incrementally; extending the application space of repeats in a streaming framework.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2, 53–86 (2004)
Apostolico, A.: Of maps bigger than the empire (invited paper). In: SPIRE, pp. 2–9 (2001)
Bose, P., He, M., Maheshwari, A., Morin, P.: Succinct Orthogonal Range Search Structures on a Grid with Applications to Text Indexing. In: Dehne, F., Gavrilova, M., Sack, J.-R., Tóth, C.D. (eds.) WADS 2009. LNCS, vol. 5664, pp. 98–109. Springer, Heidelberg (2009)
Carrascosa, R., Coste, F., Gallé, M., Infante-Lopez, G.: The smallest grammar problem as constituents choice and minimal grammar parsing. MDPI Algorithms 4(4), 262–284 (2011)
Clark, A.: Learning deterministic context free grammars: The Omphalos competition. Machine Learning, 93–110 (January 2007)
Clark, A., Eyraud, R., Habrard, A.: A polynomial algorithm for the inference of context free languages. In: ICGI, pp. 29–42 (July 2008)
Fenwick, P.M.: A New Data Structure for Cumulative Frequency Tables. Softw. Pract. Exper. 24, 327–336 (1994)
Gallé, M.: Searching for Compact Hierarchical Structures in DNA by means of the Smallest Grammar Problem. Université de Rennes 1 (February 2011)
Gallé, M.: The bag-of-repeats representation of documents. In: SIGIR, pp. 1053–1056 (2013)
Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press (January 1997)
Manning, C., Raghavan, P., Schütze, H.: Introduction to Inf Retrieval. Cambridge UP (2009)
Navarro, G.: Spaces, Trees and Colors: The Algorithmic Landscape of Document Retrieval on Sequences. arXiv (2013)
Ohlebusch, E., Beller, T., Abouelhoda, M.I.: Computing the Burrows Wheeler transform of a string and its reverse in parallel. Journal of Discrete Algorithms 1, 1–13 (2013), http://linkinghub.elsevier.com/retrieve/pii/S1570866713000397
Puglisi, S., Smyth, W.F., Turpin, A.: A taxonomy of suffix array construction algorithms. ACM Computing Surveys 39(2) (July 2007), http://portal.acm.org/citation.cfm?id=1242471.1242472
Puglisi, S.J., Smyth, W.F., Yusufu, M.: Fast optimal algorithms for computing all the repeats in a string. In: Prague Stringology Conference, pp. 161–169 (2008)
Schütze, H.: Automatic Word Sense Discrimination. Comput. Ling. 24(1) (1998)
Solan, Z., Horn, D., Ruppin, E., Edelman, S.: Unsupervised learning of natural languages. PNAS, 11629–11634 (January 2005)
Ukkonen, E.: Online construction of suffix trees. Algorithmica 14, 249–260 (1995)
van Zaanen, M.: ABL: Alignment-based learning. In: International Conference on Computational Linguistics, pp. 961–967 (2000)
Zhang, S., Nong, G., Chan, W.H.: Fast and space efficient linear suffix array construction. In: DCC. IEEE Computer Society, Washington, DC (2008)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Gallé, M., Tealdi, M. (2014). On Context-Diverse Repeats and Their Incremental Computation. In: Dediu, AH., Martín-Vide, C., Sierra-Rodríguez, JL., Truthe, B. (eds) Language and Automata Theory and Applications. LATA 2014. Lecture Notes in Computer Science, vol 8370. Springer, Cham. https://doi.org/10.1007/978-3-319-04921-2_31
Download citation
DOI: https://doi.org/10.1007/978-3-319-04921-2_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-04920-5
Online ISBN: 978-3-319-04921-2
eBook Packages: Computer ScienceComputer Science (R0)