[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Property Suffix Array with Applications in Indexing Weighted Sequences

Published: 03 May 2020 Publication History

Abstract

The suffix array is one of the most prevalent data structures for string indexing; it stores the lexicographically sorted list of suffixes of a given string. Its practical advantage compared to the suffix tree is space efficiency. In Property Indexing, we are given a string x of length n and a property Π, i.e., a set of Π-valid intervals over x so that a pattern p occurs in x if and only if x has an occurrence of p that lies entirely within an interval of Π. A suffix-tree-like index over the valid prefixes of suffixes of x can be built in time and space O(n). We show here how to directly build a suffix-array-like index, the Property Suffix Array (PSA), in time and space O(n). We mainly draw our motivation from weighted (probabilistic) sequences: sequences of probability distributions over a given alphabet. Given a probability threshold 1/z, we say that a string p of length m matches a weighted sequence x of length n at starting position i if the product of probabilities of the letters of p at positions i, …, i+m − 1 in x is at least 1/z. Our algorithm for building the PSA can be directly applied to build an O(nz)-sized suffix-array-like index over x in time and space O(nz). Finally, we present extensive experimental results showing that our new indexing data structure is well suited for real-world applications.

References

[1]
Adam Auton, Lisa D. Brooks, Richard M. Durbin, Erik P. Garrison, Hyun Min M. Kang, Jan O. Korbel, Jonathan L. Marchini, Shane McCarthy, Gil A. McVean, and Gonçalo R. Abecasis (1000 Genomes Project Consortium). 2015. A global reference for human genetic variation. Nature 526, 7571 (1 Oct. 2015), 68--74.
[2]
Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. 2004. Replacing suffix trees with enhanced suffix arrays. J. Discr. Algor. 2, 1 (2004), 53--86.
[3]
Charu C. Aggarwal and Philip S. Yu. 2009. A survey of uncertain data algorithms and applications. IEEE Trans. Knowl. Data Eng. 21, 5 (2009), 609--623.
[4]
Hayam Alamro, Golnaz Badkobeh, Djamal Belazzougui, Costas S. Iliopoulos, and Simon J. Puglisi. 2019. Computing the antiperiod(s) of a string. In Proceedings of the 30th Annual Symposium on Combinatorial Pattern Matching (CPM’19), Nadia Pisanti and Solon P. Pissis (Eds.), LIPIcs, Vol. 128. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 32:1--32:11.
[5]
Mai Alzamel, Panagiotis Charalampopoulos, Costas S. Iliopoulos, and Solon P. Pissis. 2017. How to answer a small batch of RMQs or LCA queries in practice. In Proceedings of the International Workshop on Combinatorial Algorithms (IWOCA’17). 343--355.
[6]
Amihood Amir, Eran Chencinski, Costas S. Iliopoulos, Tsvi Kopelowitz, and Hui Zhang. 2008. Property matching and weighted matching. Theor. Comput. Sci. 395, 2--3 (2008), 298--310.
[7]
Carl Barton, Tomasz Kociumaka, Chang Liu, Solon P. Pissis, and Jakub Radoszewski. 2020. Indexing weighted sequences: Neat and efficient. Inf. Comput. 270 (2020), 104462.
[8]
Carl Barton, Tomasz Kociumaka, Solon P. Pissis, and Jakub Radoszewski. 2016a. Efficient index for weighted sequences. In Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM’16). 4:1--4:13.
[9]
Carl Barton, Chang Liu, and Solon P. Pissis. 2016b. On-line pattern matching on uncertain sequences and applications. In Proceedings of the Annual International Conference on Combinatorial Optimization and Applications (COCOA’16). 547--562.
[10]
Carl Barton, Chang Liu, and Solon P. Pissis. 2018. Fast average-case pattern matching on weighted sequences. Int. J. Found. Comput. Sci. 29, 8 (2018), 1331--1343.
[11]
Niklas Baumstark, Simon Gog, Tobias Heuer, and Julian Labeit. 2017. Practical range minimum queries revisited. In Proceedings of the 16th International Symposium on Experimental Algorithms (SEA’17), Costas S. Iliopoulos, Solon P. Pissis, Simon J. Puglisi, and Rajeev Raman (Eds.), LIPIcs, Vol. 75. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 12:1--12:16.
[12]
Michael A. Bender, Martin Farach-Colton, Giridhar Pemmasani, Steven Skiena, and Pavel Sumazin. 2005. Lowest common ancestors in trees and directed acyclic graphs. J. Algor. 57, 2 (2005), 75--94.
[13]
Sudip Biswas, Manish Patil, Sharma V. Thankachan, and Rahul Shah. 2016. Probabilistic threshold indexing for uncertain strings. In Proceedings of the Extended Database Technology Conference (EDBT’16). 401--412.
[14]
Panagiotis Charalampopoulos, Costas S. Iliopoulos, Chang Liu, and Solon P. Pissis. 2018. Property suffix array with applications. In Proceedings of the Theoretical Informatics (LATIN’18). 290--302.
[15]
Panagiotis Charalampopoulos, Costas S. Iliopoulos, Solon P. Pissis, and Jakub Radoszewski. 2019. On-line weighted pattern matching. Inf. Comput. 266 (2019), 49--59.
[16]
Richard Cole. 1988. Parallel merge sort. SIAM J. Comput. 17, 4 (1988), 770--785.
[17]
Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, and Charles E. Leiserson. 2001. Introduction to Algorithms (2nd ed.). McGraw--Hill Higher Education.
[18]
Maxime Crochemore, Costas S. Iliopoulos, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, Krzysztof Stencel, and Tomasz Walen. 2014. New simple efficient algorithms computing powers and runs in strings. Discr. Appl. Math. 163, 3 (2014), 258--267.
[19]
Héctor Ferrada and Gonzalo Navarro. 2017. Improved range minimum queries. J. Discr. Algor. 43 (2017), 72--80.
[20]
Johannes Fischer and Volker Heun. 2011. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40, 2 (2011), 465--492.
[21]
Harold N. Gabow and Robert Endre Tarjan. 1985. A linear-time algorithm for a special case of disjoint set union. J. Comput. Syst. Sci. 30, 2 (1985), 209--221.
[22]
Wing-Kai Hon, Manish Patil, Rahul Shah, and Sharma V. Thankachan. 2013. Compressed property suffix trees. Inf. Comput. 232 (2013), 10--18.
[23]
Costas S. Iliopoulos and Mohammad Sohel Rahman. 2008. Faster index for property matching. Inf. Process. Lett. 105, 6 (2008), 218--223.
[24]
M. T. Juan, J. J. Liu, and Y. L. Wang. 2009. Errata for “Faster index for property matching.” Inf. Process. Lett. 109, 18 (2009), 1027--1029.
[25]
Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi, and Bella Zhukova. 2017. Engineering external memory induced suffix sorting. In Proceedings of the 19th Workshop on Algorithm Engineering and Experiments (ALENEX’17), Sándor P. Fekete and Vijaya Ramachandran (Eds.). SIAM, 98--108.
[26]
Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi. 2009. Permuted longest-common-prefix array. In Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM’09), Gregory Kucherov and Esko Ukkonen (Eds.), Lecture Notes in Computer Science, Vol. 5577. Springer, 181--192.
[27]
Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. 2006. Linear work suffix array construction. J. ACM 53, 6 (2006), 918--936.
[28]
Samuel Karlin, Ghassan Ghandour, Friedemann Ost amd Simon Tavare, and Laurence J. Korn. 1983. New approaches for computer analysis of nucleic acid sequences. Proc. Natl. Acad. Sci. U.S.A. 80, 18 (1983), 5660--5664.
[29]
Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park. 2001. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM’01). 181--192.
[30]
Tomasz Kociumaka, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. 2011. A linear time algorithm for seeds computation. CoRR arXiv:1107.2422v2 (2011). https://arxiv.org/abs/1107.2422v2
[31]
Tomasz Kociumaka, Solon P. Pissis, and Jakub Radoszewski. 2019. Pattern matching and consensus problems on weighted sequences and profiles. Theory Comput. Syst. 63, 3 (2019), 506--542.
[32]
Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. 2018. Efficient algorithms for shortest partial seeds in words. Theor. Comput. Sci. 710 (2018), 139--147.
[33]
Tsvi Kopelowitz. 2016. The property suffix tree with dynamic properties. Theor. Comput. Sci. 638 (2016), 44--51.
[34]
Udi Manber and Eugene W. Myers. 1993. Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22, 5 (1993), 935--948.
[35]
Ge Nong, Sen Zhang, and Wai Hong Chan. 2009. Linear suffix array construction by almost pure induced-sorting. In Proceedings of the Data Compression Conference (DCC’09). 193--202.
[36]
Simon J. Puglisi, William F. Smyth, and Andrew Turpin. 2007. A taxonomy of suffix array construction algorithms. ACM Comput. Surv. 39, 2 (2007), 4.
[37]
Jakub Radoszewski and Tatiana A. Starikovskaya. 2017. Streaming K-mismatch with error correcting and applications. In Proceedings of the Data Compression Conference (DCC’17). 290--299.
[38]
Peter Weiner. 1973. Linear pattern matching algorithms. In Proceedings of the Symposium on Switching and Automata Theory (SWAT’73 (FOCS’73)). 1--11.

Cited By

View all
  • (2024)Space-Efficient Indexes for Uncertain Strings2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00367(4828-4842)Online publication date: 13-May-2024
  • (2024)Elastic-Degenerate String Matching with 1 Error or MismatchTheory of Computing Systems10.1007/s00224-024-10194-868:5(1442-1467)Online publication date: 1-Oct-2024
  • (2023)Bidirectional String Anchors for Improved Text Indexing and Top-$K$ Similarity SearchIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.323178035:11(11093-11111)Online publication date: 1-Nov-2023
  • Show More Cited By

Index Terms

  1. Property Suffix Array with Applications in Indexing Weighted Sequences

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Journal of Experimental Algorithmics
    ACM Journal of Experimental Algorithmics  Volume 25, Issue
    Special Issue ALENEX 2018 and Regular Papers
    2020
    313 pages
    ISSN:1084-6654
    EISSN:1084-6654
    DOI:10.1145/3388470
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 May 2020
    Accepted: 01 February 2020
    Revised: 01 February 2020
    Received: 01 June 2019
    Published in JEA Volume 25

    Author Tags

    1. Suffix tree
    2. property suffix array
    3. property suffix tree
    4. suffix array
    5. weighted sequence

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • European Research Council
    • A.G. Leventis Foundation educational

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)13
    • Downloads (Last 6 weeks)2
    Reflects downloads up to 13 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Space-Efficient Indexes for Uncertain Strings2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00367(4828-4842)Online publication date: 13-May-2024
    • (2024)Elastic-Degenerate String Matching with 1 Error or MismatchTheory of Computing Systems10.1007/s00224-024-10194-868:5(1442-1467)Online publication date: 1-Oct-2024
    • (2023)Bidirectional String Anchors for Improved Text Indexing and Top-$K$ Similarity SearchIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.323178035:11(11093-11111)Online publication date: 1-Nov-2023
    • (2022)Elastic-Degenerate String Matching with 1 ErrorLATIN 2022: Theoretical Informatics10.1007/978-3-031-20624-5_2(20-37)Online publication date: 7-Nov-2022
    • (2021)Reverse-Safe Text IndexingACM Journal of Experimental Algorithmics10.1145/346169826(1-26)Online publication date: 9-Jul-2021
    • (2020)Efficient Enumeration of Distinct Factors Using Package RepresentationsString Processing and Information Retrieval10.1007/978-3-030-59212-7_18(247-261)Online publication date: 17-Sep-2020

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    HTML Format

    View this article in HTML Format.

    HTML Format

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media