An efficient pruning strategy for approximate string matching over suffix tree

Huan Hu¹,
Hongzhi Wang¹,
Jianzhong Li¹ &
…
Hong Gao¹

393 Accesses
4 Citations
Explore all metrics

Abstract

Approximate string matching over suffix tree with depth-first search (ASM_ST_DFS), a classical algorithm in the field of approximate string matching, was originally proposed by Ricardo A. Baeza-Yates and Gaston H. Gonnet in 1990. The algorithm is one of the most excellent algorithms for approximate string matching if combined with other indexing techniques. However, its time complexity is sensitive to the length of pattern string because it searches \(m+k\) characters on each path from the root before backtracking. In this paper, we propose an efficient pruning strategy to solve this problem. We prove its correctness and efficiency in theory. Particularly, we proved that if the pruning strategy is adopted, it averagely searches O(k) characters on each path before backtracking instead of O(m). Considering each internal node of suffix tree has multiple branches, the pruning strategy should work very well. We also experimentally show that when k is much smaller than m, the efficiency improves hundreds of times, and when k is not much smaller than m, it is still several times faster. This is the first paper that tries to solve the backtracking problem of ASM_ST_DFS in both theory and practice.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

£29.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Institutional subscriptions

Sequential Representation of Suffix Trie: An Empirical Evaluation

On Suffix Tree Breadth

On Suffix Tree Detection

References

Baeza-Yates RA, Gonnet GH (1999) A fast algorithm on average for all-against-all sequence matching. In: SPIRE/CRIWG. pp 16–23
Baeza-Yates RA, Navarro G (1999) Faster approximate string matching. Algorithmica 23(2):127–158
Article MathSciNet MATH Google Scholar
Boytsov L (2011) Indexing methods for approximate dictionary searching: comparative analysis. ACM J Exp Algorithmics 16:1.1:1.1–1.1:1.91. doi:10.1145/1963190.1963191
Bunke H, Bühler U (1993) Applications of approximate string matching to 2d shape recognition. Pattern Recognit 26(12):1797–1812
Article Google Scholar
Chang WI, Lampe J (1992) Theoretical and empirical comparisons of approximate string matching algorithms. In: Combinatorial pattern matching, third annual symposium, CPM 92, Tucson, Arizona, USA, April 29–May 1 1992, Proceedings, pp 175–184
Danek A, Deorowicz S, Grabowski S (2014) Indexing large genome collections on a PC. In: CoRR. arXiv:1403.7481
French JC, Powell AL, Schulman E (1997) Applications of approximate word matching in information retrieval. In: Proceedings of the sixth international conference on information and knowledge management (CIKM’97), Las Vegas, Nevada, November 10–14 1997, pp 9–15
Hamming RW (1950) Error detecting and error correcting codes. Bell Syst Tech J 29(2):147–160
Article MathSciNet Google Scholar
Hyyrö H, Fredriksson K, Navarro G (2005) Increased bit-parallelism for approximate and multiple string matching. J Exp Algorithmics 10:2.6. doi:10.1145/1064546.1180617
Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10:707–710
MathSciNet MATH Google Scholar
Mitra S, Acharya T, Luo J (2006) Data mining: multimedia, soft computing, and bioinformatics. J. Electron Imaging 15(1):019901
Article Google Scholar
Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88
Article Google Scholar
Navarro G, Baeza-Yates R (2000) A hybrid indexing method for approximate string matching. J Discrete Algorithms 1(1):205–239
MathSciNet Google Scholar
Navarro G, Raffinot M (2002) Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences. Cambridge University Press, Cambridge
Book MATH Google Scholar
Sellers PH (1980) The theory and computation of evolutionary distances: pattern recognition. J Algorithms 1(4):359–373
Article MathSciNet MATH Google Scholar
Typke R, Wiering F, Veltkamp RC (2005) A survey of music information retrieval systems. In: ISMIR 2005, 6th international conference on music information retrieval, London, UK, 11–15 Sept 2005, Proceedings, pp 153–160
Ukkonen E (1985) Finding approximate patterns in strings. J Algorithms 6(1):132–137
Article MathSciNet MATH Google Scholar
Ukkonen E (1993) Approximate string-matching over suffix trees. In: Combinatorial pattern matching, 4th annual symposium, CPM 93, Padova, Italy, 2–4 June 1993, Proceedings, pp 228–242
Ukkonen E (1995) On-line construction of suffix trees. Algorithmica 14(3):249–260
Article MathSciNet MATH Google Scholar
Wandelt S, Starlinger J, Bux M, Leser U (2013) RCSI: scalable similarity search in thousand(s) of genomes. PVLDB 6(13):1534–1545
Google Scholar
Wu S, Manber U (1992) Fast text searching allowing errors. Commun ACM 35(10):83–91
Article Google Scholar
Yang X, Wang B, Li C, Wang J, Xie X (2013) Efficient direct search on compressed genomic data. In: 29th IEEE international conference on data engineering, ICDE 2013, Brisbane, Australia, 8–12 April 2013, pp 961–972
Zhang Q, Chamberlain RD, Indeck RS, West BM, White J (2004) Massively parallel data mining using reconfigurable hardware: approximate string matching. In: 18th International parallel and distributed processing symposium (IPDPS 2004) CD-ROM/Abstracts proceedings, 26–30 April 2004. Santa Fe, New Mexico, USA
Zobel J, Dart PW (1996) Phonetic string matching: lessons from information retrieval. In: Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR’96, 18–22 August 1996, Zurich, Switzerland (Special Issue of the SIGIR Forum), pp 166–172

Download references

Acknowledgments

This paper was partially supported by NGFR 973 Grant 2012CB316200, NSFC Grant 61472099, 61133002 and National Sci-Tech Support Plan 2015BAH10F00. Partial Experiments are conducted on the Xing cloud of the DEKE Lab, Renmin University of China.

Author information

Authors and Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Huan Hu, Hongzhi Wang, Jianzhong Li & Hong Gao

Authors

Huan Hu
View author publications
You can also search for this author in PubMed Google Scholar
Hongzhi Wang
View author publications
You can also search for this author in PubMed Google Scholar
Jianzhong Li
View author publications
You can also search for this author in PubMed Google Scholar
Hong Gao
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hongzhi Wang.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Hu, H., Wang, H., Li, J. et al. An efficient pruning strategy for approximate string matching over suffix tree. Knowl Inf Syst 49, 121–141 (2016). https://doi.org/10.1007/s10115-015-0896-6

Download citation

Received: 14 February 2015
Revised: 03 August 2015
Accepted: 10 October 2015
Published: 06 November 2015
Issue Date: October 2016
DOI: https://doi.org/10.1007/s10115-015-0896-6

An efficient pruning strategy for approximate string matching over suffix tree

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Sequential Representation of Suffix Trie: An Empirical Evaluation

On Suffix Tree Breadth

On Suffix Tree Detection

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

An efficient pruning strategy for approximate string matching over suffix tree

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Sequential Representation of Suffix Trie: An Empirical Evaluation

On Suffix Tree Breadth

On Suffix Tree Detection

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation