Abstract
Approximate string matching over suffix tree with depth-first search (ASM_ST_DFS), a classical algorithm in the field of approximate string matching, was originally proposed by Ricardo A. Baeza-Yates and Gaston H. Gonnet in 1990. The algorithm is one of the most excellent algorithms for approximate string matching if combined with other indexing techniques. However, its time complexity is sensitive to the length of pattern string because it searches \(m+k\) characters on each path from the root before backtracking. In this paper, we propose an efficient pruning strategy to solve this problem. We prove its correctness and efficiency in theory. Particularly, we proved that if the pruning strategy is adopted, it averagely searches O(k) characters on each path before backtracking instead of O(m). Considering each internal node of suffix tree has multiple branches, the pruning strategy should work very well. We also experimentally show that when k is much smaller than m, the efficiency improves hundreds of times, and when k is not much smaller than m, it is still several times faster. This is the first paper that tries to solve the backtracking problem of ASM_ST_DFS in both theory and practice.
Similar content being viewed by others
References
Baeza-Yates RA, Gonnet GH (1999) A fast algorithm on average for all-against-all sequence matching. In: SPIRE/CRIWG. pp 16–23
Baeza-Yates RA, Navarro G (1999) Faster approximate string matching. Algorithmica 23(2):127–158
Boytsov L (2011) Indexing methods for approximate dictionary searching: comparative analysis. ACM J Exp Algorithmics 16:1.1:1.1–1.1:1.91. doi:10.1145/1963190.1963191
Bunke H, Bühler U (1993) Applications of approximate string matching to 2d shape recognition. Pattern Recognit 26(12):1797–1812
Chang WI, Lampe J (1992) Theoretical and empirical comparisons of approximate string matching algorithms. In: Combinatorial pattern matching, third annual symposium, CPM 92, Tucson, Arizona, USA, April 29–May 1 1992, Proceedings, pp 175–184
Danek A, Deorowicz S, Grabowski S (2014) Indexing large genome collections on a PC. In: CoRR. arXiv:1403.7481
French JC, Powell AL, Schulman E (1997) Applications of approximate word matching in information retrieval. In: Proceedings of the sixth international conference on information and knowledge management (CIKM’97), Las Vegas, Nevada, November 10–14 1997, pp 9–15
Hamming RW (1950) Error detecting and error correcting codes. Bell Syst Tech J 29(2):147–160
Hyyrö H, Fredriksson K, Navarro G (2005) Increased bit-parallelism for approximate and multiple string matching. J Exp Algorithmics 10:2.6. doi:10.1145/1064546.1180617
Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10:707–710
Mitra S, Acharya T, Luo J (2006) Data mining: multimedia, soft computing, and bioinformatics. J. Electron Imaging 15(1):019901
Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88
Navarro G, Baeza-Yates R (2000) A hybrid indexing method for approximate string matching. J Discrete Algorithms 1(1):205–239
Navarro G, Raffinot M (2002) Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences. Cambridge University Press, Cambridge
Sellers PH (1980) The theory and computation of evolutionary distances: pattern recognition. J Algorithms 1(4):359–373
Typke R, Wiering F, Veltkamp RC (2005) A survey of music information retrieval systems. In: ISMIR 2005, 6th international conference on music information retrieval, London, UK, 11–15 Sept 2005, Proceedings, pp 153–160
Ukkonen E (1985) Finding approximate patterns in strings. J Algorithms 6(1):132–137
Ukkonen E (1993) Approximate string-matching over suffix trees. In: Combinatorial pattern matching, 4th annual symposium, CPM 93, Padova, Italy, 2–4 June 1993, Proceedings, pp 228–242
Ukkonen E (1995) On-line construction of suffix trees. Algorithmica 14(3):249–260
Wandelt S, Starlinger J, Bux M, Leser U (2013) RCSI: scalable similarity search in thousand(s) of genomes. PVLDB 6(13):1534–1545
Wu S, Manber U (1992) Fast text searching allowing errors. Commun ACM 35(10):83–91
Yang X, Wang B, Li C, Wang J, Xie X (2013) Efficient direct search on compressed genomic data. In: 29th IEEE international conference on data engineering, ICDE 2013, Brisbane, Australia, 8–12 April 2013, pp 961–972
Zhang Q, Chamberlain RD, Indeck RS, West BM, White J (2004) Massively parallel data mining using reconfigurable hardware: approximate string matching. In: 18th International parallel and distributed processing symposium (IPDPS 2004) CD-ROM/Abstracts proceedings, 26–30 April 2004. Santa Fe, New Mexico, USA
Zobel J, Dart PW (1996) Phonetic string matching: lessons from information retrieval. In: Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR’96, 18–22 August 1996, Zurich, Switzerland (Special Issue of the SIGIR Forum), pp 166–172
Acknowledgments
This paper was partially supported by NGFR 973 Grant 2012CB316200, NSFC Grant 61472099, 61133002 and National Sci-Tech Support Plan 2015BAH10F00. Partial Experiments are conducted on the Xing cloud of the DEKE Lab, Renmin University of China.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Hu, H., Wang, H., Li, J. et al. An efficient pruning strategy for approximate string matching over suffix tree. Knowl Inf Syst 49, 121–141 (2016). https://doi.org/10.1007/s10115-015-0896-6
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-015-0896-6