Abstract
In this paper, we investigate the q -gram distance for ordered unlabeled trees (trees, for short). First, we formulate a q-gram as simply a tree with q nodes isomorphic to a line graph, and the q-gram distance between two trees as similar as one between two strings. Then, by using the depth sequence based on postorder, we design the algorithm EnumGram to enumerate all q-grams in a tree T with n nodes which runs in O(n 2) time and in O(q) space. Furthermore, we improve it to the algorithm LinearEnumGram which runs in O(qn) time and in O(qd) space, where d is the depth of T. Hence, we can evaluate the q-gram distance D q (T 1,T 2) between T 1 and T 2 in O(q maxn 1, n 2) time and in O(q maxd 1, d 2) space, where n i and d i are the number of nodes in T i and the depth of T i , respectively. Finally, we show the relationship between the q-gram distance D q (T 1,T 2) and the edit distanceE(T 1,T 2) that D q (T 1,T 2)≤ (gl+1)E(T 1,T 2), where g = max{g 1, g 2}, l = max{l 1, l 2}, g i is the degree of T i and l i is the number of leaves in T i . In particular, for the top-down tree edit distanceF(T 1,T 2), this result implies that \(D_{q}(T_{1}, T_{2}) \leq {\rm min}\{g^{q-2}, l - 1\}\{F(T_{1}, T_{2})\}\).
This work is partially supported by Grand-in-Aid for Scientific Research 15700137, 16016275 and 17700138 from the Ministry of Education, Culture, Sports, Science and Technology, Japan.
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Asai, T., Abe, K., Kawazoe, S., Arimura, H., Sakamoto, H., Arikawa, S.: Efficient substructure discovery from large semi-structured data. In: Proc. SDM 2002 (2002)
Asai, T., Arimura, H., Nakano, S., Uno, T.: Discovering frequent substructures in large unordered trees. In: Grieser, G., Tanaka, Y., Yamamoto, A. (eds.) DS 2003. LNCS (LNAI), vol. 2843, pp. 47–61. Springer, Heidelberg (2003)
Bille, P.: A survey on tree edit distance and related problems. Theor. Comput. Sci. 337, 217–239 (2005)
Burkhardt, S., Karkkainen, J.: Better filtering with gapped q-grams. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 73–85. Springer, Heidelberg (2001)
Furukawa, K., Uchida, T., Yamada, K., Miyahara, T., Shoudai, T., Nakamura, Y.: Extracting characteristic structures among words in semistructured documents. In: Chen, M.-S., Yu, P.S., Liu, B. (eds.) PAKDD 2002. LNCS (LNAI), vol. 2336, pp. 351–360. Springer, Heidelberg (2002)
Garofalakis, M., Kumar, A.: Correlating XML data streams using tree-edit distance embeddings. In: Proc. PODS 2003, pp. 143–154 (2003)
Ikeda, D., Yamada, Y., Hirokawa, S.: Eliminating useless parts in semi-structured documents using alternation counts. In: Jantke, K.P., Shinohara, A. (eds.) DS 2001. LNCS (LNAI), vol. 2226, pp. 113–127. Springer, Heidelberg (2001)
Jokinen, P., Ukkonen, E.: Two algorithms for approximate string matching in static texts. In: Tarlecki, A. (ed.) MFCS 1991. LNCS, vol. 520, pp. 240–248. Springer, Heidelberg (1991)
Kuboyama, T., Shin, K., Miyahara, T., Yasuda, H.: A theoretical analysis of alignment and edit problems for trees. In: Coppo, M., Lodi, E., Pinna, G.M. (eds.) ICTCS 2005. LNCS, vol. 3701, pp. 323–337. Springer, Heidelberg (2005)
Nakano, S., Uno, T.: Efficient generation of rooted trees. National Institute of Informatics Technical Report NII-2003-005E (2003)
Nakano, S., Uno, T.: Constant time generation of trees with specified diameter. In: Hromkovič, J., Nagl, M., Westfechtel, B. (eds.) WG 2004. LNCS, vol. 3353, pp. 33–45. Springer, Heidelberg (2004)
Navarro, G., Sutinen, E., Tanninen, J., Tarhio, J.: Indexing text with approximate q-grams. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 350–363. Springer, Heidelberg (2000)
Selkow, S.M.: The tree-to-tree editing problem. Inform. Proc. Let. 6, 184–186 (1997)
Shasha, D., Zhang, K.: Fast algorithms for the unit cost edit distance between trees. J. Algo. 11, 581–621 (1990)
Uchida, T., Mogawa, T., Nakamura, Y.: Finding frequent structural features among words in tree-structured documents. In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS (LNAI), vol. 3056, pp. 351–360. Springer, Heidelberg (2004)
Ukkonen, E.: Approximate string-matching with q-grams and maximal matches. Theor. Comput. Sci. 92, 191–211 (1993)
Yang, W.: Identifying syntactic differences between two programs. Software–Practice and Experience 21, 739–755 (1991)
Zaki, M.J.: Efficiently mining frequent trees in a forest. In: Proc. SIGKDD 2002, pp. 71–80 (2002)
Zhang, K., Shasha, D.: Tree pattern matching. In: Apostolico, A., Galil, Z. (eds.) Pattern matching algorithms, pp. 341–371 (1997)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ohkura, N., Hirata, K., Kuboyama, T., Harao, M. (2005). The q-Gram Distance for Ordered Unlabeled Trees. In: Hoffmann, A., Motoda, H., Scheffer, T. (eds) Discovery Science. DS 2005. Lecture Notes in Computer Science(), vol 3735. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11563983_17
Download citation
DOI: https://doi.org/10.1007/11563983_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29230-2
Online ISBN: 978-3-540-31698-5
eBook Packages: Computer ScienceComputer Science (R0)