An Approach for XML Similarity Join Using Tree Serialization

Lianzi Wen^1, Toshiyuki Amagasa^1,2 & Hiroyuki Kitagawa^1,2

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4947)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications


Abstract

This paper proposes a scheme for similarity join over XML data based on XML data serialization and subsequent similarity matching over XML node subsequences. With the rapid spread of XML, large volumes of electronic data are now marked up with it. As a consequence, a growing amount of XML data represents similar contents, but with dissimilar structures. Similarity join is used to extract as much information as possible from such heterogeneous data. Our proposed similarity join for XML data can be summarized as follows: 1) we serialize XML data as XML node sequences; 2) we extract semantically/structurally coherent subsequences; 3) we filter out dissimilar subsequences using textual information; and 4) we extract pairs of subsequences as the final result by checking structural similarity. This process is costly to execute, so to make it scale to large document sets we use a Bloom filter to speed up the text similarity computation. Experiments show the feasibility of the proposed scheme.
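
The four steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the serialization order, the rule for choosing subsequences, the similarity measures, or the Bloom filter parameters, so everything below is an assumption made for illustration. The sketch assumes a preorder serialization, one subsequence per subtree directly under the root, a Bloom filter over lowercased word tokens for the text filter, and a comparison of depth sequences (tree shape, ignoring tag names) as a stand-in for the structural check. All function names, thresholds, and sample documents are hypothetical.

```python
import difflib
import hashlib
import xml.etree.ElementTree as ET


def serialize(root):
    """Step 1: flatten an XML tree into a preorder list of (depth, tag, text)."""
    seq = []

    def visit(node, depth):
        seq.append((depth, node.tag, (node.text or "").strip()))
        for child in node:
            visit(child, depth + 1)

    visit(root, 0)
    return seq


def subsequences(seq, depth=1):
    """Step 2 (assumed rule): one subsequence per subtree rooted at `depth`."""
    runs, current = [], []
    for node in seq:
        if node[0] <= depth and current:
            runs.append(current)  # a new subtree or an ancestor ends the run
            current = []
        if node[0] >= depth:
            current.append(node)
    if current:
        runs.append(current)
    return runs


class BloomFilter:
    """Bit-vector set summary: no false negatives, occasional false positives."""

    def __init__(self, m=1024, k=3):  # m bits, k hash functions (assumed values)
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, token):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{token}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, token):
        for p in self._positions(token):
            self.bits |= 1 << p

    def __contains__(self, token):
        return all((self.bits >> p) & 1 for p in self._positions(token))


def tokens(run):
    """Lowercased words from the text of every node in a subsequence."""
    return {w.lower() for _, _, text in run for w in text.split()}


def text_overlap(bloom, run):
    """Step 3: fraction of the run's tokens that the Bloom filter reports as seen."""
    toks = tokens(run)
    return sum(t in bloom for t in toks) / len(toks) if toks else 0.0


def shape_similarity(run_a, run_b):
    """Step 4 (stand-in measure): similarity of relative depth sequences."""
    a = [d - run_a[0][0] for d, _, _ in run_a]
    b = [d - run_b[0][0] for d, _, _ in run_b]
    return difflib.SequenceMatcher(None, a, b).ratio()


def similarity_join(xml_a, xml_b, text_min=0.5, shape_min=0.5):
    """Return index pairs (i, j) of subsequences that pass both filters."""
    runs_a = subsequences(serialize(ET.fromstring(xml_a)))
    runs_b = subsequences(serialize(ET.fromstring(xml_b)))
    blooms = []
    for run in runs_a:
        bloom = BloomFilter()
        for t in tokens(run):
            bloom.add(t)
        blooms.append(bloom)
    pairs = []
    for i, (run_a, bloom) in enumerate(zip(runs_a, blooms)):
        for j, run_b in enumerate(runs_b):
            if text_overlap(bloom, run_b) < text_min:
                continue  # cheap text filter first
            if shape_similarity(run_a, run_b) >= shape_min:
                pairs.append((i, j))
    return pairs


# Hypothetical documents: similar content, different tag vocabularies.
doc_a = ("<lib><book><title>XML Similarity Join</title><author>Wen</author></book>"
         "<book><title>Graph Theory</title><author>Diestel</author></book></lib>")
doc_b = ("<catalog><item><name>XML similarity join</name><writer>Wen</writer></item>"
         "<item><name>Cooking Basics</name><writer>Smith</writer></item></catalog>")
print(similarity_join(doc_a, doc_b))
```

Because a Bloom filter never reports a false negative, the overlap it estimates is never lower than the true token overlap, so in this sketch the text filter cannot discard a pair that truly meets the threshold; false positives only let a few extra pairs through to the more expensive structural check.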

Author information

Authors and Affiliations

Department of Computer Science, Graduate School of Systems and Information Engineering,
Lianzi Wen, Toshiyuki Amagasa & Hiroyuki Kitagawa
Center for Computational Sciences, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8573, Japan
Toshiyuki Amagasa & Hiroyuki Kitagawa

Editor information

Jayant R. Haritsa, Ramamohanarao Kotagiri, Vikram Pudi

About this paper

Cite this paper

Wen, L., Amagasa, T., Kitagawa, H. (2008). An Approach for XML Similarity Join Using Tree Serialization. In: Haritsa, J.R., Kotagiri, R., Pudi, V. (eds) Database Systems for Advanced Applications. DASFAA 2008. Lecture Notes in Computer Science, vol 4947. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78568-2_47

DOI: https://doi.org/10.1007/978-3-540-78568-2_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78567-5
Online ISBN: 978-3-540-78568-2
eBook Packages: Computer Science, Computer Science (R0)
