[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Scaling similarity joins over tree-structured data

Published: 01 July 2015 Publication History

Abstract

Given a large collection of tree-structured objects (e.g., XML documents), the similarity join finds the pairs of objects that are similar to each other, based on a similarity threshold and a tree edit distance measure. The state-of-the-art similarity join methods compare simpler approximations of the objects (e.g., strings), in order to prune pairs that cannot be part of the similarity join result based on distance bounds derived by the approximations. In this paper, we propose a novel similarity join approach, which is based on the dynamic decomposition of the tree objects into subgraphs, according to the similarity threshold. Our technique avoids computing the exact distance between two tree objects, if the objects do not share at least one common subgraph. In order to scale up the join, the computed subgraphs are managed in a two-layer index. Our experimental results on real and synthetic data collections show that our approach outperforms the state-of-the-art methods by up to an order of magnitude.

References

[1]
T. Akutsu, D. Fukagawa, and A. Takasu. Approximating tree edit distance through string edit distance. Algorithmica, 57(2):325--348, 2010.
[2]
J. Albert. Algebraic properties of bag data types. In VLDB, pages 211--219, 1991.
[3]
N. Augsten, D. Barbosa, M. H. Böhlen, and T. Palpanas. TASM: Top-k approximate subtree matching. In ICDE, pages 353--364, 2010.
[4]
N. Augsten and M. H. Böhlen. Similarity Joins in Relational Database Systems. Morgan & Claypool Publishers, 2013.
[5]
N. Augsten, M. H. Böhlen, C. E. Dyreson, and J. Gamper. Windowed pq-grams for approximate joins of data-centric XML. VLDB J., 21(4):463--488, 2012.
[6]
G. Blin, A. Denise, S. Dulucq, C. Herrbach, and H. Touzet. Alignments of RNA structures. IEEE/ACM Trans. Comput. Biology Bioinform., 7(2):309--322, 2010.
[7]
S. Cohen. Indexing for subtree similarity-search using edit distance. In SIGMOD, pages 49--60, 2013.
[8]
S. Cohen and N. Or. A general algorithm for subtree similarity-search. In ICDE, pages 928--939, 2014.
[9]
W. W. Cohen. Data integration using similarity joins and a word-based information representation language. pages 288--321, 2000.
[10]
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms (3. ed.). MIT Press, 2009.
[11]
E. D. Demaine, S. Mozes, B. Rossman, and O. Weimann. An optimal decomposition algorithm for tree edit distance. ACM Transactions on Algorithms, 6(1), 2009.
[12]
S. Fries, B. Boden, G. Stepien, and T. Seidl. Phidj: Parallel similarity self-join for high-dimensional vector data with mapreduce. In ICDE, pages 796--807, 2014.
[13]
S. Guha, H. V. Jagadish, N. Koudas, D. Srivastava, and T. Yu. Integrating XML data sources using approximate joins. ACM Trans. Database Syst., 31(1):161--207, 2006.
[14]
M. R. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR, pages 284--291, 2006.
[15]
T. Jiang, L. Wang, and K. Zhang. Alignment of trees - an alternative to tree edit. Theor. Comput. Sci., 143(1):137--148, 1995.
[16]
K. Kailing, H.-P. Kriegel, S. Schönauer, and T. Seidl. Efficient similarity search for hierarchical data in large databases. In EDBT, pages 676--693, 2004.
[17]
P. N. Klein. Computing the edit-distance between unrooted ordered trees. In ESA, pages 91--102, 1998.
[18]
F. Li, H. Wang, J. Li, and H. Gao. A survey on tree edit distance lower bound estimation techniques for similarity join on XML data. SIGMOD Record, 42(4):29--39, 2014.
[19]
G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based method for similarity joins. PVLDB, 5(3):253--264, 2011.
[20]
M. Pawlik and N. Augsten. RTED: A robust algorithm for the tree edit distance. PVLDB, 5(4):334--345, 2011.
[21]
A. Rheinlnder, M. Knobloch, N. Hochmuth, and U. Leser. Prefix tree indexing for similarity search and similarity joins on genomic data. In SSDBM, pages 519--536, 2010.
[22]
R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. P. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.
[23]
K. Tai. The tree-to-tree correction problem. J. ACM, 26(3):422--433, 1979.
[24]
E. Tanaka and K. Tanaka. The tree-to-tree editing problem. IJPRAI, 2(2):221--240, 1988.
[25]
S. Tatikonda and S. Parthasarathy. Hashing tree-structured data: Methods and applications. In ICDE, pages 429--440, 2010.
[26]
A. Torsello, A. Robles-Kelly, and E. R. Hancock. Discovering shape classes using tree edit-distance and pairwise clustering. International Journal of Computer Vision, 72(3):259--285, 2007.
[27]
R. Yang, P. Kalnis, and A. K. H. Tung. Similarity evaluation on tree-structured data. In SIGMOD, pages 754--765, 2005.
[28]
M. J. Zaki. Efficiently mining frequent trees in a forest: Algorithms and applications. IEEE Trans. Knowl. Data Eng., 17(8):1021--1035, 2005.
[29]
K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput., 18(6):1245--1262, 1989.
[30]
X. Zhao, C. Xiao, X. Lin, Q. Liu, and W. Zhang. A partition-based approach to structure similarity search. PVLDB, 7(3):169--180, 2013.

Cited By

View all
  • (2022)SyncSignatureProceedings of the VLDB Endowment10.14778/3565816.356583316:2(330-342)Online publication date: 1-Oct-2022
  • (2022)JEDI: These aren't the JSON documents you're looking for...Proceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517850(1584-1597)Online publication date: 10-Jun-2022
  • (2019)A Scalable Index for Top-k Subtree Similarity QueriesProceedings of the 2019 International Conference on Management of Data10.1145/3299869.3319892(1624-1641)Online publication date: 25-Jun-2019
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 8, Issue 11
July 2015
264 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 July 2015
Published in PVLDB Volume 8, Issue 11

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1
  • Downloads (Last 6 weeks)1
Reflects downloads up to 11 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2022)SyncSignatureProceedings of the VLDB Endowment10.14778/3565816.356583316:2(330-342)Online publication date: 1-Oct-2022
  • (2022)JEDI: These aren't the JSON documents you're looking for...Proceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517850(1584-1597)Online publication date: 10-Jun-2022
  • (2019)A Scalable Index for Top-k Subtree Similarity QueriesProceedings of the 2019 International Conference on Management of Data10.1145/3299869.3319892(1624-1641)Online publication date: 25-Jun-2019
  • (2018)Submodularity of Distributed Join ComputationProceedings of the 2018 International Conference on Management of Data10.1145/3183713.3183728(1237-1252)Online publication date: 27-May-2018

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media