More Web Proxy on the site http://driver.im/

research-article

Open access

A Scalable Index for Top-k Subtree Similarity Queries

Authors:

Nikolaus AugstenAuthors Info & Claims

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data

Pages 1624 - 1641

https://doi.org/10.1145/3299869.3319892

Published: 25 June 2019 Publication History

Abstract

Given a query tree Q, the top-k subtree similarity query retrieves the k subtrees in a large document tree T that are closest to Q in terms of tree edit distance. The classical solution scans the entire document, which is slow. The state-of-the-art approach precomputes an index to reduce the query time. However, the index is large (quadratic in the document size), building the index is expensive, updates are not supported, and data-specific tuning is required.

We present a scalable solution for the top-k subtree similarity problem that does not assume specific data types, nor does it require any tuning. The key idea is to process promising subtrees first. A subtree is promising if it shares many labels with the query. We develop a new technique based on inverted lists that efficiently retrieves subtrees in the required order and supports incremental updates of the document. To achieve linear space, we avoid full list materialization but build relevant parts of a list on the fly.

In an extensive empirical evaluation on synthetic and real-world data, our technique consistently outperforms the state-of-the-art index w.r.t. memory usage, indexing time, and the number of candidates that must be verified. In terms of query time, we clearly outperform the state of the art and achieve runtime improvements of up to four orders of magnitude.

References

[1]

Simian - Similarity Analyzer | Duplicate Code Detection for the Enterprise. https://www.harukizaemon.com/simian/. Accessed: 2019-02--11, 02:29 PM.

[2]

R. Akbarinia, E. Pacitti, and P. Valduriez. Best position algorithms for top-k queries. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, pages 495--506. VLDB Endowment, 2007.

Digital Library

[3]

T. Akutsu. Tree Edit Distance Problems: Algorithms and Applications to Bioinformatics. IEICE Transactions on Information and Systems, E93.D(2):208--218, 2010.

[4]

N. Augsten, D. Barbosa, M. Böhlen, and T. Palpanas. TASM: Top-k Approximate Subtree Matching. In 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), pages 353--364, March 2010.

[5]

N. Augsten, D. Barbosa, M. Bohlen, and T. Palpanas. Efficient Top-k Approximate Subtree Matching in Small Memory. IEEE Transactions on Knowledge and Data Engineering, 23(8):1123--1137, Aug 2011.

Digital Library

[6]

N. Augsten, M. Böhlen, and J. Gamper. Approximate Matching of Hierarchical Data Using pq-grams. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB '05, pages 301--312. VLDB Endowment, 2005.

Digital Library

[7]

P. Bille. A survey on tree edit distance and related problems. Theoretical Computer Science, 337(1--3):217--239, 2005.

Digital Library

[8]

P. Calado, M. Herschel, and L. Leit a o. An overview of XML duplicate detection algorithms. In Soft Computing in XML Data Management - Intelligent Systems from Decision Making to Data Mining, Web Intelligence and Computer Vision, volume 255 of Studies in Fuzziness and Soft Computing, pages 193--224. Springer, 2010.

[9]

S. Cohen. Indexing for Subtree Similarity-Search Using Edit Distance. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 49--60, New York, NY, USA, 2013. ACM.

Digital Library

[10]

S. Cohen and N. Or. A General Algorithm for Subtree Similarity-Search. In 2014 IEEE 30th International Conference on Data Engineering, pages 928--939, March 2014.

[11]

E. D. Demaine, S. Mozes, B. Rossman, and O. Weimann. An optimal decomposition algorithm for tree edit distance. ACM Transactions on Algorithms, 6(1), 2009.

Digital Library

[12]

R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '01, pages 102--113, New York, NY, USA, 2001. ACM.

Digital Library

[13]

R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci., 66(4):614--656, June 2003.

Digital Library

[14]

J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus. Fine-grained and accurate source code differencing. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, ASE '14, pages 313--324, New York, NY, USA, 2014. ACM.

Digital Library

[15]

J. Finis, R. Brunel, A. Kemper, T. Neumann, F. F"arber, and N. May. DeltaNI: An Efficient Labeling Scheme for Versioned Hierarchical Data. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 905--916, New York, NY, USA, 2013. ACM.

Digital Library

[16]

T. Grust. Accelerating xpath location steps. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD '02, pages 109--120, New York, NY, USA, 2002. ACM.

Digital Library

[17]

C. Herrbach, A. Denise, and S. Dulucq. Average complexity of the jiang-wang-zhang pairwise tree alignment algorithm and of a rna secondary structure alignment algorithm. Theoretical Computer Science, 411(26):2423 -- 2432, 2010.

Digital Library

[18]

I. F. Ilyas, G. Beskales, and M. A. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv., 40(4):11:1--11:58, 2008.

Digital Library

[19]

M. Kashkoush and H. ElMaraghy. Matching bills of materials using tree reconciliation. Procedia CIRP, 7:169 -- 174, 2013. Forty Sixth CIRP Conference on Manufacturing Systems 2013.

[20]

R. Kaushik, R. Krishnamurthy, J. F. Naughton, and R. Ramakrishnan. On the integration of structure indexes and inverted lists. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD '04, pages 779--790, New York, NY, USA, 2004. ACM.

Digital Library

[21]

F. Li, H. Wang, J. Li, and H. Gao. A Survey on Tree Edit Distance Lower Bound Estimation Techniques for Similarity Join on XML Data. SIGMOD Rec., 42(4):29--39, Feb. 2014.

Digital Library

[22]

M. Pawlik and N. Augsten. Tree edit distance: Robust and memory-efficient. Information Systems, 56:157 -- 173, 2016.

Digital Library

[23]

S. Puhlmann, M. Weis, and F. Naumann. XML duplicate detection using sorted neighborhoods. In International Conference on Extending Database Technology (EDBT), volume 3896 of Lecture Notes in Computer Science, pages 773--791. Springer, 2006.

Digital Library

[24]

Y. Tang, Y. Cai, and N. Mamoulis. Scaling Similarity Joins over Tree-structured Data. Proc. VLDB Endow., 8(11):1130--1141, July 2015.

Digital Library

[25]

M. Theobald, H. Bast, D. Majumdar, R. Schenkel, and G. Weikum. Topx: efficient and versatile top-k query processing for semistructured data. The VLDB Journal, 17(1):81--115, Jan 2008.

Digital Library

[26]

M. Theobald, R. Schenkel, and G. Weikum. Efficient and Self-tuning Incremental Query Expansion for Top-k Query Processing. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '05, pages 242--249, New York, NY, USA, 2005. ACM.

Digital Library

[27]

M. Theobald, G. Weikum, and R. Schenkel. Top-k Query Evaluation with Probabilistic Guarantees. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, VLDB '04, pages 648--659. VLDB Endowment, 2004.

Digital Library

[28]

M. L. A. Vidal, A. S. da Silva, E. S. de Moura, and J. a. M. B. Cavalcanti. Structure-driven crawler generation by example. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '06, pages 292--299, New York, NY, USA, 2006. ACM.

Digital Library

[29]

C. Xiao, W. Wang, X. Lin, and H. Shang. Top-k Set Similarity Joins. In Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE '09, pages 916--927, Washington, DC, USA, 2009. IEEE Computer Society.

Digital Library

[30]

R. Yang, P. Kalnis, and A. K. H. Tung. Similarity Evaluation on Tree-structured Data. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD '05, pages 754--765, New York, NY, USA, 2005. ACM.

Digital Library

[31]

K. Zhang and D. Shasha. Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems. SIAM J. Comput., 18(6):1245--1262, Dec. 1989.

Digital Library

Cited By

Neuhof FFisichella MPapadakis GNikoletos KAugsten NNejdl WKoubarakis M(2024)Open benchmark for filtering techniques in entity resolutionThe VLDB Journal10.1007/s00778-024-00868-733:5(1671-1696)Online publication date: 9-Jul-2024
https://doi.org/10.1007/s00778-024-00868-7
Mizokami TBou SAmagasa T(2024)Subtree Similarity Search Based on Structure and TextBig Data Analytics and Knowledge Discovery10.1007/978-3-031-68323-7_6(72-87)Online publication date: 26-Aug-2024
https://dl.acm.org/doi/10.1007/978-3-031-68323-7_6
Chen XWang PChen JWang W(2023)AS-Parser: Log Parsing Based on Adaptive SegmentationProceedings of the ACM on Management of Data10.1145/36267191:4(1-26)Online publication date: 12-Dec-2023
https://dl.acm.org/doi/10.1145/3626719
Show More Cited By

Index Terms

A Scalable Index for Top-k Subtree Similarity Queries
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
      1. Top-k retrieval in databases

Recommendations

Top-k Tree Similarity Join
CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management

Tree similarity join is useful for analyzing tree structured data. The traditional threshold-based tree similarity join requires a similarity threshold, which is usually a difficult task for users. To remedy this issue, we advocate the problem of top-k ...
Efficient Top-k Approximate Subtree Matching in Small Memory

We consider the Top-k Approximate Subtree Matching (tasm) problem: finding the k best matches of a small query tree within a large document tree using the canonical tree edit distance as a similarity measure between subtrees. Evaluating the tree edit ...
Top-k best probability queries on probabilistic data
DASFAA'12: Proceedings of the 17th international conference on Database Systems for Advanced Applications - Volume Part II

There has been much interest in answering top-k queries on probabilistic data in various applications such as market analysis, personalised services, and decision making. In relation to probabilistic data, the most common problem in answering top-k ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data

June 2019

2106 pages

ISBN:9781450356435

DOI:10.1145/3299869

General Chairs:
Peter Boncz
CWI & Vrije Universiteit Amsterdam, The Netherlands
,
Stefan Manegold
CWI & Universiteit Leiden, The Netherlands
,
Program Chairs:
Anastasia Ailamaki
EPFL, Switzerland
,
Amol Deshpande
University of Maryland, USA
,
Tim Kraska
MIT, USA

Copyright © 2019 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 June 2019

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Badges

Results Reproduced / v1.1

Author Tags

Qualifiers

Research-article

Funding Sources

Austrian Science Fund

Conference

SIGMOD/PODS '19

Sponsor:

SIGMOD

SIGMOD/PODS '19: International Conference on Management of Data

June 30 - July 5, 2019

Amsterdam, Netherlands

Acceptance Rates

SIGMOD '19 Paper Acceptance Rate 88 of 430 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

8
Total Citations
View Citations
1,087
Total Downloads

Downloads (Last 12 months)97
Downloads (Last 6 weeks)15

Reflects downloads up to 10 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Neuhof FFisichella MPapadakis GNikoletos KAugsten NNejdl WKoubarakis M(2024)Open benchmark for filtering techniques in entity resolutionThe VLDB Journal10.1007/s00778-024-00868-733:5(1671-1696)Online publication date: 9-Jul-2024
https://doi.org/10.1007/s00778-024-00868-7
Mizokami TBou SAmagasa T(2024)Subtree Similarity Search Based on Structure and TextBig Data Analytics and Knowledge Discovery10.1007/978-3-031-68323-7_6(72-87)Online publication date: 26-Aug-2024
https://dl.acm.org/doi/10.1007/978-3-031-68323-7_6
Chen XWang PChen JWang W(2023)AS-Parser: Log Parsing Based on Adaptive SegmentationProceedings of the ACM on Management of Data10.1145/36267191:4(1-26)Online publication date: 12-Dec-2023
https://dl.acm.org/doi/10.1145/3626719
Hütter TKocher DAivaloglou EFletcher GMiedema D(2023)Feedforward-Aided Course Designs for Similarity SearchProceedings of the 2nd International Workshop on Data Systems Education: Bridging education practice with education research10.1145/3596673.3596974(14-17)Online publication date: 23-Jun-2023
https://dl.acm.org/doi/10.1145/3596673.3596974
Papadakis GFisichella MSchoger FMandilaras GAugsten NNejdl W(2023)Benchmarking Filtering Techniques for Entity Resolution2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00389(653-666)Online publication date: Apr-2023
https://doi.org/10.1109/ICDE55515.2023.00389
D’ascenzo AD’emidio M(2023)Top-k Distance Queries on Large Time-Evolving GraphsIEEE Access10.1109/ACCESS.2023.331660211(102228-102242)Online publication date: 2023
https://doi.org/10.1109/ACCESS.2023.3316602
Hütter TAugsten NKirsch CCarey MLi CIves ZBonifati AEl Abbadi A(2022)JEDI: These aren't the JSON documents you're looking for...Proceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517850(1584-1597)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517850
Wang JYang JZhang WDemartini GZuccon GCulpepper JHuang ZTong H(2021)Top-k Tree Similarity JoinProceedings of the 30th ACM International Conference on Information & Knowledge Management10.1145/3459637.3482304(1939-1948)Online publication date: 26-Oct-2021
https://dl.acm.org/doi/10.1145/3459637.3482304

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents