[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3299869.3319892acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Open access

A Scalable Index for Top-k Subtree Similarity Queries

Published: 25 June 2019 Publication History

Abstract

Given a query tree Q, the top-k subtree similarity query retrieves the k subtrees in a large document tree T that are closest to Q in terms of tree edit distance. The classical solution scans the entire document, which is slow. The state-of-the-art approach precomputes an index to reduce the query time. However, the index is large (quadratic in the document size), building the index is expensive, updates are not supported, and data-specific tuning is required.
We present a scalable solution for the top-k subtree similarity problem that does not assume specific data types, nor does it require any tuning. The key idea is to process promising subtrees first. A subtree is promising if it shares many labels with the query. We develop a new technique based on inverted lists that efficiently retrieves subtrees in the required order and supports incremental updates of the document. To achieve linear space, we avoid full list materialization but build relevant parts of a list on the fly.
In an extensive empirical evaluation on synthetic and real-world data, our technique consistently outperforms the state-of-the-art index w.r.t. memory usage, indexing time, and the number of candidates that must be verified. In terms of query time, we clearly outperform the state of the art and achieve runtime improvements of up to four orders of magnitude.

References

[1]
Simian - Similarity Analyzer | Duplicate Code Detection for the Enterprise. https://www.harukizaemon.com/simian/. Accessed: 2019-02--11, 02:29 PM.
[2]
R. Akbarinia, E. Pacitti, and P. Valduriez. Best position algorithms for top-k queries. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, pages 495--506. VLDB Endowment, 2007.
[3]
T. Akutsu. Tree Edit Distance Problems: Algorithms and Applications to Bioinformatics. IEICE Transactions on Information and Systems, E93.D(2):208--218, 2010.
[4]
N. Augsten, D. Barbosa, M. Böhlen, and T. Palpanas. TASM: Top-k Approximate Subtree Matching. In 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), pages 353--364, March 2010.
[5]
N. Augsten, D. Barbosa, M. Bohlen, and T. Palpanas. Efficient Top-k Approximate Subtree Matching in Small Memory. IEEE Transactions on Knowledge and Data Engineering, 23(8):1123--1137, Aug 2011.
[6]
N. Augsten, M. Böhlen, and J. Gamper. Approximate Matching of Hierarchical Data Using pq-grams. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB '05, pages 301--312. VLDB Endowment, 2005.
[7]
P. Bille. A survey on tree edit distance and related problems. Theoretical Computer Science, 337(1--3):217--239, 2005.
[8]
P. Calado, M. Herschel, and L. Leit a o. An overview of XML duplicate detection algorithms. In Soft Computing in XML Data Management - Intelligent Systems from Decision Making to Data Mining, Web Intelligence and Computer Vision, volume 255 of Studies in Fuzziness and Soft Computing, pages 193--224. Springer, 2010.
[9]
S. Cohen. Indexing for Subtree Similarity-Search Using Edit Distance. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 49--60, New York, NY, USA, 2013. ACM.
[10]
S. Cohen and N. Or. A General Algorithm for Subtree Similarity-Search. In 2014 IEEE 30th International Conference on Data Engineering, pages 928--939, March 2014.
[11]
E. D. Demaine, S. Mozes, B. Rossman, and O. Weimann. An optimal decomposition algorithm for tree edit distance. ACM Transactions on Algorithms, 6(1), 2009.
[12]
R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '01, pages 102--113, New York, NY, USA, 2001. ACM.
[13]
R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci., 66(4):614--656, June 2003.
[14]
J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus. Fine-grained and accurate source code differencing. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, ASE '14, pages 313--324, New York, NY, USA, 2014. ACM.
[15]
J. Finis, R. Brunel, A. Kemper, T. Neumann, F. F"arber, and N. May. DeltaNI: An Efficient Labeling Scheme for Versioned Hierarchical Data. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 905--916, New York, NY, USA, 2013. ACM.
[16]
T. Grust. Accelerating xpath location steps. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD '02, pages 109--120, New York, NY, USA, 2002. ACM.
[17]
C. Herrbach, A. Denise, and S. Dulucq. Average complexity of the jiang-wang-zhang pairwise tree alignment algorithm and of a rna secondary structure alignment algorithm. Theoretical Computer Science, 411(26):2423 -- 2432, 2010.
[18]
I. F. Ilyas, G. Beskales, and M. A. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv., 40(4):11:1--11:58, 2008.
[19]
M. Kashkoush and H. ElMaraghy. Matching bills of materials using tree reconciliation. Procedia CIRP, 7:169 -- 174, 2013. Forty Sixth CIRP Conference on Manufacturing Systems 2013.
[20]
R. Kaushik, R. Krishnamurthy, J. F. Naughton, and R. Ramakrishnan. On the integration of structure indexes and inverted lists. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD '04, pages 779--790, New York, NY, USA, 2004. ACM.
[21]
F. Li, H. Wang, J. Li, and H. Gao. A Survey on Tree Edit Distance Lower Bound Estimation Techniques for Similarity Join on XML Data. SIGMOD Rec., 42(4):29--39, Feb. 2014.
[22]
M. Pawlik and N. Augsten. Tree edit distance: Robust and memory-efficient. Information Systems, 56:157 -- 173, 2016.
[23]
S. Puhlmann, M. Weis, and F. Naumann. XML duplicate detection using sorted neighborhoods. In International Conference on Extending Database Technology (EDBT), volume 3896 of Lecture Notes in Computer Science, pages 773--791. Springer, 2006.
[24]
Y. Tang, Y. Cai, and N. Mamoulis. Scaling Similarity Joins over Tree-structured Data. Proc. VLDB Endow., 8(11):1130--1141, July 2015.
[25]
M. Theobald, H. Bast, D. Majumdar, R. Schenkel, and G. Weikum. Topx: efficient and versatile top-k query processing for semistructured data. The VLDB Journal, 17(1):81--115, Jan 2008.
[26]
M. Theobald, R. Schenkel, and G. Weikum. Efficient and Self-tuning Incremental Query Expansion for Top-k Query Processing. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '05, pages 242--249, New York, NY, USA, 2005. ACM.
[27]
M. Theobald, G. Weikum, and R. Schenkel. Top-k Query Evaluation with Probabilistic Guarantees. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, VLDB '04, pages 648--659. VLDB Endowment, 2004.
[28]
M. L. A. Vidal, A. S. da Silva, E. S. de Moura, and J. a. M. B. Cavalcanti. Structure-driven crawler generation by example. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '06, pages 292--299, New York, NY, USA, 2006. ACM.
[29]
C. Xiao, W. Wang, X. Lin, and H. Shang. Top-k Set Similarity Joins. In Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE '09, pages 916--927, Washington, DC, USA, 2009. IEEE Computer Society.
[30]
R. Yang, P. Kalnis, and A. K. H. Tung. Similarity Evaluation on Tree-structured Data. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD '05, pages 754--765, New York, NY, USA, 2005. ACM.
[31]
K. Zhang and D. Shasha. Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems. SIAM J. Comput., 18(6):1245--1262, Dec. 1989.

Cited By

View all
  • (2024)Open benchmark for filtering techniques in entity resolutionThe VLDB Journal10.1007/s00778-024-00868-733:5(1671-1696)Online publication date: 9-Jul-2024
  • (2024)Subtree Similarity Search Based on Structure and TextBig Data Analytics and Knowledge Discovery10.1007/978-3-031-68323-7_6(72-87)Online publication date: 26-Aug-2024
  • (2023)AS-Parser: Log Parsing Based on Adaptive SegmentationProceedings of the ACM on Management of Data10.1145/36267191:4(1-26)Online publication date: 12-Dec-2023
  • Show More Cited By

Index Terms

  1. A Scalable Index for Top-k Subtree Similarity Queries

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data
    June 2019
    2106 pages
    ISBN:9781450356435
    DOI:10.1145/3299869
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 June 2019

    Permissions

    Request permissions for this article.

    Check for updates

    Badges

    Author Tags

    1. top-k query
    2. top-k subtree similarity
    3. tree edit distance
    4. tree indexing

    Qualifiers

    • Research-article

    Funding Sources

    • Austrian Science Fund

    Conference

    SIGMOD/PODS '19
    Sponsor:
    SIGMOD/PODS '19: International Conference on Management of Data
    June 30 - July 5, 2019
    Amsterdam, Netherlands

    Acceptance Rates

    SIGMOD '19 Paper Acceptance Rate 88 of 430 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)97
    • Downloads (Last 6 weeks)15
    Reflects downloads up to 10 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Open benchmark for filtering techniques in entity resolutionThe VLDB Journal10.1007/s00778-024-00868-733:5(1671-1696)Online publication date: 9-Jul-2024
    • (2024)Subtree Similarity Search Based on Structure and TextBig Data Analytics and Knowledge Discovery10.1007/978-3-031-68323-7_6(72-87)Online publication date: 26-Aug-2024
    • (2023)AS-Parser: Log Parsing Based on Adaptive SegmentationProceedings of the ACM on Management of Data10.1145/36267191:4(1-26)Online publication date: 12-Dec-2023
    • (2023)Feedforward-Aided Course Designs for Similarity SearchProceedings of the 2nd International Workshop on Data Systems Education: Bridging education practice with education research10.1145/3596673.3596974(14-17)Online publication date: 23-Jun-2023
    • (2023)Benchmarking Filtering Techniques for Entity Resolution2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00389(653-666)Online publication date: Apr-2023
    • (2023)Top-k Distance Queries on Large Time-Evolving GraphsIEEE Access10.1109/ACCESS.2023.331660211(102228-102242)Online publication date: 2023
    • (2022)JEDI: These aren't the JSON documents you're looking for...Proceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517850(1584-1597)Online publication date: 10-Jun-2022
    • (2021)Top-k Tree Similarity JoinProceedings of the 30th ACM International Conference on Information & Knowledge Management10.1145/3459637.3482304(1939-1948)Online publication date: 26-Oct-2021

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media