Survey: An overview on XML similarity: Background, current trends and future directions

Published: 01 August 2009

Abstract

In recent years, XML has become established as a major means of information management and has been widely used for complex data representation (e.g., multimedia objects). Owing to the unparalleled growth in the use of the XML standard, developing efficient techniques for comparing XML-based documents has become essential in the database and information retrieval communities. In this paper, we provide an overview of XML similarity/comparison by presenting existing research in the area. We also detail the possible applications of XML comparison processes in various fields, including data warehousing, data integration, classification/clustering, and XML querying, and discuss some required and emerging future research directions.

Published In

Computer Science Review, Volume 3, Issue 3
August 2009
69 pages

Publisher

Elsevier Science Publishers B. V.
Netherlands
