Survey: An overview on XML similarity: Background, current trends and future directions

Authors:

Richard Chbeir,

Kokou Yetongnon

Computer Science Review, Volume 3, Issue 3

Pages 151 - 173

https://doi.org/10.1016/j.cosrev.2009.03.001

Published: 01 August 2009

Abstract

In recent years, XML has become established as a major means of information management and has been widely used for representing complex data (e.g., multimedia objects). Owing to the unparalleled growth in the use of the XML standard, developing efficient techniques for comparing XML-based documents has become essential in the database and information retrieval communities. In this paper, we provide an overview of XML similarity/comparison by presenting existing research in the area. We also detail the possible applications of XML comparison processes in various fields, including data warehousing, data integration, classification/clustering, and XML querying, and discuss some required and emerging directions for future research.


Index Terms

1. Applied computing
  1. Document management and text processing
2. Information systems


Published In

Computer Science Review, Volume 3, Issue 3

August 2009

69 pages

ISSN: 1574-0137

Copyright © 2009 Elsevier Inc.

Publisher

Elsevier Science Publishers B. V.

Netherlands
