Abstract
Approximate data matching is a central problem in several data management processes, such as data integration, data cleaning, approximate queries, similarity search and so on. An approximate matching process aims at defining whether two data represent the same real-world object. For atomic values (strings, dates, etc), similarity functions have been defined for several value domains (person names, addresses, and so on). For matching aggregated values, such as relational tuples and XML trees, approaches alternate from the definition of simple functions that combine values of similarity of record attributes to sophisticated techniques based on machine learning, for example. For complex data comparison, including structured and semistructured documents, existing approaches use both structure and data for the comparison, by either considering or not considering data semantics. This survey presents terminology and concepts that base approximated data matching, as well as discusses related work on the use of similarity functions in such a subject.
Similar content being viewed by others
References
Agrawal R, Faloutsos C, Swami AN (1993) Efficient similarity search in sequence databases. In: Proceedings of the 4th international conference on foundations of data organization and algorithms (FODO). Springer, London, pp 69–84
Al-Khalifa S, Yu C, Jagadish HV (2003) Querying structured text in an xml database. In: Proceedings of the 29th ACM SIGMOD international conference on management of data (SIGMOD). ACM, New York, pp 4–15
Arasu A, Chaudhuri S, Kaushik R (2008) Transformation-based framework for record matching. In: Proceedings of the 2008 IEEE 24th international conference on data engineering (ICDE). IEEE Computer Society, Washington, pp 40–49
Arasu A, Ganti V, Kaushik R (2006) Efficient exact set-similarity joins. In: Proceedings of the 32nd international conference on very large data bases (VLDB). VLDB Endowment, pp 918–929
Aygun RS (2008) S2s: structural-to-syntactic matching similar documents. Knowl Inf Syst 16(3): 303–329
Barioni MC, Razente HL, Traina C Jr, Traina AJM (2005) Querying complex objects by similarity in sql. In: Proceedings of the 20th Brazilian symposium on databases (SBBD), pp 130–144
Bhattacharya I, Getoor L (2005) Relational clustering for multi-type entity resolution. In: Proceedings of the 4th international workshop on multi-relational mining (MRDM). ACM, New York, pp 3–12
Bilenko M, Mooney R, Cohen W, Ravikumar P, Fienberg S (2003) Adaptive name matching in information integration. IEEE Intell Syst 18(5): 16–23
Bilenko M, Mooney RJ (2003) Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (KDD). ACM, New York, pp 39–48
Bollacker K, Lawrence S, Giles CL (1998) CiteSeer: an autonomous web agent for automatic retrieval and identification of interesting publications. In: Proceedings of the 2nd international conference on autonomous agents. ACM Press, New York, pp 116–123
Boser BE, Guyon IM, Vapnik V (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory (COLT). ACM, New York, pp 144–152
Broder A (1997) On the resemblance and containment of documents. In: Proceedings of the compression and complexity of sequences (SEQUENCES). IEEE Computer Society, Washington, p 21
Bruno N, Chaudhuri S, Gravano L (2002) Top-k selection queries over relational databases: mapping strategies and performance evaluation. ACM Trans Database Syst 27(2): 153–187
Bryan B, Schneider J, Nichol R, Miller C, Genovese C, Wasserman L (2006) Active learning for identifying function threshold boundaries. Advances in neural information processing systems (NIPS). MIT Press
Bueno R, Traina AJM, Traina C Jr (2005) Accelerating approximate similarity queries using genetic algorithms. In: Proceedings of the 2005 ACM symposium on applied computing (SAC), Santa Fe, New Mexico, USA, pp 617–622
Buttler D (2004) A short survey of document structure similarity algorithms. In: Proceedings of the international conference on internet computing, Las Vegas, Nevada, USA, pp 3–9
Carman MJ, Knoblock CA (2005) Inducing source descriptions for automated web service composition. In: Proceedings of the AAAI workshop on exploring planning and scheduling for web services, grid, and autonomic computing, Pittsburgh, Pennsylvania
Carvalho JCP, da Silva AS (2003) Finding similar identities among objects from multiple web sources. In: Proceedings of the 5th ACM international workshop on web information and data management (WIDM). ACM, New York, pp 90–93
Chapman S (2004) SimMetrics: a Java & C # .NET library of similarity metrics. http://sourceforge.net/projects/simmetrics/. Accessed 13 March 2009
Chaudhuri S, Chen B-C, Ganti V, Kaushik R (2007) Example-driven design of efficient record matching queries. In: Proceedings of the 33rd international conference on very large data bases (VLDB), VLDB Endowment, pp 327–338
Chaudhuri S, Ganjam K, Ganti V, Motwani R (2003) Robust and efficient fuzzy match for online data cleaning. In: Proceedings of the 29th ACM SIGMOD international conference on management of data (SIGMOD). ACM, New York, pp 313–324
Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd international conference on data engineering (ICDE). IEEE Computer Society, Washington, p 5
Chaudhuri S, Ganti V, Motwani R (2005) Robust identification of fuzzy duplicates. In: Proceedings of the 21st international conference on data engineering (ICDE). IEEE Computer Society, Washington, pp 865–876
Chaudhuri S, Gravano L (1999) Evaluating top-k selection queries. In: Proceedings of the 25th international conference on very large data bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, pp 397–410
Chawathe SS, Garcia-Molina H (1997) Meaningful change detection in structured data. In: Proceedings of the 1997 ACM SIGMOD international conference on management of data (SIGMOD). ACM, New York, pp 26–37
Cheng T, Chang KC-C (2007) Entity search engine: towards agile best-effort information integration over the web. In: Third biennial conference on innovative data systems research (CIDR), Asilomar, CA, USA, pp 108–113
Chuan Xiao, Wei Wang XL (2008) Ed-join: an efficient algorithm for similarity joins with edit distance constraints. In: Proceedings of the VLDB Endow. VLDB Endowment, pp 933–944
Cohen WW (1998a) Integration of heterogeneous databases without common domains using queries based on textual similarity, pp 201–212
Cohen WW (1998b) Providing database-like access to the web using queries based on textual similarity. SIGMOD Rec 27(2): 558–560
Cohen WW (1999) Recognizing structure in web pages using similarity queries. In: Proceedings of the sixteenth national conference on artificial intelligence and the eleventh innovative applications of artificial intelligence conference innovative applications of artificial intelligence (AAAI/IAAI). American Association for Artificial Intelligence, Menlo Park, pp 59–66
Cohen WW (2000) Whirl: a word-based information representation language. Artif Intell 118(1–2): 163–196
Cohen WW, Ravikumar P, Fienberg S (2003a) Secondstring: open source java-based package of approximate string-matching. http://secondstring.sourceforge.net/. Accessed 20 Dec 2005
Cohen WW, Ravikumar P, Fienberg SE (2003b) A comparison of string distance metrics for name-matching tasks. In: Proceedings of workshop on information integration on the Web, IIWeb, pp 73–78
Cohen WW, Richman J (2002) Learning to match and cluster large high-dimensional data sets for data integration. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (KDD). ACM, New York, pp 475–480
Deerwester SC, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990) Indexing by latent semantic analysis. JASIS 41(6): 391–407
Dey D, Sarkar S (1996) A probabilistic relational model and algebra. ACM Trans Database Syst 21(3): 339–369
Diaconis P, Graham R (1997) Spearman’s footrule as a measure of disarray. J R Stat Soc 39(2): 262–268
Doan A, Lu Y, Lee Y, Han J (2003) Profile-based object matching for information integration. IEEE Intell Syst 18(5): 54–59
Dorneles CF, Heuser CA, da Silva AS, de Moura ES (2009) A generic strategy for combining similarity metrics in approximate matching. Inf Syst (Oxford) 34: 673–689
Dorneles CF, Heuser CA, Lima AEN, da Silva AS, de Moura ES (2004) Measuring similarity between collection of values. In: Proceedings of the 6th annual ACM international workshop on web information and data management (WIDM). ACM, New York, pp 56–63
Dulucq S, Touzet H (2003) Analysis of tree edit distance algorithms. In: Proceedings of the 14th annual symposium combinatorial pattern matching (CPM), pp 83–95
Elmagarmid AK, Ipeirotis PG, Verykios VS (2007) Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 19: 1–16
Flesca S, Manco G, Masciari E, Pontieri L, Pugliese A (2005) Fast detection of XML structural similarity. IEEE Trans Knowl Data Eng 17(2): 160–175
Galhardas H, Florescu D, Shasha D, Simon E (2000) An extensible framework for data cleaning. In: Proceedings of the 16th international conference on data engineering (ICDE). IEEE Computer Society, Washington, p 312
Galhardas H, Florescu D, Shasha D, Simon E, Saita C-A (2001) Declarative data cleaning: language, model, and algorithms. In: Proceedings of the 27th international conference on very large data bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, pp 371–380
Gao L, Wang M, Wang XS, Padmanabhan S (2004) Expressing and optimizing similarity-based queries in sql. In: 23rd international conference on conceptual modeling (ER), Shanghai, China, pp 464–478
Getoor, L, Taskar, B (eds) (2007) Introduction to statistical relational learning. MIT Press, Cambridge
Giles CL, Bollacker K, Lawrence S (1998) CiteSeer: an automatic citation indexing system. In: Proceedings of the third ACM conference on digital libraries (DL). ACM, New York, pp 89–98
Gravano L, Ipeirotis PG, Jagadish HV, Koudas N, Muthukrishnan S, Srivastava D (2001) Approximate string joins in a database (almost) for free. In: Proceedings of the 27th international conference on very large data bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, pp 491–500
Gravano L, Ipeirotis PG, Koudas N, Srivastava D (2003) Text joins in an rdbms for web data integration. In: Proceedings of the 12th international conference on World Wide Web (WWW). ACM, New York, pp 90–101
Guha S, Koudas N, Marathe A, Srivastava D (2004) Merging the results of approximate match operations. In: Proceedings of the thirtieth international conference on very large data bases (VLDB). VLDB Endowment, pp 636–647
Hernandez MA, Stolfo SJ (1995) The merge/purge problem for large databases. SIGMOD Rec 24(2): 127–138
Hernández MA, Stolfo SJ (1998) Real-world data is dirty: data cleansing and the merge/purge problem. Data Min Knowl Discov 2(1): 9–37
Huffman S, Steier D (1995a) Heuristic joins to integrate structured heterogeneous data. In: Working notes of the AAAI spring symposium on information gathering from heterogeneous, distributed environment, pp 74–77
Huffman SB, Steier D (1995b) A navigation assistant for data source selection and integration, Working papers, Price Waterhouse
Ilyas F, Aref G, Elmagarmid K (2004) Supporting top-k join queries in relational databases. VLDB J 13(3): 754–765
Ilyas IF, Aref WG (2005) Rank-aware query processing and optimization. In: Proceedings of the 21st international conference on data engineering (ICDE). IEEE Computer Society, Washington, p 1144
Ilyas IF, Aref WG, Elmagarmid AK (2003) Supporting top-k join queries in relational databases. In: VLDB
Jin L, Li C, Mehrotra S (2002) Efficient similarity string joins in large data sets, technical report, University of California, Irvine. http://www.ics.uci.edu/chenli/pub/strjoin.ps
Kailing K, Kriegel H-P, Schönauer S, Seidl T (2004) Efficient similarity search for hierarchical data in large databases. In: 9th international conference on extending database technology (EDBT), Heraklion, Crete, Greece, pp 676–693
Knoblock CA, Ambite JL, Thakkar S (2003) A view integration approach to dynamic composition of web services. In: Proceedings of ICAPS workshop on planning for web services
Koudas N, Marathe A, Srivastava D (2004) Flexible string matching against large databases in practice. In: Proceedings of the thirtieth international conference on very large data bases (VLDB), VLDB Endowment, pp 1078–1086
Laender AHF, Gonalves MA, Roberto PA (2004) Bdbcomp: building a digital library for the brazilian computer science community. In: ACM/IEEE joint conference on digital libraries (JCDL), Tuscon, AZ, USA, pp 23–24
Lee L (2001) On the effectiveness of the skew divergence for statistical language analysis. Artif Intell Stat, pp 65–72
Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys Doklady, vol 10
Luis Leit a, Calado P, Weis M (2007) Structure-based inference of xml similarity for fuzzy duplicate detection. In: Proceedings of the sixteenth ACM conference on information and knowledge management (CIKM). ACM, New York, pp 293–302
Ma Y, Chbeir R (2005) Content and structure based approach for xml similarity. In: Proceedings of the fifth international conference on computer and information technology (CIT). IEEE Computer Society, Washington, pp 136–140
Melnik S, Garcia-Molina H, Rahm E (2002) Similarity flooding: a versatile graph matching algorithm and its application to schema matching. In: Proceedings of the 18th international conference on data engineering (ICDE). IEEE Computer Society, Washington, pp 117–128
Michalowski M, Ambite JL, Knoblock CA, Minton S, Thakkar S, Tuchinda R (2004) Retrieving and semantically integrating heterogeneous data from the web. IEEE Intell Syst 19(3)
Michalowski M, Thakkar S, Knoblock CA (2005) Automatically utilizing secondary sources to align information across data sources. AI Mag 26(1): 72–79
Milano D, Scannapieco M, Catarci T (2006) Structure aware xml object identification. In: Proceedings of the first Int’l VLDB workshop on clean databases (CleanDB). Seoul, Korea
Mitchell TM (1997) Machine learning. McGraw-Hill, New York
Motro A (1988) Vague: a user interface to relational databases that permits vague queries. ACM Trans Off Inf Syst 6(3): 187–214
Navarro G (2001) A guided tour of approximate string matching. ACM Comput Surv 33(1): 31–88
Nayak R (2008) Fast and effective clustering of xml data using structural information. Knowl Inf Syst 14(2): 197–215
Nierman A, Jagadish HV (2002) Evaluating structural similarity in XML documents. In: 5th international workshop on the web and databases (WebDB), Madison, Wisconsin, USA, pp 61–66
Park U, Seo Y (2005) An implementation of XML documents search system based on similarity in structure and semantics. In: International workshop on challenges in web information retrieval and integration
Puhlmann S, Weis M, Naumann F (2006) XML duplicate detection using sorted neighborhoods. In: Proceedings of the 10th international conference on extending database technology (EDBT), pp 773–791
Ragnemalm I (1993) The euclidean distance transform, Ph.D thesis, Department of Electrical Engineering, Linkpping University
Salton G, McGill M (1984) Introduction to modern information retrieval. McGraw-Hill, New York
Schallehn E, Sattler K-U, Saake G (2001) Advanced grouping and aggregation for data integration. In: CIKM ’01 proceedings of the tenth international conference on information and knowledge management. ACM, New York, pp 547–549
Schallehn E, Sattler K-U, Saake G (2004) Efficient similarity-based operations for data integration. Data Knowl Eng 48(3): 361–387
Shatkay H, Zdonik SB (1996) Approximate queries and representations for large data sequences. In: Proceedings of the twelfth international conference on data engineering (ICDE). IEEE Computer Society, Washington, pp 536–545
Shen W, Li X, Doan A (2005) Constraint-based entity matching. In: Proceedings of the 20th national conference on artificial intelligence (AAAI). AAAI Press, pp 862–867
Stasiu RK, Heuser CA, Silva R (2005) Estimating recall and precision for vague queries in databases. In: Proceedings of the 17th international conference advanced information systems engineering (CAISE), pp 187–200
Takasu A, Fukagawa D, Akutsu T (2007) Statistical learning algorithm for tree similarity. In: Proceedings of the seventh IEEE international conference on data mining (ICDM). IEEE Computer Society, Washington, pp 667–672
Tejada S, Knoblock CA, Minton S (2001) Learning object identification rules for information integration. Inf Syst 26(8)
Tejada S, Knoblock CA, Minton S (2002) Learning domain-independent string transformation weights for high accuracy object identification, pp 350–359
Thor A, Rahm E (2007) Moma—a mapping-based object matching system. In: ‘Third biennial conference on innovative data systems research (CIDR). Asilomar, CA, USA, pp 247–258
Tran T, Nayak R, Bruza P (2008) Combining structure and content similarities for xml document clustering. In: Proceedings of the 7th Australasian data mining conference (AusDM), vol 87. ACS, Glenelg, pp 219–226
Vinson AR, Heuser CA, Silva AS, de Moura ES (2007) An approach to xml path matching. In: Proceedings of the 9th annual ACM international workshop on web information and data management (WIDM). ACM, New York, pp 17–24
Wan X (2008) Beyond topical similarity: a structural similarity measure for retrieving highly similar documents. Knowl Inf Syst 15(1): 55–73
Wang T-Y, Ré C, Suciu D (2008) Implementing not exists predicates over a probabilistic database. In: Proceedings of the international workshop on quality in databases and management of uncertain data, pp 73–86
Wang W, Xiao C, Lin X, Zhang C (2009) Efficient approximate entity extraction with edit distance constraints. In: Proceedings of the 35th SIGMOD international conference on management of data (SIGMOD). ACM, New York, pp 759–770
Weis M, Naumann F (2004) Detecting duplicate objects in xml documents. In: Proceedings of the 2004 international workshop on information quality in information systems (IQIS). ACM, New York, pp 10–19
Weis M, Naumann F (2005) Dogmatix tracks down duplicates in xml. In: Proceedings of the 2005 ACM SIGMOD international conference on management of data (SIGMOD). ACM, New York, pp 431–442
Weis M, Naumann F (2006) Detecting duplicates in complex xml data. In: Proceedings of the 22nd international conference on data engineering (ICDE). IEEE Computer Society, Washington, p 109
Widom J (2008) Managing and mining uncertain data. In: Trio: a system for data, uncertainty, and lineage. Springer, Berlin
Xiao C, Wang W, Lin X, Shang H (2009) Top-k set similarity joins. In: Proceedings of the IEEE international conference on data engineering (ICDE). IEEE Computer Society, Washington, pp 916–927
Xiao C, Wang W, Lin X, Yu JX (2008) Efficient similarity joins for near duplicate detection. In: Proceeding of the 17th international conference on World Wide Web (WWW). ACM, New York, pp 131–140
Yang J, Cheung WK, Chen X (2005) Integrating element and term semantics for similarity-based xml document clustering. In: Proceedings of the 2005 IEEE/WIC/ACM international conference on web intelligence (WI). IEEE Computer Society, Washington, pp 222–228
Yongming G, Dehua C, Jiajin L (2008) Clustering xml documents by combining content and structure. In: Proceedings of the international symposium on information science and engieering. IEEE Computer Society, Los Alamitos, pp 583–587
Yu CT, Philip G, Meng W (2003) Distributed top-n query processing with possibly uncooperative local systems. In: Proceedings of the 29th international conference on very large data bases (VLDB). VLDB Endowment, pp 117–128
Zhao H (2008) Instance weighting versus threshold adjusting for cost-sensitive classification. Knowl Inf Syst 15(3): 321–334
Zhao H, Ram S (2005) Entity identification for heterogeneous database integration: a multiple classifier system approach and empirical evaluation. Inf Syst 30(2): 119–132
Author information
Authors and Affiliations
Corresponding author
Additional information
This work was partially done while the author C. F. Dorneles was a Ph.D. student at the Informatic Institute, UFRGS, and was supported by Capes.
Rights and permissions
About this article
Cite this article
Dorneles, C.F., Gonçalves, R. & dos Santos Mello, R. Approximate data instance matching: a survey. Knowl Inf Syst 27, 1–21 (2011). https://doi.org/10.1007/s10115-010-0285-0
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-010-0285-0