

Blocking and Filtering Techniques for Entity Resolution: A Survey

Authors: George Papadakis, Dimitrios Skoutas, Emmanouil Thanos, Themis Palpanas

ACM Computing Surveys (CSUR), Volume 53, Issue 2

Article No.: 31, Pages 1 - 42

https://doi.org/10.1145/3377455

Published: 20 March 2020

Abstract

Entity Resolution (ER), a core task of Data Integration, detects different entity profiles that correspond to the same real-world object. Because its complexity is inherently quadratic, a series of techniques have been proposed to accelerate it so that it scales to voluminous data. In this survey, we review a large number of relevant works under two different but related frameworks: Blocking and Filtering. The former restricts comparisons to entity pairs that are more likely to match, while the latter quickly identifies entity pairs that are likely to satisfy predetermined similarity thresholds. We also elaborate on hybrid approaches that combine characteristics of both. For each framework, we provide a comprehensive list of the relevant works and discuss them in the broader context. We conclude with the most promising directions for future work in the field.
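The distinction between the two frameworks can be made concrete with a small sketch. The code below is illustrative only and is not taken from the survey: it assumes token blocking (profiles become candidates when they share a token) as the blocking step and a length filter for Jaccard similarity as the filtering step. The function names, the sample profiles, and the thresholds are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Lower-cased whitespace tokens of a profile (an assumed representation)."""
    return set(text.lower().split())


def token_blocking(profiles):
    """Blocking: keep only pairs of profiles that share at least one token."""
    blocks = defaultdict(set)
    for pid, text in profiles.items():
        for tok in tokens(text):
            blocks[tok].add(pid)
    candidates = set()
    for ids in blocks.values():
        # Every pair inside a block is a candidate; the set removes repeats.
        candidates.update(combinations(sorted(ids), 2))
    return candidates


def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)


def length_filter(a, b, threshold):
    """Filtering: Jaccard(a, b) <= min(|a|, |b|) / max(|a|, |b|), so a pair
    whose size ratio is below the threshold cannot satisfy it."""
    small, large = sorted((len(a), len(b)))
    return small >= threshold * large


def resolve(profiles, threshold):
    """Compare only blocked pairs that also survive the length filter."""
    matches = set()
    for a, b in token_blocking(profiles):
        ta, tb = tokens(profiles[a]), tokens(profiles[b])
        if length_filter(ta, tb, threshold) and jaccard(ta, tb) >= threshold:
            matches.add((a, b))
    return matches


# Hypothetical sample data, used only to exercise the sketch.
profiles = {
    1: "George Papadakis Athens",
    2: "G. Papadakis Athens",
    3: "Themis Palpanas Paris",
    4: "Palpanas Themis Paris France",
    5: "Peter Christen Canberra",
}
candidates = token_blocking(profiles)
matches = resolve(profiles, 0.6)
```

On these five profiles, a brute-force approach would compare all 10 pairs, whereas the blocking step leaves only the pairs that share a token. The length filter then discards pairs cheaply, without computing the set intersection, whenever the sizes of the two token sets already rule out the threshold.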

References

[1]

Noha Adly. 2009. Efficient record linkage using a double embedding scheme. In Proceedings of the 2009 International Conference on Data Mining (DMIN'09). 274--281.

[2]

Foto Afrati, Anish Das Sarma, David Menestrina, Aditya Parameswaran, and Jeffrey Ullman. 2012. Fuzzy joins using Mapreduce. In ICDE. 498--509.

[3]

Akiko N. Aizawa and Keizo Oyama. 2005. A fast linkage detection scheme for multi-source information integration. In Proceedings of the 2005 International Workshop on Challenges in Web Information Retrieval and Integration (WIRI'05). 30--39.

[4]

Amin Allam, Spiros Skiadopoulos, and Panos Kalnis. 2018. Improved suffix blocking for record linkage and entity resolution. DKE 117 (2018), 98--113.

[5]

Yasser Altowim, Dmitri V. Kalashnikov, and Sharad Mehrotra. 2014. Progressive approach to relational entity resolution. PVLDB 7, 11 (2014), 999--1010.

Digital Library

[6]

Yasser Altowim and Sharad Mehrotra. 2017. Parallel progressive approach to entity resolution using Mapreduce. In Proceedings of the 33rd IEEE International Conference on Data Engineering (ICDE'17). 909--920.

[7]

Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. 2006. Efficient exact set-similarity joins. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB'06). 918--929.

Digital Library

[8]

Samur Araújo, Duc Thanh Tran, Arjen P. de Vries, and Daniel Schwabe. 2015. SERIMI: Class-based matching for instance matching across heterogeneous datasets. IEEE TKDE 27, 5 (2015), 1397--1410.

[9]

Tiago Brasileiro Araújo, Carlos Eduardo Santos Pires, and Thiago Pereira da Nóbrega. 2017. Spark-based streamlined metablocking. In Proceedings of the 2017 IEEE Symposium on Computers and Communications (ISCC'17). 844--850.

[10]

Nikolaus Augsten and Michael H. Böhlen. 2013. Similarity Joins in Relational Database Systems. Morgan 8 Claypool Publishers.

[11]

Rohan Baxter, Peter Christen, and Tim Churches. 2003. A comparison of fast blocking methods for record linkage. In Proceedings of the Workshop on Data Cleaning, Record Linkage and Object Consolidation at the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

[12]

Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. 2007. Scaling up all pairs similarity search. In Proceedings of the 16th International Conference on World Wide Web (WWW'07). 131--140.

Digital Library

[13]

Alexandros Belesiotis, Dimitrios Skoutas, Christodoulos Efstathiades, Vassilis Kaffes, and Dieter Pfoser. 2018. Spatio-textual user matching and clustering based on set similarity joins. VLDB J. 27, 3 (2018), 297--320.

Digital Library

[14]

Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. 2009. Swoosh: A generic approach to entity resolution. VLDB J. 18, 1 (2009), 255--276.

Digital Library

[15]

Guilherme Dal Bianco, Marcos André Gonçalves, and Denio Duarte. 2018. BLOSS: Effective meta-blocking with almost no effort. Inf. Syst. 75 (2018), 75--89.

[16]

Mikhail Bilenko, Beena Kamath, and Raymond J. Mooney. 2006. Adaptive blocking: Learning to scale up record linkage. In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM'06). 87--96.

[17]

Mikhail Bilenko and Raymond J. Mooney. 2003. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03). 39--48.

[18]

T. Bocek, E. Hunt, and B. Stiller. 2007. Fast Similarity Search in Large Dictionaries. Technical Report ifi-2007.02. Department of Informatics, University of Zurich. http://fastss.csg.uzh.ch/.

[19]

Panagiotis Bouros, Shen Ge, and Nikos Mamoulis. 2012. Spatio-textual similarity joins. PVLDB 6, 1 (2012), 1--12.

Digital Library

[20]

Andrei Z. Broder. 1997. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of SEQUENCES (SEQUENCES'97). 21--29.

Digital Library

[21]

Yunbo Cao, Zhiyuan Chen, Jiamin Zhu, Pei Yue, Chin-Yew Lin, and Yong Yu. 2011. Leveraging unlabeled data to scale blocking for record linkage. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI'11). 2211--2217.

[22]

Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. 2006. A primitive operator for similarity joins in data cleaning. In Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE'06). 5.

Digital Library

[23]

Peter Christen. 2008. Febrl-: An open source data cleaning, deduplication and record linkage system with a graphical user interface. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'08). 1065--1068.

Digital Library

[24]

Peter Christen. 2012. Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.

[25]

Peter Christen. 2012. A survey of indexing techniques for scalable record linkage and deduplication. IEEE TKDE 24, 9 (2012), 1537--1555.

[26]

Peter Christen, Ross W. Gayler, and David Hawking. 2009. Similarity-aware indexing for real-time entity resolution. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM'09). 1565--1568.

Digital Library

[27]

Tobias Christiani and Rasmus Pagh. 2017. Set similarity search beyond MinHash. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC'17). 1094--1107.

Digital Library

[28]

Tobias Christiani, Rasmus Pagh, and Johan Sivertsen. 2018. Scalable and robust set similarity join. In Proceedings of the 34th IEEE International Conference on Data Engineering (ICDE'18). 1240--1243.

[29]

Vassilis Christophides, Vasilis Efthymiou, and Kostas Stefanidis. 2015. Entity Resolution in the Web of Data. Morgan 8 Claypool Publishers.

[30]

Xu Chu, Ihab F. Ilyas, and Paraschos Koutris. 2016. Distributed data deduplication. PVLDB 9, 11 (2016), 864--875.

Digital Library

[31]

Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. 2004. Finding community structure in very large networks. Phys. Rev. E 70, 6 (2004), 066111.

[32]

Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th ACM Symposium on Computational Geometry (SOCG'04). 253--262.

[33]

Timothy de Vries, Hui Ke, Sanjay Chawla, and Peter Christen. 2009. Robust record linkage blocking using suffix arrays. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM'09). 305--314.

Digital Library

[34]

Timothy de Vries, Hui Ke, Sanjay Chawla, and Peter Christen. 2011. Robust record linkage blocking using suffix arrays and Bloom filters. TKDD 5, 2 (2011), 9:1--9:27.

[35]

Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113.

Digital Library

[36]

Dong Deng, Albert Kim, Samuel Madden, and Michael Stonebraker. 2017. SilkMoth: An efficient method for finding related sets with maximum matching constraints. PVLDB 10, 10 (2017), 1082--1093.

Digital Library

[37]

Dong Deng, Guoliang Li, He Wen, and Jianhua Feng. 2015. An efficient partition based method for exact set similarity joins. PVLDB 9, 4 (2015), 360--371.

Digital Library

[38]

Xin Luna Dong and Divesh Srivastava. 2015. Big Data Integration. Morgan 8 Claypool Publishers.

[39]

Uwe Draisbach and Felix Naumann. 2010. DuDe: The duplicate detection toolkit. In Proceedings of the 8th International Workshop on Quality in Databases (QDB'10).

[40]

Uwe Draisbach and Felix Naumann. 2011. A generalization of blocking and windowing algorithms for duplicate detection. In Proceedings of the 2011 International Conference on Data and Knowledge Engineering (ICDKE'11). 18--24.

[41]

Uwe Draisbach, Felix Naumann, Sascha Szott, and Oliver Wonneberg. 2012. Adaptive windows for duplicate detection. In Proceedings of the 28th IEEE International Conference on Data Engineering (ICDE'12). 1073--1083.

Digital Library

[42]

Songyun Duan, Achille Fokoue, Oktie Hassanzadeh, Anastasios Kementsietsidis, Kavitha Srinivas, and Michael J. Ward. 2012. Instance-based matching of large ontologies using locality-sensitive hashing. In Proceedings of the 11th International Semantic Web Conference (ISWC'12). 49--64.

[43]

Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq R. Joty, Mourad Ouzzani, and Nan Tang. 2018. Distributed representations of tuples for entity resolution. PVLDB 11, 11 (2018), 1454--1467.

[44]

Vasilis Efthymiou, George Papadakis, George Papastefanatos, Kostas Stefanidis, and Themis Palpanas. 2015. Parallel meta-blocking: Realizing scalable entity resolution over large, heterogeneous data. In IEEE Big Data. 411--420.

[45]

Vasilis Efthymiou, George Papadakis, George Papastefanatos, Kostas Stefanidis, and Themis Palpanas. 2017. Parallel meta-blocking for scaling entity resolution over big heterogeneous data. Inf. Syst. 65 (2017), 137--157.

Digital Library

[46]

Vasilis Efthymiou, Kostas Stefanidis, and Vassilis Christophides. 2015. Big data entity resolution: From highly to somehow similar entity descriptions in the web. In IEEE Big Data. 401--410.

[47]

Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007. Duplicate record detection: A survey. IEEE TKDE 19, 1 (2007), 1--16.

[48]

Luiz Evangelista, Eli Cortez, Altigran da Silva, and Wagner Meira Jr. 2010. Adaptive and flexible blocking for record linkage tasks. JIDM 1, 2 (2010), 167--182.

[49]

Ivan P. Fellegi and Alan B. Sunter. 1969. A theory for record linkage. J. Amer. Statist. Assoc. 64, 328 (1969), 1183--1210.

[50]

Fabian Fier, Nikolaus Augsten, Panagiotis Bouros, Ulf Leser, and Johann-Christoph Freytag. 2018. Set similarity joins on Mapreduce: An experimental survey. PVLDB 11, 10 (2018), 1110--1122.

Digital Library

[51]

Jeffrey Fisher, Peter Christen, Qing Wang, and Erhard Rahm. 2015. A clustering-based framework to control block sizes for entity resolution. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'15). 279--288.

Digital Library

[52]

Dengfeng Gao. 2009. Temporal joins. In Encyclopedia of Database Systems. 2982--2987.

[53]

Lise Getoor and Ashwin Machanavajjhala. 2012. Entity resolution: Theory, practice 8 open challenges. PVLDB 5, 12 (2012), 2018--2019.

Digital Library

[54]

Phan Giang. 2015. A machine learning approach to create blocking criteria for record linkage. Health Care Manag. Sci. 18, 1 (2015), 93--105.

[55]

Aristides Gionis, Piotr Indyk, and Rajeev Motwani. 1999. Similarity search in high dimensions via hashing. In Proceedings of 25th International Conference on Very Large Data Bases (VLDB'99). 518--529.

Digital Library

[56]

Behzad Golshan, Alon Y. Halevy, George A. Mihaila, and Wang-Chiew Tan. 2017. Data integration: After the teenage years. In ACM PODS. 101--106.

[57]

Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. 2001. Approximate string joins in a database (Almost) for free. In Proceedings of 27th International Conference on Very Large Data Bases (VLDB'01). 491--500.

Digital Library

[58]

Thomas Gschwind, Christoph Miksovic, Katsiaryna Mirylenka, and Paolo Scotton. 2019. Fast record linkage for company entities. CoRR abs/1907.08667 (2019).

[59]

Lifang Gu and Rohan A. Baxter. 2004. Adaptive filtering for efficient record linkage. In Proceedings of the 4th SIAM International Conference on Data Mining (SDM'04). 477--481.

[60]

Mauricio A. Hernández and Salvatore J. Stolfo. 1995. The merge/purge problem for large databases. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD'95). 127--138.

[61]

Mauricio A. Hernández and Salvatore J. Stolfo. 1998. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Min. Knowl. Discov. 2, 1 (1998), 9--37.

Digital Library

[62]

Robert Isele, Anja Jentzsch, and Christian Bizer. 2011. Efficient multidimensional blocking for link discovery without losing recall. In Proceedings of the 14th International Workshop on the Web and Database (WebDB'11).

[63]

Edwin H. Jacox and Hanan Samet. 2007. Spatial join techniques. ACM TODS 32, 1 (2007), 7.

Digital Library

[64]

Yu Jiang, Guoliang Li, Jianhua Feng, and Wen-Syan Li. 2014. String similarity joins: An experimental evaluation. PVLDB 7, 8 (2014), 625--636.

Digital Library

[65]

Liang Jin, Chen Li, and Sharad Mehrotra. 2003. Efficient record linkage in large data sets. In Proceedings of the 8th International Conference on Database Systems for Advanced Applications (DASFAA'03). 137--146.

[66]

Pawel Jurczyk, James J. Lu, Li Xiong, Janet D. Cragan, and Adolfo Correa. 2008. Fine-grained record integration and linkage tool. Birth Defects Res. A: Clin. Mol. Teratol. 82, 11 (2008), 822--829.

[67]

Murat Kantarcioglu, Ali Inan, Wei Jiang, and Bradley Malin. 2009. Formal anonymity models for efficient privacy-preserving joins. DKE 68, 11 (2009), 1206--1223.

Digital Library

[68]

Dimitrios Karapiperis, Aris Gkoulalas-Divanis, and Vassilios S. Verykios. 2018. Fast schemes for online record linkage. Data Min. Knowl. Discov. 32, 5 (2018), 1229--1250.

Digital Library

[69]

Dimitrios Karapiperis, Aris Gkoulalas-Divanis, and Vassilios S. Verykios. 2018. Summarization algorithms for record linkage. In Proceedings of the 21th International Conference on Extending Database Technology (EDBT'18). 73--84.

[70]

Dimitrios Karapiperis, Dinusha Vatsalan, Vassilios S. Verykios, and Peter Christen. 2016. Efficient record linkage using a compact hamming space. In Proceedings of the 19th International Conference on Extending Database Technology (EDBT'16). 209--220.

[71]

Dimitrios Karapiperis and Vassilios S. Verykios. 2016. A fast and efficient hamming LSH-based scheme for accurate linkage. KAIS 49, 3 (2016), 861--884.

Digital Library

[72]

Mayank Kejriwal and Daniel P. Miranker. 2013. An unsupervised algorithm for learning blocking schemes. In Proceedings of the 13th IEEE International Conference on Data Mining (ICDM'13). 340--349.

[73]

Mayank Kejriwal and Daniel P. Miranker. 2014. A two-step blocking scheme learner for scalable link discovery. In Proceedings of the 9th International Workshop on Ontology Matching (OM'14). 49--60.

[74]

Mayank Kejriwal and Daniel P. Miranker. 2015. A DNF blocking scheme learner for heterogeneous datasets. CoRR abs/1501.01694 (2015).

[75]

Batya Kenig and Avigdor Gal. 2013. MFIBlocks: An effective blocking algorithm for entity resolution. Inf. Syst. 38, 6 (2013), 908--926.

Digital Library

[76]

Hung-sik Kim and Dongwon Lee. 2010. HARRA: Fast iterative hashed record linkage for large-scale data collections. In Proceedings of the 13th International Conference on Extending Database Technology (EDBT'10). 525--536.

[77]

Lars Kolb, Andreas Thor, and Erhard Rahm. 2012. Dedoop: Efficient deduplication with Hadoop. PVLDB 5, 12 (2012), 1878--1881.

Digital Library

[78]

Lars Kolb, Andreas Thor, and Erhard Rahm. 2012. Load balancing for Mapreduce-based entity resolution. In Proceedings of the 28th IEEE International Conference on Data Engineering (ICDE'12). 618--629.

Digital Library

[79]

Lars Kolb, Andreas Thor, and Erhard Rahm. 2012. Multi-pass sorted neighborhood blocking with MapReduce. Comput. Sci. R8D 27, 1 (2012), 45--63.

Digital Library

[80]

Pradap Konda, Sanjib Das, Paul Suganthan G. C., AnHai Doan, Adel Ardalan, Jeffrey R. Ballard, Han Li, Fatemah Panahi, Haojun Zhang, Jeffrey F. Naughton, Shishir Prasad, Ganesh Krishnan, Rohit Deep, and Vijay Raghavendra. 2016. Magellan: Toward building entity matching management systems. PVLDB 9, 12 (2016), 1197--1208.

Digital Library

[81]

Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2009. Comparative evaluation of entity resolution approaches with FEVER. PVLDB 2, 2 (2009), 1574--1577.

Digital Library

[82]

Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of entity resolution approaches on real-world match problems. PVLDB 3, 1 (2010), 484--493.

Digital Library

[83]

Chen Li, Jiaheng Lu, and Yiming Lu. 2008. Efficient merging and filtering algorithms for approximate string searches. In Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE'08). 257--266.

Digital Library

[84]

Guoliang Li, Dong Deng, Jiannan Wang, and Jianhua Feng. 2011. PASS-JOIN: A partition-based method for similarity joins. PVLDB 5, 3 (2011), 253--264.

Digital Library

[85]

Guoliang Li, Jian He, Dong Deng, and Jian Li. 2015. Efficient similarity join and search on multi-attribute data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD'15). 1137--1151.

Digital Library

[86]

Han Li, Pradap Konda, Paul Suganthan, AnHai Doan, Benjamin Snyder, Youngchoon Park, Ganesh Krishnan, Rohit Deep, and Vijay Raghavendra. 2018. MatchCatcher: A debugger for blocking in entity matching. In Proceedings of the 21th International Conference on Extending Database Technology (EDBT'18). 193--204.

[87]

Yaping Li and Minghua Chen. 2008. Privacy preserving joins. In Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE'08). 1352--1354.

Digital Library

[88]

Huizhi Liang, Yanzhe Wang, Peter Christen, and Ross W. Gayler. 2014. Noise-tolerant approximate blocking for dynamic real-time entity resolution. In Proceedings of the 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'14). 449--460.

[89]

Wei Lu, Xiaoyong Du, Marios Hadjieleftheriou, and Beng Chin Ooi. 2014. Efficiently supporting edit distance based string similarity search using B+-trees. IEEE TKDE 26, 12 (2014), 2983--2996.

[90]

Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. 2007. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB'07). 950--961.

[91]

Kun Ma, Fusen Dong, and Bo Yang. 2015. Large-scale schema-free data deduplication approach with adaptive sliding window using mapreduce. Comput. J. 58, 11 (2015), 3187--3201.

[92]

Yongtao Ma and Thanh Tran. 2013. TYPiMatch: Type-specific unsupervised learning of keys and key values for heterogeneous web data integration. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM'13). 325--334.

Digital Library

[93]

Pankaj Malhotra, Puneet Agarwal, and Gautam Shroff. 2014. Graph-parallel entity resolution using LSH 8 IMM. In Proceedings of the Workshops of the EDBT/ICDT 2014 Joint Conference. 41--49.

[94]

Willi Mann and Nikolaus Augsten. 2014. PEL: Position-enhanced length filter for set similarity joins. In Grundlagen Datenbanken. 89--94.

[95]

Willi Mann, Nikolaus Augsten, and Panagiotis Bouros. 2016. An empirical evaluation of set similarity join techniques. PVLDB 9, 9 (2016), 636--647.

Digital Library

[96]

Ruhaila Maskat, Norman W. Paton, and Suzanne M. Embury. 2016. Pay-as-you-go configuration of entity resolution. T-LSD-KCS 29 (2016), 40--65.

[97]

Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the Workshops of the EDBT/ICDT 2014 Joint Conference. 169--178.

[98]

W. P. McNeill, Hakan Kardes, and Andrew Borthwick. 2012. Dynamic record blocking: Efficient linking of massive databases in Mapreduce. In Proceedings of the 10th International Workshop on Quality in Databases (QDB'12).

[99]

Demetrio Gomes Mestre, Carlos Eduardo S. Pires, and Dimas C. Nascimento. 2015. Adaptive sorted neighborhood blocking for entity matching with Mapreduce. In Proceedings of the 30th Annual ACM Symposium on Applied Computing (SAC'15). 981--987.

[100]

Matthew Michelson and Craig A. Knoblock. 2006. Learning blocking schemes for record linkage. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI'06). 440--445.

[101]

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS'13). 3111--3119.

Digital Library

[102]

Dimas Cassimiro Nascimento, Carlos Eduardo Santos Pires, and Demetrio Gomes Mestre. 2020. Exploiting block co-occurrence to control block sizes for entity resolution. Knowl. Inf. Syst. 62, 1 (2020), 359--400.

[103]

Sahand Negahban, Benjamin I. P. Rubinstein, and Jim Gemmell. 2012. Scaling multiple-source entity resolution using statistically efficient transfer learning. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM'12). 2224--2228.

Digital Library

[104]

E. D. Nelson and J. R. Talburt. 2011. Entity resolution for longitudinal studies in education using OYSTER. In Proceedings of the 2011 International Conference on Information and Knowledge Engineering (IKE'11).

[105]

Markus Nentwig, Michael Hartung, Axel Ngomo, and Erhard Rahm. 2017. A survey of current link discovery frameworks. Semantic Web 8, 3 (2017), 419--436.

Digital Library

[106]

Axel Ngomo. 2013. ORCHID - Reduction-ratio-optimal computation of geo-spatial distances for link discovery. In Proceedings of the 12th International Semantic Web Conference (ISWC'12). 395--410.

Digital Library

[107]

Axel Ngomo and Sören Auer. 2011. LIMES - A time-efficient approach for large-scale link discovery on the web of data. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI'11). 2312--2317.

[108]

Andriy Nikolov, Victoria Uren, and Enrico Motta. 2007. KnoFuss: A comprehensive architecture for knowledge fusion. In Proceedings of the 4th International Conference on Knowledge Capture (K-CAP'07). 185--186.

Digital Library

[109]

Jordi Nin, Victor Muntés-Mulero, Norbert Martínez-Bazan, and Josep-Lluís Larriba-Pey. 2007. On the use of semantic blocking techniques for data cleansing and integration. In Proceedings of the 11th International Database Engineering and Applications Symposium (IDEAS'07). 190--198.

Digital Library

[110]

Kevin O’Hare, Anna Jurek, and Cassio de Campos. 2018. A new technique of selecting an optimal blocking method for better record linkage. Inf. Syst. 77 (2018), 151--166.

[111]

Kevin O’Hare, Anna Jurek-Loughrey, and Cassio de Campos. 2019. A review of unsupervised and semi-supervised blocking methods for record linkage. In Linking and Mining Heterogeneous and Multi-view Data. 79--105.

[112]

George Papadakis, George Alexiou, George Papastefanatos, and Georgia Koutrika. 2015. Schema-agnostic vs schema-based configurations for blocking methods on homogeneous data. PVLDB 9, 4 (2015), 312--323.

Digital Library

[113]

George Papadakis, Konstantina Bereta, Themis Palpanas, and Manolis Koubarakis. 2017. Multi-core meta-blocking for big linked data. In Proceedings of the 13th International Conference on Semantic Systems (SEMANTICS'17). 33--40.

Digital Library

[114]

George Papadakis, Gianluca Demartini, Peter Fankhauser, and Philipp Kärger. 2010. The missing links: Discovering hidden same-as links among a billion of triples. In Proceedings of the 12th International Conference on Information Integration and Web-based Applications and Services (iiWAS'10). 453--460.

Digital Library

[115]

George Papadakis, George Giannakopoulos, Claudia Niederée, Themis Palpanas, and Wolfgang Nejdl. 2011. Detecting and exploiting stability in evolving heterogeneous information spaces. In Proceedings of the 2011 Joint International Conference on Digital Libraries (JCDL'11). 95--104.

Digital Library

[116]

George Papadakis, Ekaterini Ioannou, Claudia Niederée, and Peter Fankhauser. 2011. Efficient entity resolution for large heterogeneous information spaces. In Proceedings of the 4th International Conference on Web Search and Web Data Mining (WSDM'11). 535--544.

Digital Library

[117]

George Papadakis, Ekaterini Ioannou, Claudia Niederée, Themis Palpanas, and Wolfgang Nejdl. 2011. Eliminating the redundancy in blocking-based entity resolution methods. In Proceedings of the 2011 Joint International Conference on Digital Libraries (JCDL'11). 85--94.

Digital Library

[118]

George Papadakis, Ekaterini Ioannou, Claudia Niederée, Themis Palpanas, and Wolfgang Nejdl. 2011. To compare or not to compare: Making entity resolution more efficient. In Proceedings of the International Workshop on Semantic Web Information Management (SWIM'11). 3.

Digital Library

[119]

George Papadakis, Ekaterini Ioannou, Claudia Niederée, Themis Palpanas, and Wolfgang Nejdl. 2012. Beyond 100 million entities: Large-scale blocking-based resolution for heterogeneous data. In Proceedings of the 5th International Conference on Web Search and Web Data Mining (WSDM'12). 53--62.

Digital Library

[120]

George Papadakis, Ekaterini Ioannou, Themis Palpanas, Claudia Niederée, and Wolfgang Nejdl. 2013. A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE TKDE 25, 12 (2013), 2665--2682.

[121]

George Papadakis, Georgia Koutrika, Themis Palpanas, and Wolfgang Nejdl. 2014. Meta-blocking: Taking entity resolution to the next level. IEEE TKDE 26, 8 (2014), 1946--1960.

[122]

George Papadakis and Wolfgang Nejdl. 2011. Efficient entity resolution methods for heterogeneous information spaces. In Workshops Proceedings of the 27th IEEE International Conference on Data Engineering. 304--307.

Digital Library

[123]

George Papadakis and Themis Palpanas. 2016. Blocking for large-scale entity resolution: Challenges, algorithms, and practical examples. In Proceedings of the 32nd IEEE International Conference on Data Engineering (ICDE'16). 1436--1439.

[124]

George Papadakis and Themis Palpanas. 2018. Web-scale, schema-agnostic, end-to-end entity resolution. In Companion Volume of The Web Conference 2018 (WWW'18).

[125]

George Papadakis, George Papastefanatos, and Georgia Koutrika. 2014. Supervised meta-blocking. PVLDB 7, 14 (2014), 1929--1940.

Digital Library

[126]

George Papadakis, George Papastefanatos, Themis Palpanas, and Manolis Koubarakis. 2016. Boosting the efficiency of large-scale entity resolution with enhanced meta-blocking. Big Data Res. 6 (2016), 43--63.

[127]

George Papadakis, George Papastefanatos, Themis Palpanas, and Manolis Koubarakis. 2016. Scaling entity resolution to large, heterogeneous data with enhanced meta-blocking. In Proceedings of the 19th International Conference on Extending Database Technology (EDBT'16). 221--232.

[128]

George Papadakis, Jonathan Svirsky, Avigdor Gal, and Themis Palpanas. 2016. Comparative analysis of approximate blocking techniques for entity resolution. PVLDB 9, 9 (2016), 684--695.

Digital Library

[129]

George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas, and Manolis Koubarakis. 2018. The return of JedAI: End-to-end entity resolution for structured and semi-structured data. PVLDB 11, 12 (2018), 1950--1953.

Digital Library

[130]

Thorsten Papenbrock, Arvid Heise, and Felix Naumann. 2015. Progressive duplicate detection. IEEE TKDE 27, 5 (2015), 1316--1329.

[131]

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP'14). 1532--1543.

[132]

Sven Puhlmann, Melanie Weis, and Felix Naumann. 2006. XML Duplicate detection using sorted neighborhoods. In Proceedings of the 10th International Conference on Extending Database Technology (EDBT'06). 773--791.

Digital Library

[133]

Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, and Xuemin Lin. 2011. Efficient exact edit similarity query processing with the asymmetric signature scheme. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD'11). 1033--1044.

Digital Library

[134]

Jianbin Qin and Chuan Xiao. 2018. Pigeonring: A principle for faster thresholded similarity search. PVLDB 12, 1 (2018), 28--42.

Digital Library

[135]

Banda Ramadan and Peter Christen. 2014. Forest-based dynamic sorted neighborhood indexing for real-time entity resolution. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM'14). 1787--1790.

Digital Library

[136]

Banda Ramadan and Peter Christen. 2015. Unsupervised blocking key selection for real-time entity resolution. In Proceedings of the 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'15). 574--585.

[137]

Banda Ramadan, Peter Christen, Huizhi Liang, and Ross W. Gayler. 2015. Dynamic sorted neighborhood indexing for real-time entity resolution. J. Data Inf. Qual. 6, 4, Article 15 (2015), 15:1--15:29 pages.

[138]

Banda Ramadan, Peter Christen, Huizhi Liang, Ross W. Gayler, and David Hawking. 2013. Dynamic similarity-aware inverted indexing for real-time entity resolution. In International Workshops of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining. 47--58.

Digital Library

[139]

Thilina Ranbaduge, Dinusha Vatsalan, and Peter Christen. 2016. Scalable block scheduling for efficient multi-database record linkage. In Proceedings of the 16th IEEE International Conference on Data Mining (ICDM'16). 1161--1166.

[140]

Leonardo Andrade Ribeiro and Theo Härder. 2011. Generalizing prefix filtering to improve set similarity joins. Inf. Syst. 36, 1 (2011), 62--78.

Digital Library

[141]

Stephen V. Rice. 2007. Braided AVL trees for efficient event sets and ranked sets in the SIMSCRIPT III simulation programming language. In Western MultiConference on Computer Simulation. 150--155.

[142]

Chuitian Rong, Chunbin Lin, Yasin N. Silva, Jianguo Wang, Wei Lu, and Xiaoyong Du. 2017. Fast and scalable distributed set similarity joins for big data analytics. In Proceedings of the 33rd IEEE International Conference on Data Engineering (ICDE'17). 1059--1070.

[143]

Chuitian Rong, Wei Lu, Xiaoli Wang, Xiaoyong Du, Yueguo Chen, and Anthony K. H. Tung. 2013. Efficient and scalable processing of string similarity join. IEEE TKDE 25, 10 (2013), 2217--2230.

[144]

Sunita Sarawagi and Alok Kirpal. 2004. Efficient set joins on similarity predicates. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD'04). 743--754.

Digital Library

[145]

Murat Sariyar, Andreas Borg, and Klaus Pommerening. 2011. Controlling false match rates in record linkage using extreme value theory. J. Biomed. Inf. 44, 4 (2011), 648--654.

Digital Library

[146]

Anish Das Sarma, Ankur Jain, Ashwin Machanavajjhala, and Philip Bohannon. 2012. An automatic blocking mechanism for large-scale de-duplication tasks. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM'12). 1055--1064.

[147]

Venu Satuluri and Srinivasan Parthasarathy. 2012. Bayesian locality sensitive hashing for fast similarity search. PVLDB 5, 5 (2012), 430--441.

[148]

Wei Shen, Jianyong Wang, and Jiawei Han. 2015. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE TKDE 27, 2 (2015), 443--460.

[149]

Liangcai Shu, Aiyou Chen, Ming Xiong, and Weiyi Meng. 2011. Efficient spectral neighborhood blocking for entity resolution. In Proceedings of the 27th International Conference on Data Engineering (ICDE'11). 1067--1078.

[150]

Giovanni Simonini, Sonia Bergamaschi, and H. V. Jagadish. 2016. BLAST: A loosely schema-aware meta-blocking approach for entity resolution. PVLDB 9, 12 (2016), 1173--1184.

[151]

Giovanni Simonini, Luca Gagliardelli, Sonia Bergamaschi, and H. V. Jagadish. 2019. Scaling entity resolution: A loosely schema-aware approach. Inf. Syst. 83 (2019), 145--165.

[152]

Giovanni Simonini, George Papadakis, Themis Palpanas, and Sonia Bergamaschi. 2019. Schema-agnostic progressive entity resolution. IEEE TKDE 31, 6 (2019), 1208--1221.

[153]

Dezhao Song. 2012. Scalable and domain-independent entity coreference: Establishing high quality data linkages across heterogeneous data sources. In Proceedings of the 11th International Semantic Web Conference (ISWC'12). 424--432.

[154]

Dezhao Song and Jeff Heflin. 2011. Automatically generating data linkages using a domain-independent candidate selection approach. In Proceedings of the 10th International Semantic Web Conference (ISWC'11). 649--664.

[155]

Dezhao Song, Yi Luo, and Jeff Heflin. 2017. Linking heterogeneous data in the semantic web using scalable and domain-independent candidate selection. IEEE TKDE 29, 1 (2017), 143--156.

[156]

Kostas Stefanidis, Vassilis Christophides, and Vasilis Efthymiou. 2017. Web-scale blocking, iterative and progressive entity resolution. In Proceedings of the 33rd IEEE International Conference on Data Engineering (ICDE'17). 1459--1462.

[157]

Kostas Stefanidis, Vasilis Efthymiou, Melanie Herschel, and Vassilis Christophides. 2014. Entity resolution in the web of data. In Companion Volume of the 23rd International World Wide Web Conference (WWW'14). 203--204.

[158]

Rebecca C. Steorts, Samuel L. Ventura, Mauricio Sadinle, and Stephen E. Fienberg. 2014. A comparison of blocking methods for record linkage. In Privacy in Statistical Databases. 253--268.

[159]

Paul Suganthan, Adel Ardalan, AnHai Doan, and Aditya Akella. 2018. Smurf: Self-service string matching using random forests. PVLDB 12, 3 (2018), 278--291.

[160]

Ji Sun, Zeyuan Shang, Guoliang Li, Zhifeng Bao, and Dong Deng. 2019. Balance-aware distributed string similarity-based query processing system. PVLDB 12, 9 (2019), 961--974.

[161]

Ji Sun, Zeyuan Shang, Guoliang Li, Dong Deng, and Zhifeng Bao. 2017. Dima: A distributed in-memory similarity-based query processing system. PVLDB 10, 12 (2017), 1925--1928.

[162]

Wenbo Tao, Dong Deng, and Michael Stonebraker. 2017. Approximate string joins with abbreviations. PVLDB 11, 1 (2017), 53--65.

[163]

Yufei Tao, Ke Yi, Cheng Sheng, and Panos Kalnis. 2009. Quality and efficiency in high dimensional nearest neighbor search. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'09). 563--576.

[164]

Saravanan Thirumuruganathan, Shameem Ahamed Puthiya Parambath, Mourad Ouzzani, Nan Tang, and Shafiq R. Joty. 2018. Reuse and adaptation for entity resolution through transfer learning. CoRR abs/1809.11084 (2018).

[165]

Dinusha Vatsalan, Peter Christen, and Vassilios S. Verykios. 2013. A taxonomy of privacy-preserving record linkage techniques. Inf. Syst. 38, 6 (2013), 946--969.

[166]

Rares Vernica, Michael J. Carey, and Chen Li. 2010. Efficient parallel set-similarity joins using MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD'10). 495--506.

[167]

Julius Volz, Christian Bizer, Martin Gaedke, and Georgi Kobilarov. 2009. Silk: A link discovery framework for the web of data. In Proceedings of the WWW2009 Workshop on Linked Data on the Web (LDOW'09). 53 pages.

[168]

Jiannan Wang, Guoliang Li, and Jianhua Feng. 2010. Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. PVLDB 3, 1 (2010), 1219--1230.

[169]

Jiannan Wang, Guoliang Li, and Jianhua Feng. 2012. Can we beat the prefix filtering?: An adaptive framework for similarity join and search. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD'12). 85--96.

[170]

Jiannan Wang, Guoliang Li, and Jianhua Feng. 2014. Extending string similarity join to tolerant fuzzy token matching. ACM TODS 39, 1 (2014), 7:1--7:45.

[171]

Jin Wang, Chunbin Lin, and Carlo Zaniolo. 2019. MF-Join: Efficient fuzzy string similarity join with multi-level filtering. In Proceedings of the 35th IEEE International Conference on Data Engineering (ICDE'19). 386--397.

[172]

Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. 2014. Hashing for similarity search: A survey. CoRR abs/1408.2927 (2014).

[173]

Jiaying Wang, Xiaochun Yang, Bin Wang, and Chengfei Liu. 2017. LS-Join: Local similarity join on string collections. IEEE TKDE 29, 9 (2017), 1928--1942.

[174]

Pei Wang, Chuan Xiao, Jianbin Qin, Wei Wang, Xiaoyang Zhang, and Yoshiharu Ishikawa. 2016. Local similarity search for unstructured text. In Proceedings of the 2016 ACM International Conference on Management of Data (SIGMOD'16). 1991--2005.

[175]

Qing Wang, Mingyuan Cui, and Huizhi Liang. 2016. Semantic-aware blocking for entity resolution. IEEE TKDE 28, 1 (2016), 166--180.

[176]

Wei Wang, Jianbin Qin, Chuan Xiao, Xuemin Lin, and Heng Tao Shen. 2013. VChunkJoin: An efficient algorithm for edit similarity joins. IEEE TKDE 25, 8 (2013), 1916--1929.

[177]

Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang. 2017. Leveraging set relations in exact set similarity join. PVLDB 10, 9 (2017), 925--936.

[178]

Steven Euijong Whang, David Marmaros, and Hector Garcia-Molina. 2013. Pay-as-you-go entity resolution. IEEE TKDE 25, 5 (2013), 1111--1124.

[179]

Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin Theobald, and Hector Garcia-Molina. 2009. Entity resolution with iterative blocking. In SIGMOD. 219--232.

[180]

Chuan Xiao, Wei Wang, and Xuemin Lin. 2008. Ed-Join: An efficient algorithm for similarity joins with edit distance constraints. PVLDB 1, 1 (2008), 933--944.

[181]

Chuan Xiao, Wei Wang, Xuemin Lin, and Haichuan Shang. 2009. Top-k set similarity joins. In ICDE. 916--927.

[182]

Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu. 2008. Efficient similarity joins for near duplicate detection. In WWW. 131--140.

[183]

Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, and Guoren Wang. 2011. Efficient similarity joins for near-duplicate detection. ACM TODS 36, 3 (2011), 15:1--15:41.

[184]

Pengfei Xu and Jiaheng Lu. 2019. Towards a unified framework for string similarity joins. PVLDB 12, 11 (2019), 1289--1302.

[185]

Su Yan, Dongwon Lee, Min-Yen Kan, and C. Lee Giles. 2007. Adaptive sorted neighborhood methods for efficient record linkage. In JCDL. 185--194.

[186]

Wei Yan, Yuan Xue, and Bradley Malin. 2013. Scalable load balancing for Mapreduce-based record linkage. In IPCCC. 1--10.

[187]

Minghe Yu, Guoliang Li, Dong Deng, and Jianhua Feng. 2016. String similarity search and join: A survey. Frontiers Comput. Sci. 10, 3 (2016), 399--417.

[188]

Minghe Yu, Jin Wang, Guoliang Li, Yong Zhang, Dong Deng, and Jianhua Feng. 2017. A unified framework for string similarity search with edit-distance constraint. VLDB J. 26, 2 (2017), 249--274.

[189]

Xingliang Yuan, Xinyu Wang, Cong Wang, Chenyun Yu, and Sarana Nutanong. 2017. Privacy-preserving similarity joins over encrypted data. IEEE TIFS 12, 11 (2017), 2763--2775.

[190]

Jiaqi Zhai, Yin Lou, and Johannes Gehrke. 2011. ATLAS: A probabilistic algorithm for high dimensional similarity search. In SIGMOD. 997--1008.

[191]

Fulin Zhang, Zhipeng Gao, and Kun Niu. 2017. A pruning algorithm for meta-blocking based on cumulative weight. J. Phys. Conf. Ser. 887, 1 (2017), 012058. https://iopscience.iop.org/article/10.1088/1742-6596/887/1/012058.

[192]

Yong Zhang, Xiuxing Li, Jin Wang, Ying Zhang, Chunxiao Xing, and Xiaojie Yuan. 2017. An efficient framework for exact set similarity search using tree structure indexes. In ICDE. 759--770.

[193]

Yong Zhang, Jiacheng Wu, Jin Wang, and Chunxiao Xing. 2020. A transformation-based framework for knn set similarity search. IEEE TKDE 32, 3 (2020), 409--423.

[194]

Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, and Divesh Srivastava. 2010. Bed-tree: An all-purpose index structure for string similarity search based on edit distance. In SIGMOD. 915--926.

[195]

Erkang Zhu, Dong Deng, Fatemeh Nargesian, and Renée J. Miller. 2019. JOSIE: Overlap set similarity search for finding joinable tables in data lakes. In SIGMOD. 847--864.

Cited By

Araújo T, Efthymiou V, Christophides V, Pitoura E, Stefanidis K. (2025). TREATS: Fairness-aware entity resolution over streaming data. Information Systems 129 (102506). Online publication date: Mar-2025.
https://doi.org/10.1016/j.is.2024.102506
Han S, Wang Z, Shen D, Wang C. (2024). A Parallel Multi-Party Privacy-Preserving Record Linkage Method Based on a Consortium Blockchain. Mathematics 12:12 (1854). Online publication date: 14-Jun-2024.
https://doi.org/10.3390/math12121854
Hagan N, Talburt J, Anderson K, Hagan D. (2024). A scalable MapReduce-based design of an unsupervised entity resolution system. Frontiers in Big Data 7. Online publication date: 1-Mar-2024.
https://doi.org/10.3389/fdata.2024.1296552
Index Terms

Blocking and Filtering Techniques for Entity Resolution: A Survey
1. General and reference
  1. Document types
    1. Surveys and overviews
2. Information systems
  1. Data management systems
    1. Information integration
      1. Entity resolution

Information & Contributors

Information

Published In

ACM Computing Surveys Volume 53, Issue 2

March 2021

848 pages

ISSN:0360-0300

EISSN:1557-7341

DOI:10.1145/3388460

Editor:
Albert Zomaya
University of Sydney, Australia

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 March 2020

Accepted: 01 December 2019

Revised: 01 October 2019

Received: 01 February 2019

Published in CSUR Volume 53, Issue 2

Qualifiers

Survey
Refereed

Funding Sources

EU H2020 projects ExtremeEarth and SmartDataLake

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 112
Total Downloads: 2,003
Downloads (Last 12 months): 276
Downloads (Last 6 weeks): 41

Reflects downloads up to 24 Dec 2024

Citations

Cited By

Ioannou E, Nikoletos K, Papadakis G. (2024). pyJedAI: A Library with Resolution-Related Structures and Procedures for Products. INFORMS Journal on Computing. Online publication date: 9-Sep-2024.
https://doi.org/10.1287/ijoc.2023.0410
Huang Z. (2024). Disambiguate Entity Matching using Large Language Models through Relation Discovery. Proceedings of the Conference on Governance, Understanding and Integration of Data for Effective and Responsible AI (36-39). Online publication date: 9-Jun-2024.
https://dl.acm.org/doi/10.1145/3665601.3669844
Backes T, Dietze S. (2024). Connected Components for Scaling Partial-order Blocking to Billion Entities. Journal of Data and Information Quality 16:1 (1-29). Online publication date: 19-Mar-2024.
https://dl.acm.org/doi/10.1145/3646553
Yan M, Wang Y, Pang K, Xie M, Li J, Baeza-Yates R, Bonchi F. (2024). Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (3690-3701). Online publication date: 25-Aug-2024.
https://dl.acm.org/doi/10.1145/3637528.3671873
Xiong K, Xu X, Fu S, Weng D, Wang Y, Wu Y. (2024). JsonCurer: Data Quality Management for JSON Based on an Aggregated Schema. IEEE Transactions on Visualization and Computer Graphics 30:6 (3008-3021). Online publication date: Jun-2024.
https://doi.org/10.1109/TVCG.2024.3388556
Fan M, Han X, Fan J, Chai C, Tang N, Li G, Du X. (2024). Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3696-3709). Online publication date: 13-May-2024.
https://doi.org/10.1109/ICDE60146.2024.00284
Shahbazi N, Wang J, Miao Z, Bhutani N. (2024). Fairness-Aware Data Preparation for Entity Matching. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3476-3489). Online publication date: 13-May-2024.
https://doi.org/10.1109/ICDE60146.2024.00268