
An Overview of End-to-End Entity Resolution for Big Data

Published: 06 December 2020

Abstract

One of the most critical tasks for improving data quality and increasing the reliability of data analytics is Entity Resolution (ER), which aims to identify different descriptions that refer to the same real-world entity. Despite several decades of research, ER remains a challenging problem. In this survey, we highlight the novel aspects of resolving Big Data entities when more than one of the Big Data characteristics must be addressed simultaneously (i.e., Volume and Velocity together with Variety). We present the basic concepts, processing steps, and execution strategies that have been proposed by the database, Semantic Web, and machine learning communities to cope with the loose structuredness, extreme diversity, high speed, and large scale of entity descriptions used by real-world applications. We provide an end-to-end view of ER workflows for Big Data, critically review the pros and cons of existing methods, and conclude with the main open research directions.

[184]
Jiannan Wang, Guoliang Li, Jeffrey Xu Yu, and Jianhua Feng. 2011. Entity matching: How similar is similar. PVLDB 4, 10 (2011), 622--633.
[185]
Sibo Wang, Xiaokui Xiao, and Chun-Hee Lee. 2015. Crowd-based deduplication: An adaptive approach. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD’15). 1263--1277.
[186]
Xiaolan Wang, Laura M. Haas, and Alexandra Meliou. 2018. Explaining data integration. IEEE Data Eng. Bull. 41, 2 (2018), 47--58.
[187]
Yihan Wang, Shaoxu Song, Lei Chen, Jeffrey Xu Yu, and Hong Cheng. 2017. Discovering conditional matching rules. TKDD 11, 4 (2017), 46:1--46:38.
[188]
Zhichun Wang, Qingsong Lv, Xiaohan Lan, and Yu Zhang. 2018. Cross-lingual knowledge graph alignment via graph convolutional networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 349--357.
[189]
Melanie Weis and Felix Naumann. 2004. Detecting duplicate objects in XML documents. In Proceedings of the International Workshop on Information Quality in Information Systems (IQIS’04). 10--19.
[190]
Melanie Weis and Felix Naumann. 2006. Detecting duplicates in complex XML data. In Proceedings of the 22nd International Conference on Data Engineering (ICDE’06). 109.
[191]
Michael J. Welch, Aamod Sane, and Chris Drome. 2012. Fast and accurate incremental entity resolution relative to an entity knowledge base. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM’12). 2667--2670.
[192]
Steven Euijong Whang, Peter Lofgren, and Hector Garcia-Molina. 2013. Question selection for crowd entity resolution. PVLDB 6, 6 (2013), 349--360.
[193]
Steven Euijong Whang, David Marmaros, and Hector Garcia-Molina. 2013. Pay-as-you-go entity resolution. IEEE TKDE 25, 5 (2013), 1111--1124.
[194]
S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina. 2009. Entity resolution with iterative blocking. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data (SIGMOD’09). 219--232.
[195]
Derry Tanti Wijaya and Stéphane Bressan. 2009. Ricochet: A family of unconstrained algorithms for graph clustering. In Proceedings of the 14th International Conference on Database Systems for Advanced Applications (DASFAA’09). 153--167.
[196]
Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1, 2 (1989), 270--280.
[197]
Vijaya Krishna Yalavarthi, Xiangyu Ke, and Arijit Khan. 2017. Select your questions wisely: For entity resolution with crowd errors. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (CIKM’17). 317--326.
[198]
Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, and Michal Batko. 2006. Similarity Search—The Metric Space Approach. Kluwer.
[199]
Chen Jason Zhang, Rui Meng, Lei Chen, and Feida Zhu. 2015. CrowdLink: An error-tolerant model for linking complex records. In Proceedings of the 2nd International Workshop on Exploratory Search in Databases and the Web (ExploreDB’15). 15--20.
[200]
Fulin Zhang, Zhipeng Gao, and Kun Niu. 2017. A pruning algorithm for Meta-blocking based on cumulative weight. In JPCS, Vol. 887.
[201]
Qingheng Zhang, Zequn Sun, Wei Hu, Muhao Chen, Lingbing Guo, and Yuzhong Qu. 2019. Multi-view knowledge graph embedding for entity alignment. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI’19). 5429--5435.
[202]
Wei Zhang, Hao Wei, Bunyamin Sisman, Xin Luna Dong, Christos Faloutsos, and David Page. 2020. AutoBlock: A hands-off blocking framework for entity matching. In The 13th ACM International Conference on Web Search and Data Mining (WSDM’20). 744--752.
[203]
Chen Zhao and Yeye He. 2019. Auto-EM: End-to-end fuzzy entity-matching using pre-trained deep models and transfer learning. In The World Wide Web Conference (WWW’19). 2413--2424.
[204]
Qibin Zheng, Xingchun Diao, Jianjun Cao, Xiaolei Zhou, Yi Liu, and Hongmei Li. 2018. Multi-modal space structure: A new kind of latent correlation for multi-modal entity resolution. CoRR abs/1804.08010.
[205]
Hao Zhu, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2017. Iterative entity alignment via joint knowledge embeddings. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI’17). 4258--4264.

Published In

ACM Computing Surveys  Volume 53, Issue 6
Invited Tutorial and Regular Papers
November 2021
803 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3441629
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 December 2020
Accepted: 01 August 2020
Revised: 01 July 2020
Received: 01 April 2020
Published in CSUR Volume 53, Issue 6

Author Tags

  1. Entity blocking and matching
  2. batch and incremental entity resolution workflows
  3. block processing
  4. crowdsourcing
  5. deep learning
  6. strongly and nearly similar entities

Qualifiers

  • Research-article
  • Research
  • Refereed
