Abstract
Entity resolution is an important data-cleaning task that detects records belonging to the same entity. It is critical in digital libraries, where different entities may share the same name and no identifying key is available. Conventional methods use similarity measures and clustering techniques to find the records of a specific entity. Because these methods perform poorly, recent methods instead build rules over record attributes whose values are distinct for each entity. However, these rule-based methods use too few attributes and ignore common and empty attribute values, which degrades the quality of entity resolution. In this paper, we define a multi-attributes weighted rule system (MAWR) that considers all values of the records' attributes in order to represent the difficult record-entity mapping. We then propose a rule generation algorithm based on this system, and an entity resolution algorithm (MAWR-ER) that uses the generated rules to identify entities. We evaluate our method on real data, and the experimental results demonstrate the effectiveness and efficiency of the proposed method.
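To make the idea of weighted attribute rules concrete, the following is a minimal sketch, not the MAWR or MAWR-ER algorithms themselves, whose details are not given in the abstract. It assumes that each entity carries a set of hypothetical rules mapping an (attribute, value) pair to a weight, that a record's score for an entity is the sum of the weights of its matching values, and that an empty attribute simply contributes nothing. The entity names, attribute names, weights, and the threshold are all illustrative assumptions.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical rule format: for each entity, (attribute, value) -> weight.
EntityRules = Dict[Tuple[str, str], float]

# Illustrative, made-up rules for two entities that share the same name.
RULES: Dict[str, EntityRules] = {
    "entity_1": {("coauthor", "Coauthor A"): 0.6, ("venue", "Venue X"): 0.1},
    "entity_2": {("coauthor", "Coauthor B"): 0.7, ("venue", "Venue X"): 0.2},
}


def score(record: Dict[str, List[str]], entity_rules: EntityRules) -> float:
    """Sum the weights of all rules matched by the record's attribute values."""
    total = 0.0
    for attribute, values in record.items():
        # An empty attribute has no values, so it adds nothing to the score.
        for value in values:
            total += entity_rules.get((attribute, value), 0.0)
    return total


def resolve(
    record: Dict[str, List[str]],
    rules: Dict[str, EntityRules],
    threshold: float = 0.5,  # assumed cut-off, chosen only for illustration
) -> Optional[str]:
    """Return the best-scoring entity, or None if no score reaches the threshold."""
    if not rules:
        return None
    scores = {entity: score(record, r) for entity, r in rules.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

In this sketch, a record with coauthor "Coauthor A" and venue "Venue X" is assigned to `entity_1`, while a record that has only the common venue value and an empty coauthor list stays unassigned, because a low-weight common value alone does not reach the threshold.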
Acknowledgements
This paper was partially supported by NSFC Grant U1509216, the National Key Research and Development Program of China 2016YFB1000703, NSFC Grants 61472099 and 61602129, National Sci-Tech Support Plan 2015BAH10F01, and the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province LC2016026. The authors would like to thank Prof. Hong Gao and Prof. Jianzhong Li for their support in this work, and the anonymous reviewers for their valuable comments, which greatly improved this paper.
About this article
Cite this article
Abu Ahmad, H., Wang, H. An effective weighted rule-based method for entity resolution. Distrib Parallel Databases 36, 593–612 (2018). https://doi.org/10.1007/s10619-018-7240-6
DOI: https://doi.org/10.1007/s10619-018-7240-6