Abstract
Entity resolution (ER) is a computationally hard problem of data integration scenarios, where database records have to be grouped according to the real-world entities they belong to. In practice these entities may consist of only a few records from different data sources with typos or historical data. In other cases they may contain significantly more records, especially when we search for entities on a higher level of a concept hierarchy than records.
In this paper we give theoretical foundation of a variety of practically important match functions. We show that under these formulations, ER with large entities can be solved efficiently with algorithms based on MapReduce, a distributed computing paradigm. Our algorithm can efficiently incorporate probabilistic and similarity-based record match, enabling flexible match function definition. We demonstrate the usability of our model and algorithm in a real-world insurance ER scenario, where we identify household groups of client records.
This work was supported by the EU FP7 SEC project SCIIMS (Ref. 218223).
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Benjelloun, O., Garcia-Molina, H., Gong, H., Kawai, H., Larson, T.E., Menestrina, D., Thavisomboon, S.: D-Swoosh: A family of algorithms for generic, distributed entity resolution. In: Proc. 27th Int. Conf. on Distributed Computing Systems (2007)
Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)
Bhattacharya, I., Getoor, L.: A Latent dirichlet model for unsupervised entity resolution. In: SIAM International Conference on Data Mining, pp. 47–58 (2006)
Bhattacharya, I., Getoor, L.: Collective entity resolution in relational data. ACM Trans. Knowl. Discov. Data 1(1), 5 (2007)
Bhattacharya, I., Getoor, L., Licamele, L.: Query-time entity resolution. In: Proc. 12th ACM SIGKDD, pp. 529–534 (2006)
Bhattacharya, I., Godbole, S., Joshi, S.: Structured entity identification and document categorization: two tasks with one joint model. In: KDD 2008: Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 25–33. ACM, New York (2008)
Bilenko, M., Mooney, R.: Adaptive duplicate detection using learnable string similarity measures. In: Proc. 9th ACM SIGKDD, pp. 39–48 (2003)
Boley, M., Horváth, T., Poigné, A., Wrobel, S.: Efficient Closed Pattern Mining in Strongly Accessible Set Systems (Extended Abstract). In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) PKDD 2007. LNCS (LNAI), vol. 4702, pp. 382–389. Springer, Heidelberg (2007)
Chaudhuri, S., Sarma, A.D., Ganti, V., Kaushik, R.: Leveraging aggregate constraints for deduplication. In: SIGMOD 2007, pp. 437–448. ACM (2007)
Chen, S., Borthwick, A., Carvalho, V.R.: The case for cost-sensitive and easy-to-interpret models in industrial record linkage. In: 9th International Workshop on Quality in Databases (2011)
Christen, P.: Automatic record linkage using seeded nearest neighbour and support vector machine classification. In: KDD 2008, pp. 151–159. ACM (2008)
Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. In: IEEE TKDE preprint (2011)
Christen, P., Churches, T., Hegland, M.: Febrl – A Parallel Open Source Data Linkage System. In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS (LNAI), vol. 3056, pp. 638–647. Springer, Heidelberg (2004)
Christen, P., Gayler, R., Hawking, D.: Similarity-aware indexing for real-time entity resolution. In: CIKM 2009, pp. 1565–1568. ACM (2009)
Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms, 2nd edn. MIT Press (2001)
Dean, J., Ghemawat, S.: Mapreduce: Simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008)
Elmagarmid, A., Ipeirotis, P., Verykios, V.: Duplicate Record Detection: A Survey. IEEE TKDE, 1–16 (2007)
Fellegi, I., Sunter, A.: A theory for record linkage. Journal of the American Statistical Association 64(328), 1183–1210 (1969)
Getoor, L., Diehl, C.: Link mining: a survey. ACM SIGKDD Explorations Newsletter 7(2), 3–12 (2005)
Gravano, L., Ipeirotis, P., Jagadish, H., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001)
Guo, S., Dong, X.L., Srivastava, D., Zajac, R.: Record linkage with uniqueness constraints and erroneous values. Proc. VLDB Endow. 3, 417–428 (2010)
Hall, R., Sutton, C., McCallum, A.: Unsupervised deduplication using cross-field dependencies. In: KDD 2008, pp. 310–317. ACM (2008)
Han, H., Xu, W., Zha, H., Giles, C.: A hierarchical naive Bayes mixture model for name disambiguation in author citations. In: Proc. 2005 ACM Symposium on Applied Computing, pp. 1065–1069 (2005)
Hernández, M., Stolfo, S.: Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery 2(1), 9–37 (1998)
Kang, U., Tsourakakis, C., Faloutsos, C.: Pegasus: A peta-scale graph mining system implementation and observations. In: ICDM, pp. 229–238. IEEE (2009)
Kim, H.-S., Lee, D.: Parallel linkage. In: CIKM 2007. ACM (2007)
Kirsten, T., Kolb, L., Hartung, M., Gross, A., Köpcke, H., Rahm, E.: Data partitioning for parallel entity matching. Computing Research Repository (2010)
Köpcke, H., Rahm, E.: Training selection for tuning entity matching. In: QDB/MUD, pp. 3–12 (2008)
Köpcke, H., Rahm, E.: Frameworks for entity matching: A comparison. Data Knowl. Eng. 69, 197–210 (2010)
Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB Endow. 3, 484–493 (2010)
McCarthy, J., Lehnert, W.: Using decision trees for coreference resolution. In: Proc. 14th Int. Conf. on Artificial Intelligence, pp. 1050–1055 (1995)
Menestrina, D., Benjelloun, O., Garcia-Molina, H.: Generic entity resolution with data confidences. In: CleanDB Workshop, pp. 25–32 (2006)
Menestrina, D., Whang, S.E., Garcia-Molina, H.: Evaluating entity resolution results. Proc. VLDB Endow. 3, 208–219 (2010)
Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: SIGKDD, pp. 269–278 (2002)
Sidló, C.I.: Generic Entity Resolution in Relational Databases. In: Grundspenkis, J., Morzy, T., Vossen, G. (eds.) ADBIS 2009. LNCS, vol. 5739, pp. 59–73. Springer, Heidelberg (2009)
Sidló, C.I.: Entity resolution with heavy indexing. In: Proc. ADBIS, CEUR Workshop Proceedings (2011)
Sidló, C.I., Garzó, A., Molnár, A., Benczúr, A.A.: Infrastructures and bounds for distributed entity resolution. In: 9th International Workshop on Quality in Databases (2011)
Talburt, J.R.: Entity Resolution and Information Quality, 1st edn. Morgan Kaufmann (2010)
Weis, M., Naumann, F., Jehle, U., Lufter, J., Schuster, H.: Industry-scale duplicate detection. Proc. of the VLDB Endow. 1(2), 1253–1264 (2008)
Whang, S.E., Garcia-Molina, H.: Entity resolution with evolving rules. Proc. VLDB Endow. 3, 1326–1337 (2010)
Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. In: Proc. 35th Int. Conf. on Management of Data, pp. 219–232. ACM (2009)
White, T.: Hadoop: The Definitive Guide. Yahoo Press (2010)
Wick, M.L., Rohanimanesh, K., Schultz, K., McCallum, A.: A unified approach for schema matching, coreference and canonicalization. In: KDD 2008, pp. 722–730. ACM (2008)
Yakout, M., Elmagarmid, A.K., Elmeleegy, H., Ouzzani, M., Qi, A.: Behavior based record linkage. Proc. VLDB Endow. 3, 439–448 (2010)
Zhang, Z., Hadjieleftheriou, M., Ooi, B.C., Srivastava, D.: Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In: SIGMOD, pp. 915–926. ACM (2010)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Molnár, A.J., Benczúr, A.A., Sidló, C.I. (2012). Flexible and Efficient Distributed Resolution of Large Entities. In: Lukasiewicz, T., Sali, A. (eds) Foundations of Information and Knowledge Systems. FoIKS 2012. Lecture Notes in Computer Science, vol 7153. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28472-4_14
Download citation
DOI: https://doi.org/10.1007/978-3-642-28472-4_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28471-7
Online ISBN: 978-3-642-28472-4
eBook Packages: Computer ScienceComputer Science (R0)