Abstract
Multipartite entity resolution aims at integrating records from multiple datasets into one entity. We derive a mathematical formulation for a general class of record linkage problems in multipartite entity resolution across many datasets as a combinatorial optimization problem known as the multidimensional assignment problem. As a motivation for our approach, we illustrate the advantage of multipartite entity resolution over sequential bipartite matching. Because the optimization problem is NP-hard, we apply two heuristic procedures, a Greedy algorithm and very large scale neighborhood search, to solve the assignment problem and find the most likely matching of records from multiple datasets into a single entity. We evaluate and compare the performance of these algorithms and their modifications on synthetically generated data. We perform computational experiments to compare performance of recent heuristic, the very large-scale neighborhood search, with a Greedy algorithm, another heuristic for the MAP, as well as with two versions of genetic algorithm, a general metaheuristic. Importantly, we perform experiments to compare two alternative methods of re-starting the search for the former heuristic, specifically a random-sampling multi-start and a deterministic design-based multi-start. We find evidence that design-based multi-start can be more efficient as the size of databases grow large. In addition, we show that very large scale search, especially its multi-start version, outperforms simple Greedy heuristic. Hybridization of Greedy search with very large scale neighborhood search improves the performance. Using multi-start with as few as three additional runs of very large scale search offers some improvement in the performance of the very large scale search procedure. Last, we propose an approach to evaluating complexity of the very large-scale neighborhood search.
Similar content being viewed by others
Data Availibility
All data generated or analysed during this study are included in this published article.
References
Adar, E., Hurst, M., Finin, T., Glance, N.S., Nicolov, N., Tseng, B.L. (eds).: Proceedings of the Third International Conference on Weblogs and Social Media, ICWSM 2009, San Jose, California, USA, May 17–20, 2009. The AAAI Press (2009)
Arbib, C., Pacciarelli, D., Smriglio, S.: A three-dimensional matching model for perishable production scheduling. Discrete Appl. Math. 92(1), 1–15 (1999)
Balas, E., Landweer, P.R.: Traffic assignment in communication satellites. Oper. Res. Lett. 2(4), 141–147 (1983)
Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)
Brizan, D., Tansel, A.: A survey of entity resolution and record linkage methodologies. Commun. IIMA 6(3), 41–50 (2006)
Burkard, R., Dell’Amico, M., Martello, S.: Assignment problems, revised reprint, volume 106. SIAM (2012)
Burkard, R.E., Cela, E.: Linear assignment problems and extensions. In: Handbook of combinatorial optimization, pp. 75–149. Springer (1999)
Chu, X., Ilyas, I.F., Koutris, P.: Distributed data deduplication. Proc. VLDB Endow. 9(11), 864–875 (2016)
Crama, Y., Flippo, O.E., Van de Klundert, J., Spieksma, F.C.: The assembly of printed circuit boards: a case with multiple machines and multiple board types. Eur. J. Oper. Res. 98(3), 457–472 (1997)
Crama, Y., Oerlemans, A.G., Spieksma, F.C.: Production Planning in Automated Manufacturing. Springer, Berlin (2012)
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
Firmani, D., Saha, B., Srivastava, D.: Online entity resolution using an oracle. Proc. VLDB Endow. 9(5), 384–395 (2016)
Frieze, A., Yadegar, J.: An algorithm for solving 3-dimensional assignment problems with application to scheduling a teaching practice. J. Oper. Res. Soc. 989–995 (1981)
Garey, M.R., Johnson, D.S.: Computers and Itractability, volume 174. Feeman San Francisco (1979)
Gilbert, K.C., Hofstra, R.B.: An algorithm for a class of three-dimensional assignment problems arising in scheduling applications. IIE Trans. 19(1), 29–33 (1987)
Gilbert, K.C., Hofstra, R.B.: Multidimensional assignment problems. Decis. Sci. 19(2), 306–321 (1988)
Gokhale, C., Das, S., Doan, A., Naughton, J.F., Rampalli, N., Shavlik, J., Zhu, X.: Corleone: Hands-off crowdsourcing for entity matching. In: roceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, pp. 601–612, New York, NY, USA. ACM (2014)
Guo, S., Dong, X.L., Srivastava, D., Zajac, R.: Record linkage with uniqueness constraints and erroneous values. Proc. VLDB Endow. 3(1–2), 417–428 (2010)
Gutin, G., Goldengorin, B., Huang, J.: Worst case analysis of max-regret, greedy and other heuristics for multidimensional assignment and traveling salesman problems. J. Heurist. 14(2), 169–181 (2008)
He, J., Liu, H., Lau, R.Y.K., He, J.: Relationship identification across heterogeneous online social networks. Comput. Intell. (2016)
Helbing, D., Balietti, S.: From social data mining to forecasting socio-economic crises. Eur. Phys. J. Spec. Top. 195(1), 3 (2011)
Hilton, A.: The reconstruction of latin squares with applications to school timetabling and to experimental design. In: Combinatorial Optimization II, pp. 68–77. Springer (1980)
Jain, P., Kumaraguru, P., Joshi, A.: @i seek ’fb.me’: identifying users across multiple online social networks. In: Proceedings of the 22Nd International Conference on World Wide Web, WWW ’13 Companion, pp. 1259–1268, New York, NY, USA. ACM (2013)
Kammerdiner, A.: Ranking risk exposures for situational surveillance of falls with sensors. Oper. Res. Health Care 7, 132–137 (2015)
Kammerdiner, A., Krokhmal, P., Pardalos, P.: Characteristics of the distribution of hamming distance values between multidimensional assignment problem solutions. Advances in Cooperative Control and Optimization, pp. 339–352 (2007)
Kammerdiner, A., Vaughan, C.F.: Very large-scale neighborhood search for the multidimensional assignment problem. In: Butenko, S., Pardalos, P.M., Shylo, V. (eds.), Optimization Methods and Applications. Springer (2017)
Kammerdiner, A.R.: Multidimensional assignment problem multidimensional assignment problem. In: Encyclopedia of Optimization, pp. 2396–2402. Springer (2008)
Kammerdiner, A.R., Guererro, A.N.: Data-driven combinatorial optimization for sensor-based assessment of near falls. Ann. Oper. Res. 276(1–2), 137–153 (2019)
Kammerdiner, A.R., Mucherino, A., Pardalos, P.M.: Application of monkey search meta-heuristic to solving instances of the multidimensional assignment problem. In: Optimization and Cooperative Control Strategies, pp. 385–397. Springer (2009)
Karapetyan, D., Gutin, G.: Local search heuristics for the multidimensional assignment problem. J. Heuristics 17(3), 201–249 (2011)
Karp, R.M.: Reducibility among combinatorial problems. In: Complexity of Computer Computations, pp. 85–103. Springer (1972)
Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB Endow. 3(1–2), 484–493 (2010)
Krokhmal, P.A.: On optimality of a polynomial algorithm for random linear multidimensional assignment problem. Optim. Lett. 5(1), 153–164 (2011)
Li, F., Lee, M.L., Hsu, W., Tan, W.-C.: Linking temporal records for profiling entities. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pp. 593–605, New York, NY, USA. ACM (2015)
Nguyen, D.M., Le Thi, H.A., Dinh, T.P.: Solving the multidimensional assignment problem by a cross-entropy method. J. Comb. Optim. 27(4), 808–823 (2014)
Papadakis, G., Svirsky, J., Gal, A., Palpanas, T.: Comparative analysis of approximate blocking techniques for entity resolution. Proc. VLDB Endow. 9(9), 684–695 (2016)
Pasiliao, E.L., Jr.: Local neighborhoods for the multidimensional assignment problem. In: Dynamics of Information Systems, pp. 353–371. Springer (2010)
Pierskalla, W.P.: The tri-substitution method for the three-dimensional assignment problem. CORS J. 5, 71–81 (1967)
Pierskalla, W.P.: Letter to the editor-the multidimensional assignment problem. Oper. Res. 16(2), 422–431 (1968)
Poore, A., Rijavec, N., Liggins, M., Vannicola, V.: Data association problems posed as multidimensional assignment problems: problem formulation. In: Optical Engineering and Photonics in Aerospace Sensing, pp. 552–563. International Society for Optics and Photonics (1993)
Puglisi, S., Rebollo-Monedero, D., Forné, J.: On Web user tracking: how third-party http requests track users’ browsing patterns for personalised advertising. In: 2016 Mediterranean Ad Hoc Networking Workshop (Med-Hoc-Net), pp. 1–6 (2016)
Pusztaszeri, J.-F.: The nonlinear assignment problem in experimental high energy physics. In: Nonlinear Assignment Problems, pp. 55–89. Springer (2000)
Pusztaszeri, J.-F., Rensing, P.E., Liebling, T.M.: Tracking elementary particles near their primary vertex: a combinatorial approach. J. Global Optim. 9(1), 41–64 (1996)
Riederer, C., Kim, Y., Chaintreau, A., Korula, N., Lattanzi, S.: Linking users across domains with location data: theory and validation. In: Proceedings of the 25th International Conference on World Wide Web, WWW ’16, pp. 707–719, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee (2016)
Sagi, T., Gal, A., Barkol, O., Bergman, R., Avram, A.: Multi-source uncertain entity resolution at yad vashem: Transforming holocaust victim reports into people. In: Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, pp. 807–819, New York, NY, USA. ACM (2016)
Semenov, A., Veijalainen, J.: A modelling framework for social media monitoring. Int. J. Web Eng. Technol. 8(3), 217–249 (2013). (PMID: 57226)
Silva, R.M., Resende, M.G., Pardalos, P.M.: Finding multiple roots of a box-constrained system of nonlinear equations with a biased random-key genetic algorithm. J. Global Optim. 60(2), 289–306 (2014)
Tang, J., Zafarani, R., Shu, K., Wang, S., Liu, H.: User identity linkage across online social netwroks: A review. In: To appear in SIGKDD Explorations (2016)
Vogiatzis, C., Pasiliao, E.L., Pardalos, P.M.: Graph partitions for the multidimensional assignment problem. Comput. Optim. Appl. 58(1), 205–224 (2014)
Wang, J., Kraska, T., Franklin, M.J., Feng, J.: CrowdER: crowdsourcing entity resolution. Proc. VLDB Endow. 5(11), 1483–1494 (2012)
Ye, T., Lauw, H.W.: Structural constraints for multipartite entity resolution with markov logic network. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, pp. 1691–1694, New York, NY, USA. ACM (2015)
Zhang, D., Rubinstein, B.I.P., Gemmell, J.: Principled graph matching algorithms for integrating multiple data sources. IEEE Trans. Knowl. Data Eng. 27(10), 2784–2796 (2015)
Zhang, J., Yu, P.S.: Multiple anonymized social networks alignment. In: 2015 IEEE International Conference on Data Mining, pp. 599–608 (2015)
Zhou, X., Liang, X., Zhang, H., Ma, Y.: Cross-platform identification of anonymous identical users in multiple social media networks. IEEE Trans. Knowl. Data Eng. 28(2), 411–424 (2016)
Zhou, X.-H., Gao, S.: Confidence intervals for the log-normal mean. Stat. Med. 16(7), 783–790 (1997)
Acknowledgements
A. Kammerdiner was supported by the AFRL (National Research Council Fellowship). The work of A. Semenov was funded in part by the AFRL European Office of Aerospace Research and Development (Grant FA9550-17-1-0030, University of Jyvaskyla, Finland).
Author information
Authors and Affiliations
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Kammerdiner, A., Semenov, A. & Pasiliao, E.L. Multidimensional Assignment Problem for Multipartite Entity Resolution. J Glob Optim 84, 491–523 (2022). https://doi.org/10.1007/s10898-022-01141-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10898-022-01141-3