[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ Skip to main content

Advertisement

Log in

Multidimensional Assignment Problem for Multipartite Entity Resolution

  • Published:
Journal of Global Optimization Aims and scope Submit manuscript

Abstract

Multipartite entity resolution aims at integrating records from multiple datasets into one entity. We derive a mathematical formulation for a general class of record linkage problems in multipartite entity resolution across many datasets as a combinatorial optimization problem known as the multidimensional assignment problem. As a motivation for our approach, we illustrate the advantage of multipartite entity resolution over sequential bipartite matching. Because the optimization problem is NP-hard, we apply two heuristic procedures, a Greedy algorithm and very large scale neighborhood search, to solve the assignment problem and find the most likely matching of records from multiple datasets into a single entity. We evaluate and compare the performance of these algorithms and their modifications on synthetically generated data. We perform computational experiments to compare performance of recent heuristic, the very large-scale neighborhood search, with a Greedy algorithm, another heuristic for the MAP, as well as with two versions of genetic algorithm, a general metaheuristic. Importantly, we perform experiments to compare two alternative methods of re-starting the search for the former heuristic, specifically a random-sampling multi-start and a deterministic design-based multi-start. We find evidence that design-based multi-start can be more efficient as the size of databases grow large. In addition, we show that very large scale search, especially its multi-start version, outperforms simple Greedy heuristic. Hybridization of Greedy search with very large scale neighborhood search improves the performance. Using multi-start with as few as three additional runs of very large scale search offers some improvement in the performance of the very large scale search procedure. Last, we propose an approach to evaluating complexity of the very large-scale neighborhood search.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
£29.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Fig. 1
Fig. 2

Similar content being viewed by others

Data Availibility

All data generated or analysed during this study are included in this published article.

References

  1. Adar, E., Hurst, M., Finin, T., Glance, N.S., Nicolov, N., Tseng, B.L. (eds).: Proceedings of the Third International Conference on Weblogs and Social Media, ICWSM 2009, San Jose, California, USA, May 17–20, 2009. The AAAI Press (2009)

  2. Arbib, C., Pacciarelli, D., Smriglio, S.: A three-dimensional matching model for perishable production scheduling. Discrete Appl. Math. 92(1), 1–15 (1999)

    Article  MathSciNet  Google Scholar 

  3. Balas, E., Landweer, P.R.: Traffic assignment in communication satellites. Oper. Res. Lett. 2(4), 141–147 (1983)

    Article  Google Scholar 

  4. Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)

    Article  Google Scholar 

  5. Brizan, D., Tansel, A.: A survey of entity resolution and record linkage methodologies. Commun. IIMA 6(3), 41–50 (2006)

    Google Scholar 

  6. Burkard, R., Dell’Amico, M., Martello, S.: Assignment problems, revised reprint, volume 106. SIAM (2012)

  7. Burkard, R.E., Cela, E.: Linear assignment problems and extensions. In: Handbook of combinatorial optimization, pp. 75–149. Springer (1999)

  8. Chu, X., Ilyas, I.F., Koutris, P.: Distributed data deduplication. Proc. VLDB Endow. 9(11), 864–875 (2016)

    Article  Google Scholar 

  9. Crama, Y., Flippo, O.E., Van de Klundert, J., Spieksma, F.C.: The assembly of printed circuit boards: a case with multiple machines and multiple board types. Eur. J. Oper. Res. 98(3), 457–472 (1997)

    Article  Google Scholar 

  10. Crama, Y., Oerlemans, A.G., Spieksma, F.C.: Production Planning in Automated Manufacturing. Springer, Berlin (2012)

    MATH  Google Scholar 

  11. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)

    Article  Google Scholar 

  12. Firmani, D., Saha, B., Srivastava, D.: Online entity resolution using an oracle. Proc. VLDB Endow. 9(5), 384–395 (2016)

    Article  Google Scholar 

  13. Frieze, A., Yadegar, J.: An algorithm for solving 3-dimensional assignment problems with application to scheduling a teaching practice. J. Oper. Res. Soc. 989–995 (1981)

  14. Garey, M.R., Johnson, D.S.: Computers and Itractability, volume 174. Feeman San Francisco (1979)

  15. Gilbert, K.C., Hofstra, R.B.: An algorithm for a class of three-dimensional assignment problems arising in scheduling applications. IIE Trans. 19(1), 29–33 (1987)

    Article  Google Scholar 

  16. Gilbert, K.C., Hofstra, R.B.: Multidimensional assignment problems. Decis. Sci. 19(2), 306–321 (1988)

    Article  Google Scholar 

  17. Gokhale, C., Das, S., Doan, A., Naughton, J.F., Rampalli, N., Shavlik, J., Zhu, X.: Corleone: Hands-off crowdsourcing for entity matching. In: roceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, pp. 601–612, New York, NY, USA. ACM (2014)

  18. Guo, S., Dong, X.L., Srivastava, D., Zajac, R.: Record linkage with uniqueness constraints and erroneous values. Proc. VLDB Endow. 3(1–2), 417–428 (2010)

    Article  Google Scholar 

  19. Gutin, G., Goldengorin, B., Huang, J.: Worst case analysis of max-regret, greedy and other heuristics for multidimensional assignment and traveling salesman problems. J. Heurist. 14(2), 169–181 (2008)

    Article  Google Scholar 

  20. He, J., Liu, H., Lau, R.Y.K., He, J.: Relationship identification across heterogeneous online social networks. Comput. Intell. (2016)

  21. Helbing, D., Balietti, S.: From social data mining to forecasting socio-economic crises. Eur. Phys. J. Spec. Top. 195(1), 3 (2011)

    Article  Google Scholar 

  22. Hilton, A.: The reconstruction of latin squares with applications to school timetabling and to experimental design. In: Combinatorial Optimization II, pp. 68–77. Springer (1980)

  23. Jain, P., Kumaraguru, P., Joshi, A.: @i seek ’fb.me’: identifying users across multiple online social networks. In: Proceedings of the 22Nd International Conference on World Wide Web, WWW ’13 Companion, pp. 1259–1268, New York, NY, USA. ACM (2013)

  24. Kammerdiner, A.: Ranking risk exposures for situational surveillance of falls with sensors. Oper. Res. Health Care 7, 132–137 (2015)

    Article  Google Scholar 

  25. Kammerdiner, A., Krokhmal, P., Pardalos, P.: Characteristics of the distribution of hamming distance values between multidimensional assignment problem solutions. Advances in Cooperative Control and Optimization, pp. 339–352 (2007)

  26. Kammerdiner, A., Vaughan, C.F.: Very large-scale neighborhood search for the multidimensional assignment problem. In: Butenko, S., Pardalos, P.M., Shylo, V. (eds.), Optimization Methods and Applications. Springer (2017)

  27. Kammerdiner, A.R.: Multidimensional assignment problem multidimensional assignment problem. In: Encyclopedia of Optimization, pp. 2396–2402. Springer (2008)

  28. Kammerdiner, A.R., Guererro, A.N.: Data-driven combinatorial optimization for sensor-based assessment of near falls. Ann. Oper. Res. 276(1–2), 137–153 (2019)

    Article  MathSciNet  Google Scholar 

  29. Kammerdiner, A.R., Mucherino, A., Pardalos, P.M.: Application of monkey search meta-heuristic to solving instances of the multidimensional assignment problem. In: Optimization and Cooperative Control Strategies, pp. 385–397. Springer (2009)

  30. Karapetyan, D., Gutin, G.: Local search heuristics for the multidimensional assignment problem. J. Heuristics 17(3), 201–249 (2011)

    Article  Google Scholar 

  31. Karp, R.M.: Reducibility among combinatorial problems. In: Complexity of Computer Computations, pp. 85–103. Springer (1972)

  32. Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB Endow. 3(1–2), 484–493 (2010)

    Article  Google Scholar 

  33. Krokhmal, P.A.: On optimality of a polynomial algorithm for random linear multidimensional assignment problem. Optim. Lett. 5(1), 153–164 (2011)

    Article  MathSciNet  Google Scholar 

  34. Li, F., Lee, M.L., Hsu, W., Tan, W.-C.: Linking temporal records for profiling entities. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pp. 593–605, New York, NY, USA. ACM (2015)

  35. Nguyen, D.M., Le Thi, H.A., Dinh, T.P.: Solving the multidimensional assignment problem by a cross-entropy method. J. Comb. Optim. 27(4), 808–823 (2014)

    Article  MathSciNet  Google Scholar 

  36. Papadakis, G., Svirsky, J., Gal, A., Palpanas, T.: Comparative analysis of approximate blocking techniques for entity resolution. Proc. VLDB Endow. 9(9), 684–695 (2016)

    Article  Google Scholar 

  37. Pasiliao, E.L., Jr.: Local neighborhoods for the multidimensional assignment problem. In: Dynamics of Information Systems, pp. 353–371. Springer (2010)

  38. Pierskalla, W.P.: The tri-substitution method for the three-dimensional assignment problem. CORS J. 5, 71–81 (1967)

    MATH  Google Scholar 

  39. Pierskalla, W.P.: Letter to the editor-the multidimensional assignment problem. Oper. Res. 16(2), 422–431 (1968)

    Article  Google Scholar 

  40. Poore, A., Rijavec, N., Liggins, M., Vannicola, V.: Data association problems posed as multidimensional assignment problems: problem formulation. In: Optical Engineering and Photonics in Aerospace Sensing, pp. 552–563. International Society for Optics and Photonics (1993)

  41. Puglisi, S., Rebollo-Monedero, D., Forné, J.: On Web user tracking: how third-party http requests track users’ browsing patterns for personalised advertising. In: 2016 Mediterranean Ad Hoc Networking Workshop (Med-Hoc-Net), pp. 1–6 (2016)

  42. Pusztaszeri, J.-F.: The nonlinear assignment problem in experimental high energy physics. In: Nonlinear Assignment Problems, pp. 55–89. Springer (2000)

  43. Pusztaszeri, J.-F., Rensing, P.E., Liebling, T.M.: Tracking elementary particles near their primary vertex: a combinatorial approach. J. Global Optim. 9(1), 41–64 (1996)

    Article  MathSciNet  Google Scholar 

  44. Riederer, C., Kim, Y., Chaintreau, A., Korula, N., Lattanzi, S.: Linking users across domains with location data: theory and validation. In: Proceedings of the 25th International Conference on World Wide Web, WWW ’16, pp. 707–719, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee (2016)

  45. Sagi, T., Gal, A., Barkol, O., Bergman, R., Avram, A.: Multi-source uncertain entity resolution at yad vashem: Transforming holocaust victim reports into people. In: Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, pp. 807–819, New York, NY, USA. ACM (2016)

  46. Semenov, A., Veijalainen, J.: A modelling framework for social media monitoring. Int. J. Web Eng. Technol. 8(3), 217–249 (2013). (PMID: 57226)

    Article  Google Scholar 

  47. Silva, R.M., Resende, M.G., Pardalos, P.M.: Finding multiple roots of a box-constrained system of nonlinear equations with a biased random-key genetic algorithm. J. Global Optim. 60(2), 289–306 (2014)

    Article  MathSciNet  Google Scholar 

  48. Tang, J., Zafarani, R., Shu, K., Wang, S., Liu, H.: User identity linkage across online social netwroks: A review. In: To appear in SIGKDD Explorations (2016)

  49. Vogiatzis, C., Pasiliao, E.L., Pardalos, P.M.: Graph partitions for the multidimensional assignment problem. Comput. Optim. Appl. 58(1), 205–224 (2014)

    Article  MathSciNet  Google Scholar 

  50. Wang, J., Kraska, T., Franklin, M.J., Feng, J.: CrowdER: crowdsourcing entity resolution. Proc. VLDB Endow. 5(11), 1483–1494 (2012)

    Article  Google Scholar 

  51. Ye, T., Lauw, H.W.: Structural constraints for multipartite entity resolution with markov logic network. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, pp. 1691–1694, New York, NY, USA. ACM (2015)

  52. Zhang, D., Rubinstein, B.I.P., Gemmell, J.: Principled graph matching algorithms for integrating multiple data sources. IEEE Trans. Knowl. Data Eng. 27(10), 2784–2796 (2015)

    Article  Google Scholar 

  53. Zhang, J., Yu, P.S.: Multiple anonymized social networks alignment. In: 2015 IEEE International Conference on Data Mining, pp. 599–608 (2015)

  54. Zhou, X., Liang, X., Zhang, H., Ma, Y.: Cross-platform identification of anonymous identical users in multiple social media networks. IEEE Trans. Knowl. Data Eng. 28(2), 411–424 (2016)

    Article  Google Scholar 

  55. Zhou, X.-H., Gao, S.: Confidence intervals for the log-normal mean. Stat. Med. 16(7), 783–790 (1997)

    Article  Google Scholar 

Download references

Acknowledgements

A. Kammerdiner was supported by the AFRL (National Research Council Fellowship). The work of A. Semenov was funded in part by the AFRL European Office of Aerospace Research and Development (Grant FA9550-17-1-0030, University of Jyvaskyla, Finland).

Author information

Authors and Affiliations

Authors

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Kammerdiner, A., Semenov, A. & Pasiliao, E.L. Multidimensional Assignment Problem for Multipartite Entity Resolution. J Glob Optim 84, 491–523 (2022). https://doi.org/10.1007/s10898-022-01141-3

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10898-022-01141-3

Keywords

Navigation