An in-depth analysis of pre-trained embeddings for entity resolution

Published: 04 December 2024

Abstract

Recent works on entity resolution (ER) leverage deep learning techniques that rely on language models to improve effectiveness. These techniques are used for both blocking and matching, the two main steps of ER. Several language models have been tested in the literature, with fastText and BERT variants being the most popular, yet there is no detailed analysis of their strengths and weaknesses. We fill this gap through a thorough experimental analysis of 12 popular pre-trained language models over 17 established benchmark datasets. First, we examine their relative effectiveness in blocking, unsupervised matching, and supervised matching. We enhance our analysis by also investigating the complementarity and transferability of the language models, and we further explain their relative performance by examining the similarity scores and ranking positions that each model yields. In each task, we compare them with several state-of-the-art techniques from the literature. Then, we investigate their relative time efficiency with respect to vectorization overhead, blocking scalability, and matching run-time. The experiments are carried out in both schema-agnostic and schema-aware settings: in the former, all attribute values of an entity are concatenated into a single representative sentence, whereas in the latter the values of individual attributes are considered separately. Our results provide novel insights into the pros and cons of the main language models, facilitating their use in ER applications.
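To make the schema-agnostic setting concrete, the sketch below serializes each entity into a single sentence, embeds it with a pre-trained language model, and performs blocking via nearest-neighbor search over the embeddings, with the resulting cosine similarities serving as unsupervised-matching scores. This is a minimal illustrative sketch, not the paper's actual pipeline: it assumes the sentence-transformers and faiss-cpu packages, and the MiniLM checkpoint, the toy records, and k=2 are hypothetical choices made for the example.

```python
# Minimal sketch of a schema-agnostic ER pipeline: concatenate all attribute
# values per entity into one sentence, embed it with a pre-trained language
# model, and generate candidate pairs via nearest-neighbor search (blocking).
# Assumptions: sentence-transformers and faiss-cpu are installed; the model
# name, records, and k are illustrative, not the paper's configuration.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer


def serialize(entity: dict) -> str:
    """Schema-agnostic serialization: concatenate all attribute values."""
    return " ".join(str(v) for v in entity.values() if v)


# Two toy entity collections to be resolved against each other.
left = [
    {"title": "iPhone 14 Pro 128GB", "brand": "Apple", "price": "999"},
    {"title": "Galaxy S23 Ultra", "brand": "Samsung", "price": "1199"},
]
right = [
    {"name": "Apple iPhone 14 Pro, 128 GB", "cost": "989.00"},
    {"name": "Samsung Galaxy S23 Ultra 5G", "cost": "1189.99"},
]

model = SentenceTransformer("all-MiniLM-L6-v2")
# With L2-normalized vectors, inner product equals cosine similarity.
left_emb = model.encode([serialize(e) for e in left], normalize_embeddings=True)
right_emb = model.encode([serialize(e) for e in right], normalize_embeddings=True)

# Blocking: index one collection, query with the other, and keep the top-k
# neighbors of every query entity as candidate pairs.
index = faiss.IndexFlatIP(left_emb.shape[1])
index.add(left_emb.astype(np.float32))
sims, nbrs = index.search(right_emb.astype(np.float32), 2)

for r in range(len(right)):
    for sim, l in zip(sims[r], nbrs[r]):
        # In the unsupervised-matching setting, these cosine scores can be
        # thresholded or ranked to decide which candidate pairs match.
        print(f"right[{r}] <-> left[{l}]: cosine = {sim:.3f}")
```

A schema-aware variant would embed the values of each attribute separately and combine the per-attribute similarities, while supervised matching would fine-tune the language model on labeled pairs instead of relying on raw cosine scores.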

Published In

The VLDB Journal — The International Journal on Very Large Data Bases, Volume 34, Issue 1
Jan 2025
319 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 04 December 2024
Accepted: 07 October 2024
Revision received: 26 September 2024
Received: 10 January 2024

Author Tags

  1. Blocking
  2. Entity matching
  3. Entity resolution
  4. Language models

Qualifiers

  • Research-article
