An in-depth analysis of pre-trained embeddings for entity resolution

Published: 04 December 2024

Abstract

Recent works on entity resolution (ER) leverage deep learning techniques that rely on language models to improve effectiveness. These techniques are used for both blocking and matching, the two main steps of ER. Several language models have been tested in the literature, with fastText and BERT variants being the most popular, yet there is no detailed analysis of their strengths and weaknesses. We fill this gap through a thorough experimental analysis of 12 popular pre-trained language models over 17 established benchmark datasets. First, we examine their relative effectiveness in blocking, unsupervised matching, and supervised matching. We enhance our analysis by also investigating the complementarity and transferability of the language models, and we further explain their relative performance by examining the similarity scores and ranking positions that each model yields. In each task, we compare them with several state-of-the-art techniques from the literature. Then, we investigate their relative time efficiency with respect to vectorization overhead, blocking scalability, and matching run-time. The experiments are carried out in both schema-agnostic and schema-aware settings: in the former, all attribute values of an entity are concatenated into a single representative sentence, whereas in the latter the values of individual attributes are considered separately. Our results provide novel insights into the pros and cons of the main language models, facilitating their use in ER applications.
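To make the schema-agnostic setting concrete, the sketch below serializes each entity into a single sentence, embeds it with a pre-trained language model, and performs blocking via nearest-neighbor search over the embeddings, with the resulting cosine similarities serving as unsupervised-matching scores. This is a minimal illustrative sketch, not the paper's actual pipeline: it assumes the sentence-transformers and faiss-cpu packages, and the MiniLM checkpoint, the toy records, and k=2 are hypothetical choices made for the example.

```python
# Minimal sketch of a schema-agnostic ER pipeline: concatenate all attribute
# values per entity into one sentence, embed it with a pre-trained language
# model, and generate candidate pairs via nearest-neighbor search (blocking).
# Assumptions: sentence-transformers and faiss-cpu are installed; the model
# name, records, and k are illustrative, not the paper's configuration.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer


def serialize(entity: dict) -> str:
    """Schema-agnostic serialization: concatenate all attribute values."""
    return " ".join(str(v) for v in entity.values() if v)


# Two toy entity collections to be resolved against each other.
left = [
    {"title": "iPhone 14 Pro 128GB", "brand": "Apple", "price": "999"},
    {"title": "Galaxy S23 Ultra", "brand": "Samsung", "price": "1199"},
]
right = [
    {"name": "Apple iPhone 14 Pro, 128 GB", "cost": "989.00"},
    {"name": "Samsung Galaxy S23 Ultra 5G", "cost": "1189.99"},
]

model = SentenceTransformer("all-MiniLM-L6-v2")
# With L2-normalized vectors, inner product equals cosine similarity.
left_emb = model.encode([serialize(e) for e in left], normalize_embeddings=True)
right_emb = model.encode([serialize(e) for e in right], normalize_embeddings=True)

# Blocking: index one collection, query with the other, and keep the top-k
# neighbors of every query entity as candidate pairs.
index = faiss.IndexFlatIP(left_emb.shape[1])
index.add(left_emb.astype(np.float32))
sims, nbrs = index.search(right_emb.astype(np.float32), 2)

for r in range(len(right)):
    for sim, l in zip(sims[r], nbrs[r]):
        # In the unsupervised-matching setting, these cosine scores can be
        # thresholded or ranked to decide which candidate pairs match.
        print(f"right[{r}] <-> left[{l}]: cosine = {sim:.3f}")
```

A schema-aware variant would embed the values of each attribute separately and combine the per-attribute similarities, while supervised matching would fine-tune the language model on labeled pairs instead of relying on raw cosine scores.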

Published In

The VLDB Journal — The International Journal on Very Large Data Bases, Volume 34, Issue 1
Jan 2025
319 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 04 December 2024
Accepted: 07 October 2024
Revision received: 26 September 2024
Received: 10 January 2024

Author Tags

  1. Blocking
  2. Entity matching
  3. Entity resolution
  4. Language models

Qualifiers

  • Research-article
