
Towards Long-Text Entity Resolution with Chain-of-Thought Knowledge Augmentation from Large Language Models

Published: 13 December 2024

Abstract

Entity resolution is a critical problem in data integration. Recently, approaches based on pre-trained language models have shown leading performance and have become the mainstream solution. When entities carry long-text descriptions, and because language models have a limited input context length, existing approaches tend to rely on syntax-based filtering, e.g., TF-IDF or an auxiliary model, to select the parts of a description that are fed into the matcher. However, such naive filtering does not interact with the matching phase, so it may drop information that is key to computing semantic similarities and thereby degrade the final matching quality. To address long-text entity resolution, we propose a novel framework called CoTer, which follows a chunk-then-aggregate architecture. CoTer first chunks the long-text descriptions and feeds the chunks into the encoder to obtain chunked representations. It then implicitly highlights the semantically key information in these representations by injecting Chain-of-Thought reasoning knowledge from a Large Language Model. Finally, CoTer fuses the chunked representations and the reasoning knowledge in the decoder to output matching probabilities. Extensive experiments show that CoTer achieves leading performance compared with state-of-the-art solutions.
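The abstract describes the pipeline only at a high level. As a rough illustration, below is a minimal sketch of the chunk-then-aggregate idea, not the authors' implementation: the encoder name, chunk size, attention-based fusion layer, and every function and class name are assumptions introduced for illustration, and the Chain-of-Thought reasoning knowledge from the LLM is stood in for by a plain text string.

# Minimal sketch of a chunk-then-aggregate matcher (illustrative only; not CoTer's code).
# Assumptions: a generic HuggingFace encoder, 128-token chunks, and a simple
# attention-based fusion layer standing in for the paper's decoder-side fusion.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

ENCODER_NAME = "bert-base-uncased"  # assumption: any pre-trained encoder with a 512-token limit
CHUNK_TOKENS = 128                  # assumption: chunk length well within the encoder limit

tokenizer = AutoTokenizer.from_pretrained(ENCODER_NAME)
encoder = AutoModel.from_pretrained(ENCODER_NAME)

def encode_chunks(text: str) -> torch.Tensor:
    """Split a long description into fixed-size token chunks and encode each chunk."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    pieces = [tokenizer.decode(ids[i:i + CHUNK_TOKENS])
              for i in range(0, len(ids), CHUNK_TOKENS)] or [text]
    batch = tokenizer(pieces, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]  # one [CLS] vector per chunk: (num_chunks, hidden)

class ChunkAggregator(nn.Module):
    """Fuses chunk representations with an encoded reasoning text and scores the pair."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, chunk_reps: torch.Tensor, reasoning_rep: torch.Tensor) -> torch.Tensor:
        # The reasoning representation queries the chunk representations, so chunks
        # that align with the LLM's reasoning receive higher attention weight.
        fused, _ = self.attn(reasoning_rep.unsqueeze(0),
                             chunk_reps.unsqueeze(0),
                             chunk_reps.unsqueeze(0))
        return torch.sigmoid(self.score(fused.squeeze(0)))  # matching probability

# Usage: encode both long entity descriptions, encode the (hypothetical) LLM reasoning
# text, concatenate the chunk vectors of the two entities, and score the pair.
left = encode_chunks("long textual description of entity A ...")
right = encode_chunks("long textual description of entity B ...")
reasoning = encode_chunks("chain-of-thought reasoning text produced by an LLM ...")
prob = ChunkAggregator()(torch.cat([left, right], dim=0),
                         reasoning.mean(dim=0, keepdim=True))

In this sketch, chunking keeps every piece of a long description within the encoder's context window, and the reasoning text serves as the attention query, which is one plausible way to let LLM-derived knowledge highlight the semantically important chunks before scoring.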



Published In

Database Systems for Advanced Applications: 29th International Conference, DASFAA 2024, Gifu, Japan, July 2–5, 2024, Proceedings, Part V
July 2024, 561 pages
ISBN: 978-981-97-5568-4
DOI: 10.1007/978-981-97-5569-1
Editors: Makoto Onizuka, Jae-Gil Lee, Yongxin Tong, Chuan Xiao, Yoshiharu Ishikawa, Sihem Amer-Yahia, H. V. Jagadish, Kejing Lu

Publisher

Springer-Verlag, Berlin, Heidelberg

Publication History

Published: 13 December 2024

Author Tags

1. Entity Resolution
2. Entity Matching
3. Transformer
4. Pre-trained Language Model
5. Large Language Model
6. Knowledge Augmentation
7. Long Text Modeling

Qualifiers

• Article
