
Towards Long-Text Entity Resolution with Chain-of-Thought Knowledge Augmentation from Large Language Models

Published: 13 December 2024

Abstract

Entity resolution is a critical problem in data integration. Recently, approaches based on pre-trained language models have shown leading performance and have become the mainstream solution. When entities carry long-text descriptions, and because language models have a limited input context length, existing approaches tend to rely on syntax-based filtering, e.g., TF-IDF or an auxiliary model, to select the parts of a description that are fed into the matcher. However, such naive filtering does not interact with the matching phase, so it may drop information that is key to computing semantic similarities and thereby degrade the final matching quality. To address long-text entity resolution, we propose a novel framework called CoTer, which follows a chunk-then-aggregate architecture. CoTer first chunks the long-text descriptions and feeds the chunks into the encoder to obtain chunked representations. It then implicitly highlights the semantically key information in these representations by injecting Chain-of-Thought reasoning knowledge from a Large Language Model. Finally, CoTer fuses the chunked representations and the reasoning knowledge in the decoder to output matching probabilities. Extensive experiments show that CoTer achieves leading performance compared with state-of-the-art solutions.
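The abstract describes the pipeline only at a high level. As a rough illustration, below is a minimal sketch of the chunk-then-aggregate idea, not the authors' implementation: the encoder name, chunk size, attention-based fusion layer, and every function and class name are assumptions introduced for illustration, and the Chain-of-Thought reasoning knowledge from the LLM is stood in for by a plain text string.

# Minimal sketch of a chunk-then-aggregate matcher (illustrative only; not CoTer's code).
# Assumptions: a generic HuggingFace encoder, 128-token chunks, and a simple
# attention-based fusion layer standing in for the paper's decoder-side fusion.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

ENCODER_NAME = "bert-base-uncased"  # assumption: any pre-trained encoder with a 512-token limit
CHUNK_TOKENS = 128                  # assumption: chunk length well within the encoder limit

tokenizer = AutoTokenizer.from_pretrained(ENCODER_NAME)
encoder = AutoModel.from_pretrained(ENCODER_NAME)

def encode_chunks(text: str) -> torch.Tensor:
    """Split a long description into fixed-size token chunks and encode each chunk."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    pieces = [tokenizer.decode(ids[i:i + CHUNK_TOKENS])
              for i in range(0, len(ids), CHUNK_TOKENS)] or [text]
    batch = tokenizer(pieces, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]  # one [CLS] vector per chunk: (num_chunks, hidden)

class ChunkAggregator(nn.Module):
    """Fuses chunk representations with an encoded reasoning text and scores the pair."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, chunk_reps: torch.Tensor, reasoning_rep: torch.Tensor) -> torch.Tensor:
        # The reasoning representation queries the chunk representations, so chunks
        # that align with the LLM's reasoning receive higher attention weight.
        fused, _ = self.attn(reasoning_rep.unsqueeze(0),
                             chunk_reps.unsqueeze(0),
                             chunk_reps.unsqueeze(0))
        return torch.sigmoid(self.score(fused.squeeze(0)))  # matching probability

# Usage: encode both long entity descriptions, encode the (hypothetical) LLM reasoning
# text, concatenate the chunk vectors of the two entities, and score the pair.
left = encode_chunks("long textual description of entity A ...")
right = encode_chunks("long textual description of entity B ...")
reasoning = encode_chunks("chain-of-thought reasoning text produced by an LLM ...")
prob = ChunkAggregator()(torch.cat([left, right], dim=0),
                         reasoning.mean(dim=0, keepdim=True))

In this sketch, chunking keeps every piece of a long description within the encoder's context window, and the reasoning text serves as the attention query, which is one plausible way to let LLM-derived knowledge highlight the semantically important chunks before scoring.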



Published In

Database Systems for Advanced Applications: 29th International Conference, DASFAA 2024, Gifu, Japan, July 2–5, 2024, Proceedings, Part V
July 2024, 561 pages
ISBN: 978-981-97-5568-4
DOI: 10.1007/978-981-97-5569-1
Editors: Makoto Onizuka, Jae-Gil Lee, Yongxin Tong, Chuan Xiao, Yoshiharu Ishikawa, Sihem Amer-Yahia, H. V. Jagadish, Kejing Lu

Publisher

Springer-Verlag, Berlin, Heidelberg

Publication History

Published: 13 December 2024

Author Tags

1. Entity Resolution
2. Entity Matching
3. Transformer
4. Pre-trained Language Model
5. Large Language Model
6. Knowledge Augmentation
7. Long Text Modeling

Qualifiers

• Article
