A Survey of Text-Matching Techniques
Figure 2. Graph of search sources versus number of papers.
Figure 3. Graph of number of papers vs. time.
Figure 4. Graph of the number of papers in relation to the distribution of countries.
Figure 5. Graph of journals versus number of conferences and papers.
Figure 6. Matching method overview diagram.
Figure 7. Overview of logical relationships based on character-matching methods.
Figure 8. Overview of logical relationships based on phrase-matching methods.
Figure 9. Corpus-based matching method.
Figure 10. Overview of logical relationships of recursive neural network-based matching methods.
Figure 11. Overview of logical relations of sentence semantic interaction matching methods.
Figure 12. Overview of logical relationships of graph structure matching methods.
Figure 13. Overview of logical relations of sentence vector matching methods based on large language models.
Figure 14. Overview of the logical relationships of fine-grained matching methods.
Abstract
1. Introduction
2. Literature Review
3. Data Collection
4. Summary of Results
5. Text-Matching Methods
5.1. Character- and Phrase-Matching Methods
5.2. Corpus-Based Matching Methods
5.3. Recursive Neural Network-Based Matching Method
5.4. Semantic Interaction Matching Methods
5.5. Matching Methods Based on Graph Structure
5.6. Sentence Vector Matching Method Based on Large Language Models
5.7. A Matching Method Based on a Large Language Model Utilizing Different Features
6. Comparison of Different Matching Models across Datasets and Evaluation Methods
7. Field-Specific Applications
8. Open Problems and Challenges
8.1. Limitations of State-of-the-Art Large Models in Sentence-Matching Tasks
8.2. Model-Training Efficiency of Sentence-Matching Methods
8.3. Assertion Comprehension and Non-Matching Information Interference in Sentence Matching
8.4. Confusion or Omission in Handling Key Detail Facts
8.5. Diversity and Balance Challenges of Sentence-Matching Datasets
9. Future Research Directions
9.1. Future Research Direction I: The Problem of Sentence-Matching Model Publicity Driven by Commercial Interests and Solutions
9.2. Future Research Direction II: Optimization Strategies That Focus on Model-Training Efficiency for Sentence-Matching Methods
9.3. Future Research Direction III: Addressing Assertion Comprehension and Non-Matching Information Interference in Sentence Matching
9.4. Future Research Direction IV: Difficulties and Breakthroughs in Processing Key Detail Facts in Sentence-Matching Tasks
9.5. Future Research Direction V: Addressing the Challenges of Diversity and Balance in Sentence-Matching Datasets
10. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Hunt, J.W.; Szymanski, T.G. A Fast Algorithm for Computing Longest Common Subsequences. Commun. ACM 1977, 20, 350–353. [Google Scholar] [CrossRef]
- Levenshtein, V.I. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Sov. Phys. Dokl. 1966, 10, 707–710. [Google Scholar]
- Winkler, W.E. String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. 1990. Available online: https://files.eric.ed.gov/fulltext/ED325505.pdf (accessed on 21 May 2024).
- Dice, L.R. Measures of the Amount of Ecologic Association between Species. Ecology 1945, 26, 297–302. [Google Scholar] [CrossRef]
- Jaccard, P. The Distribution of the Flora in the Alpine Zone. 1. New Phytol. 1912, 11, 37–50. [Google Scholar] [CrossRef]
- Salton, G.; Buckley, C. Term Weighting Approaches in Automatic Text Retrieval. Inf. Process. Manag. 1988, 24, 513–523. [Google Scholar] [CrossRef]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
- Landauer, T.K.; Foltz, P.W.; Laham, D. An Introduction to Latent Semantic Analysis. Discourse Process. 1998, 25, 259–284. [Google Scholar] [CrossRef]
- Mueller, J.; Thyagarajan, A. Siamese Recurrent Architectures for Learning Sentence Similarity. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 2786–2792. [Google Scholar]
- Neculoiu, P.; Versteegh, M.; Rotaru, M. Learning Text Similarity with Siamese Recurrent Networks. In Proceedings of the 1st Workshop on Representation Learning for NLP, Berlin, Germany, 11 August 2016; pp. 148–157. [Google Scholar]
- Lu, X.; Deng, Y.; Sun, T.; Gao, Y.; Feng, J.; Sun, X.; Sutcliffe, R. MKPM: Multi Keyword-Pair Matching for Natural Language Sentences. Appl. Intell. 2022, 52, 1878–1892. [Google Scholar] [CrossRef]
- Deng, Y.; Li, X.; Zhang, M.; Lu, X.; Sun, X. Enhanced Distance-Aware Self-Attention and Multi-Level Match for Sentence Semantic Matching. Neurocomputing 2022, 501, 174–187. [Google Scholar] [CrossRef]
- Kim, S.; Kang, I.; Kwak, N. Semantic Sentence Matching with Densely-Connected Recurrent and Co-Attentive Information. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Hilton Hawaiian Village, Honolulu, HI, USA, 27 January–1 February 2019; pp. 6586–6593. [Google Scholar]
- Zhang, K.; Lv, G.; Wang, L.; Wu, L.; Chen, E.; Wu, F.; Xie, X. Drr-Net: Dynamic Re-Read Network for Sentence Semantic Matching. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Hilton Hawaiian Village, Honolulu, HI, USA, 27 January–1 February 2019; pp. 7442–7449. [Google Scholar]
- Wang, Z.; Hamza, W.; Florian, R. Bilateral Multi-Perspective Matching for Natural Language Sentences. arXiv 2017, arXiv:1702.03814. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Joshi, M.; Chen, D.; Liu, Y.; Weld, D.S.; Zettlemoyer, L.; Levy, O. Spanbert: Improving Pre-Training by Representing and Predicting Spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77. [Google Scholar] [CrossRef]
- Yang, Z.L.; Dai, Z.H.; Yang, Y.M.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv 2019, arXiv:1906.08237. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A Robustly Optimized Bert Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://www.mikecaptain.com/resources/pdf/GPT-1.pdf (accessed on 21 May 2024).
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language Models Are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Achiam, J.; Adler, J.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Ren, X.; Zhou, P.; Meng, X.; Huang, X.; Wang, Y.; Wang, W.; Li, P.; Zhang, X.; Podolskiy, A.; Arshinov, G. PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing. arXiv 2023, arXiv:2303.10845. [Google Scholar]
- Zhang, K.; Wu, L.; Lv, G.Y.; Wang, M.; Chen, E.H.; Ruan, S.L. Making the Relation Matters: Relation of Relation Learning Network for Sentence Semantic Matching. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; pp. 14411–14419. [Google Scholar]
- Mysore, S.; Cohan, A.; Hope, T. Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity. arXiv 2021, arXiv:2111.08366. [Google Scholar]
- Zou, Y.; Liu, H.; Gui, T.; Wang, J.; Zhang, Q.; Tang, M.; Li, H.; Wang, D. Divide and Conquer: Text Semantic Matching with Disentangled Keywords and Intents. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; pp. 3622–3632. [Google Scholar]
- Yao, D.; Alghamdi, A.; Xia, Q.; Qu, X.; Duan, X.; Wang, Z.; Zheng, Y.; Huai, B.; Cheng, P.; Zhao, Z. A General and Flexible Multi-Concept Parsing Framework for Multilingual Semantic Matching. arXiv 2024, arXiv:2403.02975. [Google Scholar]
- Asha, S.; Krishna, S.T. Semantics-Based String Matching: A Review of Machine Learning Models. Int. J. Intell. Syst. 2024, 12, 347–356. [Google Scholar]
- Hu, W.; Dang, A.; Tan, Y. A Survey of State-of-the-Art Short Text Matching Algorithms. In Proceedings of the Data Mining and Big Data: 4th International Conference, Chiang Mai, Thailand, 26–30 July 2019; pp. 211–219. [Google Scholar]
- Wang, J.; Dong, Y. Measurement of Text Similarity: A Survey. Information 2020, 11, 421. [Google Scholar] [CrossRef]
- Deza, M.M.; Deza, E. Encyclopedia of Distances; Springer: Berlin/Heidelberg, Germany, 2009; pp. 1–583. [Google Scholar]
- Li, B.; Han, L. Distance Weighted Cosine Similarity Measure for Text Classification. In Proceedings of the Intelligent Data Engineering and Automated Learning—IDEAL 2013: 14th International Conference, Hefei, China, 20–23 October 2013; pp. 611–618. [Google Scholar]
- Sidorov, G.; Gelbukh, A.; Gómez-Adorno, H.; Pinto, D. Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model. Comput. Sist. 2014, 18, 491–504. [Google Scholar] [CrossRef]
- Dean, J.; Ghemawat, S. Mapreduce: Simplified Data Processing on Large Clusters. Commun. ACM 2008, 51, 107–113. [Google Scholar] [CrossRef]
- Bejan, I.; Sokolov, A.; Filippova, K. Make Every Example Count: On Stability and Utility of Self-Influence for Learning from Noisy NLP Datasets. arXiv 2023, arXiv:2302.13959. [Google Scholar]
- Yedidia, J.S.; Freeman, W.T.; Weiss, Y. Understanding Belief Propagation and Its Generalizations. Explor. Artif. Intell. New Millenn. 2003, 8, 0018–9448. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Robertson, S.E.; Walker, S. Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. In Proceedings of the International ACM Sigir Conference on Research and Development in Information Retrieval SIGIR ‘94, Dublin, Ireland, 3–6 July 1994; pp. 232–241. [Google Scholar]
- Katz, S. Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Trans. Acoust. Speech Signal Process. 1987, 35, 400–401. [Google Scholar] [CrossRef]
- Akritidis, L.; Alamaniotis, M.; Fevgas, A.; Tsompanopoulou, P.; Bozanis, P. Improving Hierarchical Short Text Clustering through Dominant Feature Learning. Int. J. Artif. Intell. Tools 2022, 31, 2250034. [Google Scholar] [CrossRef]
- Bulsari, A.B.; Saxen, H. A recurrent neural network model. In Proceedings of the 1992 International Conference (ICANN-92), Brighton, UK, 4–7 September 1992; pp. 1091–1094. [Google Scholar]
- Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Zettlemoyer, L. Deep Contextualized Word Representations. arXiv 2018, arXiv:1802.05365. [Google Scholar]
- Levy, O.; Goldberg, Y. Dependency-Based Word Embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: Tutorials, Baltimore, MD, USA, 22 June 2014; pp. 302–308. [Google Scholar]
- Le, Q.; Mikolov, T. Distributed Representations of Sentences and Documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 22–24 June 2014; pp. 1188–1196. [Google Scholar]
- Chen, M. Efficient Vector Representation for Documents through Corruption. arXiv 2017, arXiv:1707.02377. [Google Scholar]
- Petroni, F.; Rocktäschel, T.; Lewis, P.; Bakhtin, A.; Wu, Y.; Miller, A.H.; Riedel, S. Language Models as Knowledge Bases? arXiv 2019, arXiv:1909.01066. [Google Scholar]
- Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
- Tabassum, A.; Patil, R.R. A Survey on Text Pre-Processing & Feature Extraction Techniques in Natural Language Processing. Int. Res. J. Eng. Technol. 2020, 7, 4864–4867. [Google Scholar]
- Elsafty, A. Document Similarity Using Dense Vector Representation. 2017. Available online: https://www.inf.uni-hamburg.de/en/inst/ab/lt/teaching/theses/completed-theses/2017-ma-elsafty.pdf (accessed on 22 May 2024).
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
- Shorten, C.; Khoshgoftaar, T.M.; Furht, B. Text Data Augmentation for Deep Learning. J. Big Data 2021, 8, 34. [Google Scholar] [CrossRef]
- Pan, S.J.; Yang, Q.A. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [Google Scholar] [CrossRef]
- Liu, M.; Zhang, Y.; Xu, J.; Chen, Y. Deep Bi-Directional Interaction Network for Sentence Matching. Appl. Intell. 2021, 51, 4305–4329. [Google Scholar] [CrossRef]
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. arXiv 2019, arXiv:1904.08779. [Google Scholar]
- Peng, S.; Cui, H.; Xie, N.; Li, S.; Zhang, J.; Li, X. Enhanced-RCNN: An Efficient Method for Learning Sentence Similarity. In Proceedings of the Web Conference 2020: Proceedings of the World Wide Web Conference WWW 2020, Taipei, Taiwan, 20–24 April 2020; pp. 2500–2506. [Google Scholar]
- Mahajan, P.; Uddin, S.; Hajati, F.; Moni, M.A. Ensemble Learning for Disease Prediction: A Review. Healthcare 2023, 11, 1808. [Google Scholar] [CrossRef]
- Zhu, G.; Iglesias, C.A. Computing Semantic Similarity of Concepts in Knowledge Graphs. IEEE Trans. Knowl. Data Eng. 2016, 29, 72–85. [Google Scholar] [CrossRef]
- Chen, L.; Zhao, Y.; Lyu, B.; Jin, L.; Chen, Z.; Zhu, S.; Yu, K. Neural Graph Matching Networks for Chinese Short Text Matching. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 6–8 July 2020; pp. 6152–6158. [Google Scholar]
- Dean, J.; Corrado, G.; Monga, R.; Chen, K.; Devin, M.; Mao, M.; Ranzato, M.; Senior, A.; Tucker, P.; Yang, K. Large Scale Distributed Deep Networks. In Proceedings of the Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25. [Google Scholar]
- Goldar, P.; Rai, Y.; Kushwaha, S. A Review on Parallelization of Big Data Analysis and Processing. IJETCSE 2016, 23, 60–65. [Google Scholar]
- Pluščec, D.; Šnajder, J. Data Augmentation for Neural NLP. arXiv 2023, arXiv:2302.11412. [Google Scholar]
- Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2704–2713. [Google Scholar]
- Han, S.; Pool, J.; Tran, J.; Dally, W.J. Learning Both Weights and Connections for Efficient Neural Networks. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS), Palais des Congrès de Montréal Convention and Exhibition Center, Montreal, QC, Canada, 8–10 December 2015; Volume 28. [Google Scholar]
- Chen, Z.; Qu, Z.; Quan, Y.; Liu, L.; Ding, Y.; Xie, Y. Dynamic n: M Fine-Grained Structured Sparse Attention Mechanism. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Montreal, QC, Canada, 25 February–1 March 2023; pp. 369–379. [Google Scholar]
- Tenney, I.; Das, D.; Pavlick, E. BERT Rediscovers the Classical NLP Pipeline. arXiv 2019, arXiv:1905.05950. [Google Scholar]
- Howard, J.; Ruder, S. Universal Language Model Fine-Tuning for Text Classification. arXiv 2018, arXiv:1801.06146. [Google Scholar]
- Fedus, W.; Goodfellow, I.; Dai, A.M. Maskgan: Better Text Generation via Filling in the _. arXiv 2018, arXiv:1801.07736. [Google Scholar]
- Dai, Z.H.; Yang, Z.L.; Yang, Y.M.; Carbonell, J.; Le, Q.V.; Salakhutdinov, R. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2978–2988. [Google Scholar]
- He, M.; Liu, Y.; Wu, B.; Yuan, J.; Wang, Y.; Huang, T.; Zhao, B. Efficient Multimodal Learning from Data-Centric Perspective. arXiv 2024, arXiv:2402.11530. [Google Scholar]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.Q.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 67. [Google Scholar]
- Vinyals, O.; Le, Q. A Neural Conversational Model. arXiv 2015, arXiv:1506.05869. [Google Scholar]
- Sahin, U.; Kucukkaya, I.E.; Toraman, C. ARC-NLP at PAN 2023: Hierarchical Long Text Classification for Trigger Detection. arXiv 2023, arXiv:2307.14912. [Google Scholar]
- Neill, J.O. An Overview of Neural Network Compression. arXiv 2020, arXiv:2006.03669. [Google Scholar]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
- Bordia, S.; Bowman, S.R. Identifying and Reducing Gender Bias in Word-Level Language Models. arXiv 2019, arXiv:1904.03035. [Google Scholar]
- Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; Choi, Y. The Curious Case of Neural Text Degeneration. arXiv 2019, arXiv:1904.09751. [Google Scholar]
- Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, Virtual Event, Toronto, ON, Canada, 3–10 March 2021; pp. 610–623. [Google Scholar]
- Treviso, M.; Lee, J.-U.; Ji, T.; van Aken, B.; Cao, Q.; Ciosici, M.R.; Hassid, M.; Heafield, K.; Hooker, S.; Raffel, C. Efficient Methods for Natural Language Processing: A Survey. Trans. Assoc. Comput. Linguist. 2023, 11, 826–860. [Google Scholar] [CrossRef]
- He, W.; Dai, Y.; Yang, M.; Sun, J.; Huang, F.; Si, L.; Li, Y. Space-3: Unified Dialog Model Pre-Training for Task-Oriented Dialog Understanding and Generation. arXiv 2022, arXiv:2209.06664. [Google Scholar]
- He, W.; Dai, Y.; Zheng, Y.; Wu, Y.; Cao, Z.; Liu, D.; Jiang, P.; Yang, M.; Huang, F.; Si, L. Galaxy: A Generative Pre-Trained Model for Task-Oriented Dialog with Semi-Supervised Learning and Explicit Policy Injection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 10749–10757. [Google Scholar]
- He, W.; Dai, Y.; Hui, B.; Yang, M.; Cao, Z.; Dong, J.; Huang, F.; Si, L.; Li, Y. Space-2: Tree-Structured Semi-Supervised Contrastive Pre-Training for Task-Oriented Dialog Understanding. arXiv 2022, arXiv:2209.06638. [Google Scholar]
- Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C.; Wexler, J.; Viegas, F.; Sayres, R. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholmsmässan, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 2668–2677. [Google Scholar]
- Brundage, M.; Avin, S.; Clark, J.; Toner, H.; Eckersley, P.; Garfinkel, B.; Dafoe, A.; Scharre, P.; Zeitzoff, T.; Filar, B. The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation. arXiv 2018, arXiv:1802.07228. [Google Scholar]
- Lee, A.; Miranda, B.; Koyejo, S. Beyond Scale: The Diversity Coefficient as a Data Quality Metric Demonstrates LLMs Are Pre-Trained on Formally Diverse Data. arXiv 2023, arXiv:2306.13840. [Google Scholar]
- Mondal, R.; Tang, A.; Beckett, R.; Millstein, T.; Varghese, G. What Do LLMs Need to Synthesize Correct Router Configurations? arXiv 2023, arXiv:2307.04945. [Google Scholar]
- Mumtarin, M.; Chowdhury, M.S.; Wood, J. Large Language Models in Analyzing Crash Narratives—A Comparative Study of ChatGPT, BARD and GPT-4. arXiv 2023, arXiv:2308.13563. [Google Scholar]
- Tsai, C.-M.; Chao, C.-J.; Chang, Y.-C.; Kuo, C.-C.J.; Hsiao, A.; Shieh, A. Challenges and Opportunities in Medical Artificial Intelligence. APSIPA Trans. Signal Inf. Process. 2023, 12, e205. [Google Scholar] [CrossRef]
- Zhong, T.; Wei, Y.; Yang, L.; Wu, Z.; Liu, Z.; Wei, X.; Li, W.; Yao, J.; Ma, C.; Li, X. Chatabl: Abductive Learning via Natural Language Interaction with Chatgpt. arXiv 2023, arXiv:2304.11107. [Google Scholar]
- Liu, Y.; Han, T.; Ma, S.; Zhang, J.; Yang, Y.; Tian, J.; He, H.; Li, A.; He, M.; Liu, Z. Summary of ChatGPT-Related Research and Perspective Towards the Future of Large Language Models. Meta-Radiology 2023, 1, 100017. [Google Scholar] [CrossRef]
- Sellam, T.; Das, D.; Parikh, A.P. BLEURT: Learning Robust Metrics for Text Generation. arXiv 2020, arXiv:2004.04696. [Google Scholar]
- Rahm, E.; Do, H.H. Data Cleaning: Problems and Current Approaches. IEEE Data Eng. Bull. 2000, 23, 3–13. [Google Scholar]
- Candemir, S.; Nguyen, X.V.; Folio, L.R.; Prevedello, L.M. Training Strategies for Radiology Deep Learning Models in Data-Limited Scenarios. Radiol. Artif. Intell. 2021, 3, e210014. [Google Scholar] [CrossRef]
- Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. Trans. Assoc. Comput. Linguist. 2014, 2, 67–78. [Google Scholar] [CrossRef]
- Khot, T.; Sabharwal, A.; Clark, P. Scitail: A Textual Entailment Dataset from Science Question Answering. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), Hilton New Orleans Riverside, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Xu, L.; Hu, H.; Zhang, X.; Li, L.; Cao, C.; Li, Y.; Xu, Y.; Sun, K.; Yu, D.; Yu, C. CLUE: A Chinese Language Understanding Evaluation Benchmark. arXiv 2020, arXiv:2004.05986. [Google Scholar]
- Hu, H.; Richardson, K.; Xu, L.; Li, L.; Kübler, S.; Moss, L.S. Ocnli: Original Chinese Natural Language Inference. arXiv 2020, arXiv:2010.05444. [Google Scholar]
- Liu, X.; Chen, Q.; Deng, C.; Zeng, H.; Chen, J.; Li, D.; Tang, B. Lcqmc: A Large-Scale Chinese Question Matching Corpus. In Proceedings of the the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 1952–1962. [Google Scholar]
- Chen, J.; Chen, Q.; Liu, X.; Yang, H.; Lu, D.; Tang, B. The Bq Corpus: A Large-Scale Domain-Specific Chinese Corpus for Sentence Semantic Equivalence Identification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4946–4951. [Google Scholar]
- Iyer, S.; Dandekar, N.; Csernai, K. First Quora Dataset Release: Question Pairs. 2017. Available online: https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs (accessed on 21 May 2024).
- Marelli, M.; Menini, S.; Baroni, M.; Bentivogli, L.; Bernardi, R.; Zamparelli, R. A SICK Cure for the Evaluation of Compositional Distributional Semantic Models. In Proceedings of the the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; pp. 216–223. [Google Scholar]
- Dolan, B.; Brockett, C. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, Jeju Island, Korea, 14 October 2005; pp. 9–16. [Google Scholar]
- Li, J.; Sun, A.; Han, J.; Li, C. A Survey on Deep Learning for Named Entity Recognition. IEEE Trans. Knowl. Data Eng. 2020, 34, 50–70. [Google Scholar] [CrossRef]
- Zhu, G.; Iglesias, C.A. Exploiting Semantic Similarity for Named Entity Disambiguation in Knowledge Graphs. Expert. Syst. Appl. 2018, 101, 8–24. [Google Scholar] [CrossRef]
- Alkhidir, T.; Awad, E.; Alshamsi, A. Understanding the Progression of Educational Topics via Semantic Matching. arXiv 2024, arXiv:2403.05553. [Google Scholar]
- Hayden, K.A.; Eaton, S.E.; Pethrick, H.; Crossman, K.; Lenart, B.A.; Penaluna, L.-A. A Scoping Review of Text-Matching Software Used for Student Academic Integrity in Higher Education. Int. Educ. Res. 2021, 2021, 4834860. [Google Scholar] [CrossRef]
- Jeong, J.; Tian, K.; Li, A.; Hartung, S.; Adithan, S.; Behzadi, F.; Calle, J.; Osayande, D.; Pohlen, M.; Rajpurkar, P. Multimodal Image-Text Matching Improves Retrieval-Based Chest X-Ray Report Generation. In Proceedings of the Medical Imaging with Deep Learning, Paris, France, 3–5 July 2024; pp. 978–990. [Google Scholar]
- Luo, Y.-F.; Sun, W.; Rumshisky, A. A Hybrid Normalization Method for Medical Concepts in Clinical Narrative Using Semantic Matching. AMIA Jt. Summits Transl. Sci. Proc. 2019, 2019, 732. [Google Scholar]
- Wang, L.; Zhang, T.; Tian, J.; Lin, H. An Semantic Similarity Matching Method for Chinese Medical Question Text. In Proceedings of the 8th China Health Information Processing Conference, Hangzhou, China, 21–23 October 2022; pp. 82–94. [Google Scholar]
- Ajaj, S.H. AI-Driven Optimization of Job Advertisements through Knowledge-Based Techniques and Semantic Matching. Port-Said Eng. Res. J. 2024. [Google Scholar] [CrossRef]
- Ren, H.; Zhou, L.; Gao, Y. Policy Tourism and Economic Collaboration Among Local Governments: A Nonparametric Matching Model. Public Perform. Manag. Rev. 2024, 47, 476–504. [Google Scholar] [CrossRef]
- Gopalakrishnan, V.; Iyengar, S.P.; Madaan, A.; Rastogi, R.; Sengamedu, S. Matching Product Titles Using Web-Based Enrichment. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, Maui, HI, USA, 29 October–2 November 2012; pp. 605–614. [Google Scholar]
- Akritidis, L.; Fevgas, A.; Bozanis, P.; Makris, C. A Self-Verifying Clustering Approach to Unsupervised Matching of Product Titles. Artif. Intell. Rev. 2020, 53, 4777–4820. [Google Scholar] [CrossRef]
- De Bakker, M.; Frasincar, F.; Vandic, D. A Hybrid Model Words-Driven Approach for Web Product Duplicate Detection. In Proceedings of the 25th International Conference on Advanced Information Systems Engineering, Valencia, Spain, 17–21 June 2013; pp. 149–161. [Google Scholar]
- Zheng, K.; Li, Z. An Image-Text Matching Method for Multi-Modal Robots. J. Organ. End User Comput. 2024, 36, 1–21. [Google Scholar] [CrossRef]
- Song, Y.; Wang, M.; Gao, W. Method for Retrieving Digital Agricultural Text Information Based on Local Matching. Symmetry 2020, 12, 1103. [Google Scholar] [CrossRef]
- Xu, B.; Huang, S.; Sha, C.; Wang, H. MAF: A General Matching and Alignment Framework for Multimodal Named Entity Recognition. In Proceedings of the 15th ACM International Conference on Web Search and Data Mining, Tempe, AZ, USA, 21–25 February 2022; pp. 1215–1223. [Google Scholar]
- Gong, J.; Fang, X.; Peng, J.; Zhao, Y.; Zhao, J.; Wang, C.; Li, Y.; Zhang, J.; Drew, S. MORE: Toward Improving Author Name Disambiguation in Academic Knowledge Graphs. Int. J. Mach. Learn. Cybern. 2024, 15, 37–50. [Google Scholar] [CrossRef]
- Arifoğlu, D. Historical Document Analysis Based on Word Matching. 2011. Available online: https://www.proquest.com/openview/b2c216ab3f6a907e7ad65bbe855fa8cd/1?pq-origsite=gscholar&cbl=2026366&diss=y (accessed on 23 May 2024).
- Li, Y. Unlocking Context Constraints of Llms: Enhancing Context Efficiency of Llms with Self-Information-Based Content Filtering. arXiv 2023, arXiv:2304.12102. [Google Scholar]
Author or Comparative Content | Large Language Modeling for Text Matching | The Main Ideas and Methods of the Matching Approach Are Outlined and Analyzed in Detail | Detailed Overview and Analysis of the Problems and Solutions of the Matching Methodology | Detailed Overview and Analysis of the Logical Trajectories of Typical Matching Methods | Comparison of Hardware Configurations and Training Details of Matching Methods | Performance Comparison of Text-Matching Models for Different Tasks Using Different Metrics | Overview and Analysis of Text-Matching Model Datasets and Evaluation Metrics | Text Matching Model Application Programming Interface Overview | Application of Text-Matching Models to Specific Domains | Main Research Work |
---|---|---|---|---|---|---|---|---|---|---|
Asha S et al. [30], (2024) | | | | | | | | | | This study reviews string-matching methods based on artificial intelligence techniques, focusing on the application of neural networks, graph models, attention mechanisms, reinforcement learning, and generative models. The study demonstrates their advantages and challenges in practical applications and points out future research directions. |
Hu W et al. [31], (2019) | | | | | | | | | | This paper reviews short text matching algorithms based on neural networks in recent years, introduces models based on representation and word interaction, analyzes the applicable scenarios of each algorithm, and provides a quick-start guide to short text matching for beginners. |
Wang J et al. [32], (2020) | | | | | | | | | | This paper systematically combs through the current research status of text similarity measurement, analyzes the advantages and disadvantages of the existing methods, constructs a more comprehensive classification and description system, and carries out an in-depth discussion of text distance and text representation. It also summarizes the development trend of text similarity measurement to provide a reference for related research and applications. |
Ours | | | | | | | | | | A detailed overview of the core ideas, methods, problems, and solutions for each phase of the matching model, comparing hardware and software configurations, performance per task, and datasets and evaluation methods for large models. |
Models | Key Ideas | Approaches | Key Problems | Solutions |
---|---|---|---|---|
Longest common subsequence matching [1] | Compute an array of thresholds for the two sequences, iteratively update each threshold, and thereby determine the length of the longest common subsequence step by step, ultimately recovering the subsequence itself. | The key operations are inserting, deleting, and testing membership of elements in a set whose elements are restricted to the first n integers, which in turn reduces the running time of the algorithm. | Scalability of the algorithm. | Parallelization [36] and distributed processing. |
Edit distance [2] | An analogous construction yields optimal codes capable of correcting deletions, insertions, and reversals. | Correcting deletions, insertions, and reversals in binary codes. | (1) Error correction may be limited. (2) Insertion errors are more difficult to recognize and handle. (3) Detection and correction of inversion errors. (4) Computational complexity and resource overhead may be high. | (1) More powerful error-correcting codes, such as Hamming or BCH codes, can be used. At the same time, a combination of error-correcting and error-detecting codes can be used. (2) Special synchronization marks or delimiters can be designed to separate blocks of data. In addition, coding schemes can be developed that can tolerate insertion errors. (3) Errors can be detected using methods such as parity, checksum, or cyclic redundancy check (CRC), and corrected by complex error-correcting codes for multi-bit inversions. (4) It is possible to optimize the implementation of error-correction algorithms, to select an error-correction method that suits the needs of the system, and to consider the use of specialized error-correction circuits or accelerators in the hardware design. |
Jaro similarity [3] | Improving the record linkage method in the Fellegi-Sunter model through string comparator metrics and enhanced decision rules. | (1) Uses the Jaro string comparator method, which is able to partially take into account differences between strings. (2) Proposes a method for adjusting the matching weights to trade off between perfect agreement and disagreement. (3) Uses the Fellegi-Sunter model to deal with record linkage. | (1) Quality of empirical databases in record-matching applications. (2) Parameterization of generic string comparators. | (1) Enhance data cleaning and preprocessing, increase sample diversity [37], and regularly update and maintain the database. (2) Develop adaptive string comparators, introduce partial and fuzzy matching mechanisms, optimize parameters using domain knowledge, and refine them through iterative testing. |
Dice coefficient method [4] | Uses the Dice coefficient to measure the degree of overlap between two text collections (e.g., a collection of words or word embedding vectors) and thus assess their semantic similarity. The higher the Dice coefficient, the closer the two texts are semantically. | The method usually includes text preprocessing, word embedding transformation, constructing a collection of word embedding vectors for the text, calculating the intersection and union of these collections, and finally using the Dice coefficient formula to derive the similarity score. | (1) Inability to deal effectively with synonyms or semantically similar words. (2) Ignoring contextual information, leading to misinterpretation of semantics. (3) Lower computational efficiency when dealing with large-scale text. | (1) Using word embedding models that can capture synonyms and semantically similar words, such as context-based word embedding (e.g., BERT) [17]. (2) Using models or methods that can take contextual information into account, such as semantic similarity models based on deep learning [9]. (3) For large-scale text data, approximation algorithms [38] or optimized data structures can be used to reduce the computation. |
Jaccard method [5] | The semantic similarity of two text collections is measured by calculating the ratio of their intersection to their union, emphasizing the proportion of shared elements in the whole (a minimal computational sketch of these character- and set-based measures follows this table). | The Jaccard method first preprocesses the text to obtain the set of lexical items, then calculates the intersection and union of these sets, and finally obtains the similarity score of the text by a ratio calculation. | (1) Insensitive to sparse text. (2) Ignores word weights. (3) Insensitive to word order. | (1) The word items are converted into word embedding vectors [7,39], and the similarity between the vectors is utilized to measure the semantic similarity of the text, so as to capture the semantic information in the sparse text. (2) When calculating the Jaccard similarity, the weight of the word items can be added, e.g., using methods such as TF-IDF [40] to weight the word items to reflect their importance in the text. (3) The Jaccard method can be used in conjunction with other semantic similarity measures, such as cosine similarity and edit distance, in order to comprehensively assess the similarity of the text and to reduce the bias that may be introduced by a single method. |
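To make the measures in the table above concrete, the following minimal Python sketch computes the Levenshtein edit distance [2], the Dice coefficient [4], and the Jaccard index [5] for a pair of sentences. It is an illustrative sketch only: the whitespace tokenization and lower-casing are simplifying assumptions, not part of the original formulations.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def dice(a: str, b: str) -> float:
    """Dice coefficient over whitespace-tokenized word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def jaccard(a: str, b: str) -> float:
    """Jaccard index: intersection over union of word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

if __name__ == "__main__":
    s1 = "the cat sat on the mat"
    s2 = "a cat sat on a mat"
    print(edit_distance(s1, s2), round(dice(s1, s2), 3), round(jaccard(s1, s2), 3))
```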
Models | Key Ideas | Approaches | Key Problems | Solutions |
---|---|---|---|---|
BOW [6] | Treating text (or images) as a collection of words (or features) without regard to the order or grammatical relationships of those words (or features). | (1) Extract features or vocabulary of the text or image. (2) Count the frequency of occurrence of these features or words in the text or image. (3) Combine these frequency values into a vector as a representation of the text or image. | Inability to capture word order or semantic similarity. | The n-gram model [41] and word embedding techniques [7,39] can be used as an improvement scheme. |
TF-IDF [40] | Evaluates the importance of words in a document by combining term frequency (TF) and inverse document frequency (IDF) (see the illustrative sketch after this table). | The TF-IDF algorithm evaluates the weight of a word in a document by calculating its frequency in the document and its prevalence in the whole corpus to achieve efficient processing of the text. | Simplified processing of IDF may lead to loss of information and underestimation of keywords in similar texts. | TF-IWF improvement methods can be used or combined with other text representation techniques, such as word embedding models [7,39]. |
TP-IDF [42] | Improves on the traditional TF-IDF method by combining the positional information of lexical items in the document with the inverse document frequency, especially when dealing with short texts. | By combining the position information of word items in documents with the TF-IDF model, the accuracy of assessing the importance of word items in documents is improved. This method is especially suitable for short text analysis, which can effectively use the word item position information to optimize the calculation of word item weights. | How to rationally define positional factors to accurately reflect lexical item importance, the lack of stability of lexical item frequency and positional information in short texts, the interference of stop words and noisy lexical items, and the computational complexity that may be associated with dealing with large document collections. | The efficiency and accuracy of TP-IDF in processing textual data is improved by optimizing the definition of positional factors through experiments or domain knowledge, enhancing the accuracy of word weights by combining them with other textual features, removing stop words from documents to reduce noise interference, and optimizing algorithms and data structures to reduce computational complexity. |
Word2vec [7] | Quantitative representation of semantic information is achieved by mapping words to continuous vectors through neural network models. | Two training methods, CBOW (continuous bag of words) and Skip-gram, are used to learn word vectors by predicting a word from its context and predicting the context from a word, respectively. | (1) Ignores word order. (2) Inability to handle polysemous words. (3) Out-of-vocabulary (OOV) words. | (1) Models such as RNN [43] or Transformer [16] can be combined to capture word order information. (2) Dynamic word vector models such as ELMo [44] and BERT [17] can be used to handle multiple meanings of a word. (3) Character embeddings [7,39] or pre-trained language models can be introduced to solve the OOV word problem. |
Word2VecF [45] | Word2VecF is a flexible and controllable implementation of Word2Vec that allows users to customize word embeddings by providing word–context tuples to train models using global contextual information. | The method includes data preparation, preprocessing, model training, and word embedding generation. The model is directly trained to generate custom word embeddings by preprocessing user-provided tuple data, combined with vocabulary lists and subsampling. | The impact of data quality on model performance, the complexity of parameter tuning, and the consumption of computational resources when dealing with large-scale datasets. | Solutions include data cleansing and filtering to improve data quality, parameter optimization to find the best configurations, use of distributed computing to reduce computational resource consumption, and model integration to improve the accuracy and robustness of word embeddings. |
Doc2Vec [46] | Doc2Vec’s Distributed Memory model achieves the capture of semantic information at the paragraph level through a combination of paragraph vectors (representing paragraph topics) and local context words, thus improving the accuracy of text representation and prediction. | The method trains the model by predicting the next word in the text based on the input of paragraph vectors and local context words. During the training process, the paragraph vectors and word vectors are updated to capture their respective semantic information, and the parameters are optimized iteratively to improve the predictive performance of the model. | (1) High computational cost: when Doc2Vec handles a large number of documents, the computational cost increases significantly with the number of documents. (2) Difficulty in scaling: For new documents, Doc2Vec usually needs to retrain the whole model, which limits its ability in dealing with large-scale or dynamically changing datasets. | (1) Optimization algorithms: reduce training cost of Doc2Vec and improve computational efficiency by introducing optimization algorithms such as negative sampling and hierarchical softmax. (2) Incremental learning: adopting an incremental learning approach allows Doc2Vec to update only some of the parameters when processing new documents, avoiding global retraining and enhancing scalability. |
Doc2VecC [47] | A new approach based on Doc2Vec is introduced to capture the global context of a document. It does this by randomly sampling words in a document and computing the average of these word vectors to represent the semantics of the entire document, and then using this document vector to predict the current word along with local context words. | First, a portion of the words are randomly extracted from the document in each training iteration; then, the average of the vectors of these extracted words is computed to form a document vector; and finally, this document vector and the local context words are used to predict the currently processed word. | (1) The sparsity of the training data may lead to limited model-training effectiveness. (2) The generated document vectors are usually high-dimensional and dense, which makes the vectors difficult to understand and visualize intuitively. (3) Processing all the words in the whole document may lead to computational inefficiency. | Increase the amount of training data by collecting more relevant documents or using data augmentation techniques; secondly, explore interpretable vector representations such as topic modelling to improve the interpretability and visualization of the document vectors; and lastly, construct document vectors using only some of the words in documents to improve the computational efficiency of the training process. |
GloVe [39] | Constructing word vector representations to capture the global statistical information of words in the corpus through co-occurrence matrices. | Construction of co-occurrence matrices and iterative optimization of word vectors. | Insufficient processing of rare words, missing local contextual information, and high computational complexity. | External knowledge bases [48] can be introduced to strengthen the rare word representation, combined with local context methods to enhance semantic understanding, optimized algorithms and model structures to reduce the computational burden, and distributed computing frameworks to accelerate the processing of large-scale corpora. |
LSA [8] (Latent Semantic Analysis) | Revealing implicit semantics in text by mapping the high-dimensional document space to a low-dimensional latent semantic space. | By performing a singular value decomposition or a non-negative matrix factorization of this matrix, LSA is able to obtain a low-dimensional topic vector space and map words and text to that space. | (1) Curse of dimensionality. (2) Data sparsity. (3) Lack of interpretability. (4) Inability to handle word order and syntactic information. (5) Sensitivity to parameters and decomposition methods. | (1) Feature selection [49], dimensionality reduction techniques. (2) Data preprocessing [50], sparse matrix processing techniques. (3) Topic labeling, visualization techniques. (4) Combination with other text representation methods, e.g., word embedding techniques such as word2vec [7] and BERT [17], or use of more complex models, e.g., recurrent neural networks (RNN) [43] or Transformer [16]. (5) Parameter tuning and comparison of different decomposition methods, e.g., evaluating the performance of decomposition methods such as SVD and NMF on a specific task and selecting the most suitable one. |
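As a concrete, hedged illustration of the corpus-based weighting described for TF-IDF [40], the sketch below builds TF-IDF vectors with scikit-learn and compares documents by cosine similarity. The toy corpus, the use of scikit-learn, and the English stop-word list are illustrative assumptions rather than choices made by the surveyed methods.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; in practice this would be the document collection under study.
docs = [
    "text matching measures the similarity of two sentences",
    "semantic matching compares sentence meaning",
    "the weather is sunny today",
]

# TF-IDF turns each document into a sparse, weighted term vector.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Cosine similarity between TF-IDF vectors approximates lexical relatedness.
sims = cosine_similarity(tfidf)
print(sims.round(3))  # documents 0 and 1 should score higher with each other than with 2
```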
Models | Key Ideas | Approaches | Key Problems | Solutions |
---|---|---|---|---|
MaLSTM [9] | The model captures the core meaning of a sentence through fixed-size vectors and learns to structure the representation space to reflect semantic relationships. | The model utilizes LSTM to construct sentence representations and predicts semantic similarity by computing the Manhattan metric between vectors (a minimal sketch of this Siamese, Manhattan-distance idea follows this table). | (1) The effect of data sparsity and noise on the model’s ability to capture low-frequency lexical and semantic details. (2) Limitations of models in understanding complex semantic relationships. (3) Computational efficiency issues limit application of models on large-scale datasets. | (1) Data preprocessing [50] and augmentation techniques [53]. (2) Introduction of attention mechanisms [52]. (3) Optimization of model structure and parameters, and transfer learning using pre-trained models [54]. |
DBDIN [55] | Captures semantic relevance from both directions and performs deep interaction through multiple attention-based interaction units for more accurate semantic matching. | Precise semantic matching between sentences is achieved through two-way interaction and attention mechanisms, and interaction effects are enhanced through deep fusion and self-attention mechanisms. | (1) Higher model complexity. (2) Models may have difficulty accurately capturing certain complex or ambiguous semantic relationships. (3) Affected by noisy data or uneven distribution of training data. | (1) Adopt more efficient attention mechanisms or reduce the number of network layers. (2) Introduce more contextual information or utilize pre-trained models. (3) Data augmentation [56] and noise filtering. |
Combining character-level bi-directional LSTM and Siamese architectures [10] | Deep learning models using bi-directional LSTM and Siamese architecture to learn character sequence similarity, capture semantic differences, and ignore non-semantic differences. | String embeddings are compared via bi-directional LSTM and Siamese architectures; models are trained using similarity information, and datasets are augmented to enhance generalization. | (1) High model complexity. (2) Excessive attention to detail. (3) Sensitivity to noisy data. | (1) Reduce the number of network layers or use efficient algorithms to reduce complexity and computational resource consumption. (2) Combine word-level or sentence-level embedding to compensate for the information loss of character-level processing. (3) Remove noisy data and outliers to improve model robustness, reduce the risk of overfitting, and improve model stability and generalization ability. |
MKPM [11] | Proposes a sentence matching method based on keyword pair matching that combines attention and two-way task architecture to improve matching accuracy. | The method utilizes an attention mechanism to filter keyword pairs and models semantic relationships through a two-way task that combines word-level and sentence-level information to improve matching. | (1) Keyword pair selection problem. (2) Complex semantic relationship-processing limitations. (3) Model complexity and computational cost. | (1) Improve keyword pair extraction and selection. (2) Expand or dynamically adjust the number of keyword pairs. (3) Reduce computational cost and improve model efficiency and scalability by optimizing model structure and algorithms. |
Enhanced-RCNN [57] | Enhanced-RCNN combines the advantages of CNN and attention-based RNN for sentence similarity learning, aiming to capture similarities and differences between sentences more accurately, while reducing computational complexity. | A Siamese multilayer CNN is utilized to extract sentence key information, an attention-based RNN is employed to capture the interaction effects between sentences, and a fusion layer is designed to integrate the outputs of the CNN and RNN in order to calculate the similarity score. | (1) Limited ability to process semantic relations. (2) Dataset limitations. (3) As model complexity increases, consumption of computational resources and time may become a challenge. | (1) Enhance the model’s understanding of complex semantic relationships by introducing advanced techniques such as context embedding or pre-trained language models. (2) Improve the generalization ability of the model to adapt to more scenarios and domains by increasing the size and diversity of the dataset. (3) Optimize model structure and algorithms, exploring ensemble learning [58] and transfer learning [54]. |
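The sketch below illustrates the Siamese, Manhattan-distance idea behind MaLSTM [9] in PyTorch: one shared LSTM encodes both sentences into fixed-size vectors, and similarity is scored as the exponential of the negative L1 distance. The embedding layer, dimensions, and use of the final hidden state are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class SiameseLSTM(nn.Module):
    """Shared encoder for both sentences; similarity = exp(-||h1 - h2||_1)."""

    def __init__(self, vocab_size: int, embed_dim: int = 100, hidden_dim: int = 50):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def encode(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)           # final hidden state of the LSTM
        return h_n[-1]                              # (batch, hidden_dim)

    def forward(self, sent_a: torch.Tensor, sent_b: torch.Tensor) -> torch.Tensor:
        h_a, h_b = self.encode(sent_a), self.encode(sent_b)
        l1 = torch.sum(torch.abs(h_a - h_b), dim=1) # Manhattan distance
        return torch.exp(-l1)                       # similarity in (0, 1]

# Toy usage with random token ids (batch of 2 sentence pairs, length 7).
model = SiameseLSTM(vocab_size=1000)
a = torch.randint(1, 1000, (2, 7))
b = torch.randint(1, 1000, (2, 7))
print(model(a, b))  # two similarity scores; trainable with MSE against gold labels
```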
Models | Key Ideas | Approaches | Key Problems | Solutions |
---|---|---|---|---|
DSSTM [12] | This model considers the importance of distance markers and combines self-attention and co-attention mechanisms to enhance sentence semantic modeling and extract cross-sentence interaction features for sentence semantic matching tasks (a minimal co-attention alignment sketch follows this table). | The sentences are first encoded and embedded; then the distance-aware self-attention mechanism is used to enhance the semantic modeling, followed by the application of a co-attention layer to extract cross-sentence interaction features, which are fused by a multilevel matching function; finally, the model’s effectiveness and performance are validated experimentally. | (1) Computational efficiency issues. (2) Inadequate capture of deep semantic structure. (3) The design of the multilevel matching function may need to be further optimized to better capture the subtle differences and diversity between sentence pairs. | (1) Optimize model structures and algorithms. (2) Utilize more complex semantic representations such as pre-trained language models or knowledge graphs. (3) Optimize the multilevel matching function and design more reasonable loss functions and training strategies. |
DRCN [13] | The architecture effectively preserves the original and synergetic attention feature information by means of dense connectivity, which enhances the understanding of semantic relations between sentences and improves the performance of the sentence-matching task. | A sentence-matching architecture is proposed that combines a collaborative attention mechanism and dense connectivity to capture and preserve semantic information and feature representations between sentences, while an autoencoder is used to compress features and maintain model efficiency. | (1) Increased computational complexity and feature dimensions. (2) Information loss due to autoencoder. | (1) Reduce computational complexity by utilizing advanced feature selection algorithms to reduce redundant features and introducing lightweight network structures to reduce the number of parameters. (2) Employ finer reconstruction loss functions to ensure that key information is retained as much as possible when compressing features. Regularization terms are introduced to constrain the coding process and reduce information loss. |
DRr-Net [14] | This method solves the problem of insufficient semantic comprehension that may result from selecting important parts all at once by simulating the dynamic attention mechanism of human reading, focusing on a small part of the sentence area and rereading important words in each step, so as to understand the sentence semantics more accurately. | (1) Introducing the Attention Stack Gated Recurrent Unit (ASG). (2) Designing the dynamic rereading unit (DRr). | (1) Insufficient richness of training data. (2) Impact of data preprocessing quality. | (1) Optimizing training strategies and increasing training data. (2) Improving data preprocessing techniques [50], incorporating linguistic knowledge and rules. |
BiMPM [15] | A bilateral multi-perspective matching model is used to achieve comprehensive and in-depth matching of sentences and improve processing performance. | BiLSTM is utilized to encode sentences; matching information is obtained through bilateral and multi-perspective matching mechanisms, aggregated into matching vectors, and used for the final decision to improve matching accuracy. | (1) Higher model complexity. (2) Limited ability to handle long sentences. (3) Reduced performance in cross-domain or cross-language tasks. | (1) Reduce complexity by optimizing the model structure, reducing the number of parameters, or using a lightweight network structure to reduce computation and storage requirements. (2) Use a segmentation processing strategy to ensure that the model can adequately capture key information when processing long sentences. (3) Utilize domain-adaptive techniques or cross-language pre-training models, incorporating external knowledge [48] or resources to enhance the context-awareness of the models in cross-domain or cross-language tasks. |
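Several of the interaction models above (e.g., DSSTM [12], DRCN [13], BiMPM [15]) build on attention-based soft alignment between the token representations of the two sentences. The PyTorch fragment below sketches that basic co-attention alignment step under assumed tensor shapes; it is a schematic building block, not a reproduction of any single model.

```python
import torch
import torch.nn.functional as F

def co_attention_align(h_a: torch.Tensor, h_b: torch.Tensor):
    """Soft-align two encoded sentences.

    h_a: (batch, len_a, dim)  token representations of sentence A
    h_b: (batch, len_b, dim)  token representations of sentence B
    Returns each sentence re-expressed from the other's perspective.
    """
    # Similarity score for every token pair: (batch, len_a, len_b)
    scores = torch.bmm(h_a, h_b.transpose(1, 2))

    # Each token in A attends over B, and vice versa.
    a_to_b = torch.bmm(F.softmax(scores, dim=2), h_b)                  # (batch, len_a, dim)
    b_to_a = torch.bmm(F.softmax(scores, dim=1).transpose(1, 2), h_a)  # (batch, len_b, dim)
    return a_to_b, b_to_a

# Toy usage: batch of 2, sentence lengths 5 and 6, hidden size 8.
h_a, h_b = torch.randn(2, 5, 8), torch.randn(2, 6, 8)
aligned_a, aligned_b = co_attention_align(h_a, h_b)
print(aligned_a.shape, aligned_b.shape)  # torch.Size([2, 5, 8]) torch.Size([2, 6, 8])
```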
Models | Key Ideas | Approaches | Key Problems | Solutions |
---|---|---|---|---|
Wpath [59] | A new semantic similarity measure, wpath, is proposed, which combines structural information and the information content of concepts in a knowledge graph to assess semantic similarity more comprehensively and accurately (a schematic sketch of this path-plus-information-content idea follows this table). | The final semantic similarity is derived by calculating the information content and the shortest path length of the concepts, and weighting the path length using the information content. This approach integrates information from multiple dimensions of concepts in the knowledge graph and improves the accuracy of the similarity calculation. | (1) Applicability to sparse knowledge graphs. (2) The computation of wpath methods may become complex and time-consuming in large knowledge graphs. (3) Information content, while reflecting the generality of concepts, is not sufficient to fully capture semantic relationships. | (1) Increase path diversity to improve the reliability of similarity measures in sparse graphs. (2) Reduce computational complexity in large-scale graphs by optimizing the graph structure or using efficient algorithms. (3) Evaluate semantic similarity more comprehensively by combining contextual information, co-occurrence relations, etc. (4) Adjust the weights according to the knowledge graph characteristics to obtain a more accurate similarity measure. |
GMN [60] | A neural graph matching network framework (GMN) is proposed to solve the performance problem in Chinese short text matching by utilizing multi-granularity input information and an attention graph matching mechanism. | The GMN method uses word lattice graphs as inputs, updates node representations through the graph matching attention mechanism, fuses multi-granularity information, and can be combined with pre-trained language models to improve matching performance. | (1) Higher computational complexity. (2) Insufficient interpretability. (3) Dependence on pre-trained models. (4) Data quality and quantity challenges. | (1) Improve the graph matching attention mechanism with distributed [61] or parallel processing techniques [62]. (2) Enhance the study of internal mechanisms to provide visualization or explanatory methods. (3) Explore structures that do not rely on external models or utilize transfer learning methods [54]. (4) Expand training data by utilizing techniques such as data augmentation [63] and semi-supervised learning. |
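To illustrate the path-plus-information-content idea described for wpath [59], the sketch below computes a graph-based similarity in which the shortest path length between two concepts is damped by the information content of their least common subsumer. The specific weighting form, the toy taxonomy, and the IC values are illustrative assumptions based on the general description above, not a verbatim reproduction of the published measure.

```python
import networkx as nx

def wpath_like_similarity(graph: nx.Graph, ic: dict, c1: str, c2: str,
                          lcs: str, k: float = 0.8) -> float:
    """Schematic graph-based similarity: shortest path length weighted by the
    information content (IC) of the least common subsumer (lcs).

    graph: concept graph (e.g., a taxonomy); ic: concept -> information content.
    The weighting form 1 / (1 + path * k**IC) is an illustrative assumption.
    """
    path_len = nx.shortest_path_length(graph, c1, c2)
    return 1.0 / (1.0 + path_len * (k ** ic[lcs]))

# Toy taxonomy: animal -> {dog, cat}; IC grows with specificity (illustrative values).
g = nx.Graph([("animal", "dog"), ("animal", "cat")])
ic = {"animal": 0.5, "dog": 2.0, "cat": 2.0}
print(round(wpath_like_similarity(g, ic, "dog", "cat", lcs="animal"), 3))
```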
Models | Key Ideas | Approaches | Key Problems | Solutions |
---|---|---|---|---|
BERT [17] | It uses a bi-directional encoder and self-attention mechanism to capture the deep semantic and contextual information of the text, and it realizes the efficient application of the model through two phases: pre-training and fine-tuning (a minimal sentence-similarity sketch with a pre-trained encoder follows this table). | BERT uses the Transformer model as its infrastructure and is pre-trained with large-scale unlabeled text data. | (1) High consumption of computing resources. (2) Insufficient learning. (3) Difficulty in processing long text. | (1) Model compression [64,65] and optimization, distributed training. (2) Pre-training data augmentation [63], task-specific fine-tuning. (3) Text truncation and splicing, designing model variants for long texts [18]. |
SpanBERT [18] | SpanBERT optimizes the pre-training process by masking successive text segments and predicting their contents to better capture contextual and semantic information in the text, and it is particularly suitable for tasks that require the recognition and exploitation of text segments. | (1) Masking continuous text segments. (2) Prediction using boundary representation. | (1) Over-reliance on contextual information. (2) Randomness of masking strategies. | (1) Balanced learning of contextual information and segment internal details is achieved by designing adaptive masking method that combines multiple masking strategies. (2) Adopting means such as increasing the number of training rounds, expanding the dataset, or using advanced optimization algorithms, assisted by unsupervised pre-training tasks, to improve the stability and generalization ability of the model. |
XLNet [19] | Combines the strengths of masked (autoencoding) language models and autoregressive language models through permutation language modeling. | (1) Randomized masking. (2) Target position sampling. (3) Self-attention and bi-directional contexts. (4) Pre-training and fine-tuning. | (1) Computational complexity problems. (2) Long sequence-modeling problem. (3) Large-scale pre-training problem. (4) Explosive growth of permutations problem. (5) Limitations of autoregressive models. | (1) Approximate attentional mechanisms or sparse attention [66]. (2) Chunked processing and intermediate layer delivery [67]. (3) Transfer learning [68]. (4) Approximate sampling and masking techniques [69]. (5) Localized windows and window relative position coding [70]. |
Roberta [20] | The key idea of Roberta is to further improve the model’s performance on natural language processing tasks by improving and optimizing the structure and training strategy of the BERT model. | This study optimizes BERT’s performance by replicating the BERT pre-training process, tuning hyperparameters, and evaluating different training data volumes, and it demonstrates that its performance is comparable to other state-of-the-art models. | (1) Hyperparameter tuning may not be comprehensive, leaving out some of the more optimal model settings. (2) Training a language model requires significant computational resources and time, limiting in-depth exploration. (3) The study mainly focuses on the BERT model itself and does not consider combining it with other state-of-the-art models, which may limit performance improvement. | (1) Expand hyperparameter searches to incorporate automated optimization techniques, such as Bayesian optimization or grid searches, to fully explore model configurations. (2) Utilize distributed computing or cloud computing resources to accelerate the training process and improve research efficiency. (3) Combine BERT with other advanced models or techniques, such as ensemble learning [58] and knowledge distillation [71], to enhance model performance. Also draw on optimization methods from other domains, such as transfer learning [54] and meta-learning, to improve the pre-training process. |
GPT [21] | Understanding natural language through unsupervised learning, generating coherent text sequences with a unidirectional decoder structure, and applying the model to a variety of NLP tasks through a pre-training–fine-tuning framework. | Pre-training and fine-tuning with a one-way Transformer decoder, learning language representations from large-scale data through self-supervised learning, and applying them to specific NLP tasks to improve performance. | (1) Bias in training data. (2) Dialog consistency. (3) Difficulty processing long text. | (1) Diverse training data [72]. (2) Introduction of dialog history [73]. (3) Segmented processing [74]. |
GPT2 [22] | Pre-training with the Transformer network structure to generate text, modeling context autoregressively to learn a generic language representation. | Pre-training, the Transformer network architecture, autoregressive context modeling, multi-layer representations with multi-headed attention, and fine-tuning. | (1) Supports only unidirectional text generation. (2) Many parameters and high training costs. (3) Lack of common sense and world knowledge. (4) Susceptibility to input bias. (5) Lack of consistency and controllability. (6) Difficulty in handling rare events. (7) Safety and ethical issues. | (1) Bi-directional encoders [72]. (2) Model compression techniques [75,76]. (3) Introduction of external knowledge [48]. (4) Dealing with input bias [77]. (5) Improving consistency and controllability [78]. (6) Data augmentation and domain-specific training [20]. (7) Safety and ethical measures [79]. |
GPT3 [23] | GPT-3 is a large-scale language model pre-trained with the Transformer architecture on massive text data. It is capable of generating coherent text with broad adaptability and flexibility, and it demonstrates powerful language-processing capabilities on a variety of tasks. | Pre-training utilizes self-supervised learning to gain linguistic knowledge and patterns. Fine-tuning optimizes the model for specific tasks with labeled data. | (1) High cost of training and inference. (2) Lack of common sense and reasoning skills. (3) Sensitivity to input bias. (4) Difficulty in ensuring consistency of generated content. | (1) Model size and efficiency optimization [80]. (2) External knowledge introduction [81,82,83], data augmentation [63], and domain-specific training. (3) Improved interpretability [84] and control, social and ethical norms [85]. (4) Providing more contextual information, introducing additional constraints, tuning model parameters, incorporating human review or post-editing, and using larger training datasets. |
GPT4 [24] | A Transformer-based multimodal pre-trained language model is used to predict the next token in a document. | GPT-4 is a generative pre-trained model that uses an autoregressive Transformer architecture and uses more parameters and layers in training to enhance its representation. It also utilizes larger datasets for pre-training to improve its understanding of different domains and contexts. | (1) Hallucinated facts and faulty reasoning. (2) Lack of knowledge of data newer than its training corpus. (3) Confident errors and the presence of bias. | (1) Increasing the amount and diversity of training data [86], manual intervention for correction [87], continuous updating of the model (e.g., drawing on the BARD model to incorporate new knowledge and update the structure [88]), and enhancement of the model’s interpretability [89]. (2) Adding the latest training data, optimizing the design of the GPT-4 algorithm (e.g., the ChatABL method [90]), enhancing the quality control of the input data [91], and developing better evaluation metrics [92]. (3) Checking the quality of input data [93], increasing the dataset size [94] and the number of training sessions, and using ensemble learning techniques [58]. |
PANGU-Σ [25] | Improving the performance of natural language processing tasks using a large-scale autoregressive language model obtained by pre-training on massive text corpora. | An autoregressive approach is used, containing stacked Transformer decoding and query layers. The bottom M layers are shared across all domains, while the top N layers are sparsely activated based on the domain of the input data. Each Random Routed Experts (RRE) layer consists of K experts in each of G groups, providing a choice between mixed, dense, and sparse modes. | (1) The problem of constructing a user-friendly sparse architecture system. (2) The problem of developing a language model for obtaining accurate feedback from open environments. (3) The problem of effectively combining language models with multimodal perceptual inputs. (4) Overcoming the cost of deploying large language models. (5) The problem of effectively storing and updating the knowledge required for large language models online. | (1) Design optimized algorithms and system architectures. (2) Improve data-labeling methods and self-supervised learning techniques. (3) Design multimodal fusion network architectures. (4) Use model compression and optimization techniques. (5) Design efficient storage systems and incremental training algorithms. |
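As referenced in the BERT row above, the pre-training–fine-tuning models in this table are typically applied to sentence matching as cross-encoders. The sketch below is a minimal usage illustration with Hugging Face Transformers, not the original papers' training setup: the checkpoint name, label count, and example sentences are assumptions, and the classification head is untrained, so the printed probability only illustrates the calling pattern.

```python
# Minimal sketch: scoring a sentence pair with a BERT-style cross-encoder.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

sent_a = "How do I reset my password?"
sent_b = "What is the procedure for changing a forgotten password?"

# The pair is packed as [CLS] sent_a [SEP] sent_b [SEP]; self-attention lets every
# token of one sentence attend to the other, which is the cross-sentence interaction.
inputs = tokenizer(sent_a, sent_b, truncation=True, padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
match_prob = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"match probability (untuned head, illustrative only): {match_prob:.3f}")
```

In practice the classification head would first be fine-tuned on a labeled sentence-pair dataset before the score is meaningful.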
Models | Key Ideas | Approaches | Key Problems | Solutions |
---|---|---|---|---|
R2-Net [26] | The accuracy of sentence semantic matching is improved by deep mining the semantic information and relational guidance of labels. | First, BERT and CNN encoders are combined to capture the semantic information of sentences from the global and local perspectives, respectively; second, a self-supervised relation classification task is introduced to guide the matching process using the semantic relations between labels; and finally, a triplet loss function is used to refine relation distinctions and improve the model performance. | The consumption of computational resources is high, and the speed is limited when dealing with large-scale datasets; the quality of labels affects the performance of the model, and performance may be degraded in the presence of noisy or inaccurate labels; and hyperparameter tuning and optimization are also a major challenge, as they need to be carefully adjusted for specific tasks. | For the model complexity problem, model compression and distributed computing can be used to improve efficiency; for the label quality problem, semi-supervised learning or transfer learning can be explored to compensate for deficiencies; and the manual workload can be reduced by utilizing automatic hyperparameter-tuning tools. The combined application of these solutions can improve the model’s performance and practicality. |
ASPIRE [27] | This study proposes a method to accurately assess the similarity between scientific documents by matching fine-grained aspects using co-cited sentences in papers as natural supervised signals. This approach aims to capture the deeper associations between documents for a more precise and nuanced similarity assessment. | The ASPIRE method is based on sentence context representation and aggregation techniques to efficiently assess inter-document similarity by capturing fine-grained matches at the sentence level through multi-vector modeling (an illustrative late-interaction sketch follows this table). | (1) Data sparsity and noise. (2) Computational complexity. | (1) Data augmentation [63] and filtering. (2) Model optimization and acceleration: simplify the structure, reduce dimensionality, and combine parallel and distributed computing [61] to improve the efficiency of processing large-scale corpora. |
DC-Match [28] | This study addresses the need for matching at different granularities in the text semantic matching task, and it proposes to optimize the matching performance and improve the matching accuracy by distinguishing between keywords and intent and adopting a partitioning strategy. | The proposed DC-Match training strategy contains three objectives: global matching model classification, remote supervised classification to distinguish between keywords and intent, and the special training of partitioning ideas to ensure that the distribution of global matches is similar to the distribution of combinatorial solutions distinguishing between keywords and intent, to achieve more accurate matches. | (1) The accuracy of the remotely supervised classification loss is affected by the quality of external resources, which may lead to misleading keyword extraction. (2) The special training objective of the partitioning idea may increase the computational resources and model complexity, and prolong the training time. (3) DC-Match methods may not be flexible enough and may need to be adapted for specific scenarios. | (1) Optimize the selection of external resources and quality control, and use advanced remote supervision technology to improve the accuracy of supervision information. (2) Optimize the algorithm and model structure to reduce the amount of computation, and use distributed or parallel computing to accelerate the training process [25]. (3) Introduce domain-adaptive or transfer learning techniques [68] to improve the flexibility and adaptability of DC-Match methods in different domains and types of text-matching tasks. |
MCP-SM [29] | Proposes a multi-concept parsing approach to optimize multilingual semantic matching and reduce the dependence on traditional named entity recognition. | The study proposes an MCP-SM framework based on a pre-trained language model that parses text into multiple concepts and incorporates categorical tokens to enhance the model’s semantic representation capability and achieve flexible multilingual semantic matching. | (1) Uncertainty in concept extraction. (2) Differences in language and domain. (3) Limitations of computational resources. | (1) Optimize the concept extraction algorithm. (2) Introduce multi-language and domain adaptation mechanisms. (3) Optimize the model structure and computation process. |
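As referenced in the ASPIRE row above, fine-grained, sentence-level matching can be approximated by a simple late-interaction aggregation: embed each sentence of both documents and average the best cross-document similarities. The sketch below is an assumption-laden toy, not the ASPIRE model itself; the hashed bag-of-words encoder merely keeps the example self-contained, whereas a real system would use a trained sentence encoder.

```python
# Toy late-interaction aggregation: each sentence gets its own vector, and
# document similarity averages each sentence's best cross-document match.
import numpy as np

def encode_sentences(sentences, dim=64):
    """Toy encoder (assumption): hashed bag-of-words vectors, L2-normalised."""
    vecs = np.zeros((len(sentences), dim))
    for i, s in enumerate(sentences):
        for tok in s.lower().split():
            vecs[i, hash(tok) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-9, None)

def fine_grained_similarity(doc_a, doc_b):
    """Symmetric average of best cosine matches in both directions."""
    A, B = encode_sentences(doc_a), encode_sentences(doc_b)
    sims = A @ B.T  # pairwise cosine similarities between sentences
    return 0.5 * (sims.max(axis=1).mean() + sims.max(axis=0).mean())

doc_a = ["We propose a graph attention model.", "It improves short text matching."]
doc_b = ["Short text matching benefits from attention.", "Graphs encode word lattices."]
print(f"fine-grained similarity: {fine_grained_similarity(doc_a, doc_b):.3f}")
```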
Models | Parameters | Hardware Specifications | Training Duration | Context Learning |
---|---|---|---|---|
DBDIN [55] | 7.8 M | NVIDIA GTX1080 GPU card (NVIDIA, Santa Clara, CA, USA) | - | Yes |
MKPM [11] | 1.8 M | - | - | Yes |
Enhanced-RCNN [57] | 7.7 M | Nvidia P100 GPU (NVIDIA, Santa Clara, CA, USA) | - | Yes |
DRCN [13] | 6.7 M | - | - | Yes |
DC-Match [28] | - | RTX 3090 GPU | - | Yes |
MCP-SM [29] | - | A100 GPUs | - | Yes |
R2-Net [26] | - | Nvidia Tesla V100-SXM2-32 GB GPUs (NVIDIA, Santa Clara, CA, USA) | - | Yes |
BERT [17] | 340 M | Nvidia A100, V100 (NVIDIA, Santa Clara, CA, USA) | Depends on model parameter scale | Yes |
SpanBERT [18] | - | V100 GPU | 15 days | Yes |
GPT [21] | 125 M | - | - | Yes |
XLNet [19] | 340 M | 512 TPU v3 | 5.5 days | Yes |
Roberta [20] | 355 M | V100 GPU | about 1 day | Yes |
GPT2 [22] | 1.5 B | - | - | Yes |
GPT3 [23] | 175 B | V100 GPU | - | Yes |
GPT4 [24] | 1.8 Trillion | - | - | Yes |
PANGU-Σ [25] | 1.085 T | Ascend 910 NPU (Huawei, Shenzhen, China) | about 100 days | Yes |
Models | Accuracy/MAP/MRR/r/ρ/MSE/F1/GLUE | Results by Dataset
---|---|---|
MaLSTM [9] | Pearson correlation (r), Spearman’s ρ and MSE are 0.8822, 0.8345, and 0.2286, respectively. The accuracy of MaLSTM features + SVM is 84.2%. | The Pearson correlation (r), Spearman’s ρ, and MSE of MaLSTM on SICK are 0.8822, 0.8345, and 0.2286, respectively. The accuracy of MaLSTM features combined with SVM is 84.2%. |
DBDIN [55] | The accuracy rates are 88.8%, 86.8%, and 89.03% respectively. | The accuracy rates of DBDIN on SNLI, SciTail, and QQP are 88.8%, 86.8%, and 89.03%, respectively. |
MKPM [11] | The accuracy rates are 86.71%, 84.11%, 89.66%, and 88.2% respectively. | The accuracy rates of MKPM on LCQMC, BQ, QQP, and SNLI are 86.71%, 84.11%, 89.66%, and 88.2%, respectively. |
Enhanced-RCNN [57] | The accuracies (Acc) are 89.3%, 75.51%, 90.28%, and 90.35%, respectively, while the F1 scores are 89.47%, 73.39%, 76.85%, and 74.20%, respectively. | The accuracy (Acc) of Enhanced-RCNN on QQP and Ant Financial are 89.3% and 75.51%, respectively, with F1 scores of 89.47% and 73.39% respectively. The accuracy (Acc) of Enhanced-RCNN (ensemble) on QQP and Ant Financial are 90.28% and 90.35%, respectively, with F1 scores of 76.85% and 74.20%, respectively. |
DRCN [13] | The accuracy rates are 86.5%, 90.1%, 80.6%, 79.5%, 82.3%, 81.4%, and 91.3%, respectively; the MAP values are 0.804, 0.830, and 0.925, while the MRR values are 0.862, 0.908, and 0.930, respectively. | DRCN (-Attn, -Flag) achieved an accuracy of 86.5% on SNLI, while DRCN achieved 90.1%. On MultiNLI, DRCN achieved accuracies of 80.6% (matched) and 79.5% (mismatched), and DRCN + ELMo* achieved 82.3% (matched) and 81.4% (mismatched). DRCN achieved an accuracy of 91.3% on QQP. On TrecQA (raw), DRCN achieved MAP and MRR of 0.804 and 0.862; on TrecQA (clean), 0.830 and 0.908; and on SelQA, 0.925 and 0.930 (MAP and MRR are illustrated in a short sketch after this table). |
BiMPM [15] | The accuracy rates are 88.17% and 88.8%, respectively. The MAP/MRR values are 0.802/0.875 and 0.718/0.731, respectively. | The accuracy of BiMPM on QQP and SNLI is 88.17% and 88.8%, respectively. On TREC-QA, the MAP and MRR are 0.802 and 0.875, while on WikiQA they are 0.718 and 0.731, respectively. |
BERT [17] | The GLUE scores are 84.6/83.4, 71.2, 90.5, 93.5, 52.1, 85.8, 88.9, 66.4, 86.7/85.9, 72.1, 92.7, 94.9, 60.5, 86.5, 89.3, 70.1, and 82.1, respectively. | The GLUE scores of BERT-base on MNLI (-m/mm), QQP, QNLI, SST-2, CoLA, STS-B, MRPC, and RTE are 84.6/83.4, 71.2, 90.5, 93.5, 52.1, 85.8, 88.9, and 66.4, respectively, while the GLUE scores of BERT-large on the same tasks are 86.7/85.9, 72.1, 92.7, 94.9, 60.5, 86.5, 89.3, and 70.1, respectively, with an overall GLUE average of 82.1 for BERT-large. |
SpanBERT [18] | The GLUE scores are 64.3, 94.8, 90.9/87.9, 89.9/89.1, 71.9/89.5, 88.1/87.7, 94.3, and 79.0, respectively. | The GLUE scores of SpanBERT on CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, and RTE are 64.3, 94.8, 90.9/87.9, 89.9/89.1, 71.9/89.5, 88.1/87.7, 94.3, and 79.0 respectively. |
XLNet [19] | The GLUE scores are 77.4, 88.4, 93.9, 91.8, 81.2, 94.4, 90.0, 65.2, and 91.1, respectively. | The GLUE scores of XLNet-Large wikibooks on RACE, MNLI, QNLI, QQP, RTE, SST-2, MRPC, CoLA, and STS-B are 77.4, 88.4, 93.9, 91.8, 81.2, 94.4, 90.0, 65.2, and 91.1, respectively. |
Roberta [20] | The GLUE scores are 90.8/90.2, 98.9, 90.2, 88.2, 92.3, and 67.8, respectively. | RoBERTa’s GLUE scores on MNLI, QNLI, QQP, RTE, MRPC, and CoLA are 90.8/90.2, 98.9, 90.2, 88.2, 92.3, and 67.8, respectively. |
GPT [21] | The accuracy rates are 82.1/81.4, 89.9, 88.3, 88.1, and 56.0, respectively. The MC, Acc, F1, F1, and PC scores are 45.4, 91.3, 82.3, 70.3, and 82.0, respectively. | GPT achieves accuracy rates of 82.1/81.4, 89.9, 88.3, 88.1, and 56.0 on MNLI (-m/mm), SNLI, SciTail, QNLI, and RTE, respectively. On CoLA, SST-2, MRPC, QQP, and STS-B, GPT’s scores are 45.4 (Matthews correlation), 91.3 (accuracy), 82.3 (F1), 70.3 (F1), and 82.0 (Pearson correlation), respectively. |
GPT3 [23] | The accuracy rate is 69%. | The accuracy rate of GPT-3 Few-Shot on RTE is 69%. |
PANGU-Σ [25] | The accuracy rates are 51.14%, 45.97%, 68.49%, and 56.93%, respectively. | The accuracy rates of PANGU-Σ on CMNLI, OCNLI, AFQMC, and CSL are 51.14%, 45.97%, 68.49%, and 56.93%, respectively. |
GMN [60] | The accuracy rates are 84.2% and 84.6%, and the F1 scores are 84.1% and 86%, respectively. | The GMN model achieves accuracy rates of 84.2% and 84.6% on LCQMC and BQ, respectively, with F1 scores of 84.1% and 86%. |
DRr-Net [14] | The accuracy rates are 87.7%, 71.4%, 76.5%, 88.3%, and 89.75%, respectively. | The accuracy rates of DRr-Net on different SNLI test sets are 87.7%, 71.4%, and 76.5%, respectively, while the accuracy rates on the SICK and QQP datasets are 88.3% and 89.75%, respectively. |
DC-Match [28] | The accuracy rates are 91.7%, 92.2%, 88.1%, 88.9%, 73.73%, and 74.22%, respectively, while the Macro-F1 scores are 72.96% and 73.67%. | DC-Match (RoBERTa-base) and DC-Match (RoBERTa-large) achieve accuracy rates of 91.7% and 92.2% on QQP, 88.1% and 88.9% on MRPC, and 73.73% and 74.22% on Medical-SM, respectively. The Macro-F1 scores for DC-Match (RoBERTa-base) and DC-Match (RoBERTa-large) are 72.96% and 73.67%, respectively. |
MCP-SM [29] | The accuracy rates are 91.8%, 92.4%, 88.9%, 89.7%, 73.90%, 74.82%, 76.59%, 71.93%, and 72.49%, respectively, while the Macro-F1 scores are 73.31%, 73.90%, 76.05%, 71.98%, and 72.42%, respectively. | MCP-SM (DEBERTa-base) and MCP-SM (DEBERTa-large) achieve accuracy rates of 91.8% and 92.4% on QQP, and 88.9% and 89.7% on MRPC, respectively. MCP-SM (MacBERT-base) and MCP-SM (MacBERT-large) achieve accuracy rates of 73.90% and 74.82% on Medical-SM, with Macro-F1 scores of 73.31% and 73.90%, respectively. MCP-SM (ARBERT) achieves an accuracy rate of 76.59% and a Macro-F1 score of 76.05% on the MQ2Q dataset. MCP-SM (AraBERT-base) and MCP-SM (AraBERT-large) achieve accuracy rates of 71.93% and 72.49% on XNLI, with Macro-F1 scores of 71.98% and 72.42%, respectively. |
R2-Net [26] | The accuracy rates are 90.3%, 81%, 89.2%, 92.9%, 91.6%, and 84.3%, respectively. | The accuracy rates of R2-Net on the Full test and Hard test of the SNLI dataset are 90.3% and 81%, respectively, 89.2% on SICK, 92.9% on SciTail, and 91.6% and 84.3% on Quora and MSRP, respectively. |
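As referenced in the DRCN row above, several models in this table are evaluated with the ranking metrics MAP and MRR. The sketch below shows how these two metrics are conventionally computed from ranked candidate lists; the relevance labels are illustrative assumptions, not values from any of the cited benchmarks.

```python
# Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) over ranked runs.
def average_precision(relevance):
    """relevance: 0/1 labels of the ranked candidates for one query."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant hit
    return sum(precisions) / max(hits, 1)

def reciprocal_rank(relevance):
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank  # inverse rank of the first relevant candidate
    return 0.0

# Two toy queries with ranked answer candidates (1 = relevant).
ranked_runs = [[1, 0, 1, 0], [0, 1, 0, 0]]
map_score = sum(average_precision(r) for r in ranked_runs) / len(ranked_runs)
mrr_score = sum(reciprocal_rank(r) for r in ranked_runs) / len(ranked_runs)
print(f"MAP = {map_score:.3f}, MRR = {mrr_score:.3f}")  # MAP = 0.667, MRR = 0.750
```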
API | API-Providing Companies | Supported Languages | Interface Type | Application Scenario | Advantages | Drawbacks
---|---|---|---|---|---|---|
Google Cloud Natural Language API | Google | polyglot | RESTful API interface | Suitable for search engine optimization, information retrieval, sentiment analysis, and many other scenarios (a generic request sketch for this kind of RESTful interface follows this table) | (1) Leading technology, based on Google’s powerful AI and NLP technology. (2) Provides rich natural language processing functions. (3) High accuracy and reliability. | (1) May be subject to Google’s terms of service and geographic restrictions. (2) May be expensive to use, especially for large-scale projects. |
IBM Watson Natural Language Understanding API | IBM | polyglot | RESTful API interface | It is suitable for Q&A system, intelligent customer service, sentiment analysis, and other fields. | (1) Combined with Watson’s cognitive computing capabilities, it provides in-depth text understanding and matching. (2) Powerful customization capabilities that can be configured according to specific needs. | (1) May require higher technical investment and learning costs. (2) IBM services and fees may be more expensive. |
Microsoft Azure Text Analytics API | Microsoft | polyglot | RESTful API interface | It is suitable for scenarios such as text mining, theme analysis, and sentiment tendency judgement. | (1) Seamlessly integrates with other services on the Azure platform. (2) Provides efficient and stable text-processing capabilities. | (1) May be limited by the ecosystem and fee structure of the Azure platform. (2) Integration may be cumbersome for non-Azure users. |
Amazon Comprehend API | Amazon | polyglot | RESTful API interface | It is suitable for scenarios such as content analysis, topic detection, and sentiment analysis. | (1) Easy to integrate with other AWS services as part of AWS. (2) Provides efficient and scalable text-processing capabilities. | (1) May be subject to AWS terms of service and geographic coverage. (2) Additional integration work may be required for non-AWS users. |
Rasa NLU | Rasa | polyglot | RESTful API interface | It is widely used in customer service, online helpdesks, education platforms, e-commerce, intelligent assistants, and other scenarios, aiming to enhance user engagement and satisfaction. | (1) Provides flexible and powerful natural language understanding functions, including user intent recognition, entity extraction, and parameter optimization. (2) Supports pre-loaded training models, providing developers with a convenient way to build dialogue-driven automated intelligent customer service assistants. (3) Active community and leading technology, tracking the most cutting-edge techniques and applying them to the system. | (1) For beginners, some learning cost may be required to familiarize themselves with and master its usage and API interface. (2) Additional customization and optimization work may be required when dealing with complex or domain-specific natural language understanding tasks.
Hugging Face Transformers | Hugging Face | polyglot | Model libraries and tool sets | The Hugging Face Transformers library is widely used for tasks related to natural language processing, such as text categorization, sentiment analysis, language translation, and so on. It is particularly popular with developers, providing them with easy-to-use and flexible interfaces to handle NLP tasks. | (1) Provides a range of high-quality pre-trained models, enabling developers to rapidly build and deploy NLP applications. (2) Has an active community and developer ecosystem that provides a wealth of resources and support. (3) High ease of use and flexibility, adapting to a variety of NLP tasks and scenarios. | (1) For complex NLP tasks, in-depth NLP knowledge and experience may be required for model selection and tuning. (2) In some specific domains or scenarios, additional data labeling and model training work may be required. |
Semantic Scholar API | Semantic Scholar | polyglot | RESTful API interface | Focusing on text matching and searching in the academic field, it is suitable for researchers, scholars, and academic institutions to conduct literature searches, citation analysis, research trend tracking, and so on. | (1) Provides high-quality academic text-matching results that can accurately identify relevant academic literature and research content. (2) Integrates a wealth of academic resources covering a wide range of subject areas. (3) Provides a flexible API interface, which is convenient for users to integrate into their own systems or applications. | (1) Primarily for academic fields, with limited applicability to other non-academic fields. (2) May require paid subscription or purchase of specific services to enjoy full functionality. (3) May require higher computational resources and time costs when dealing with large-scale or complex academic text-matching tasks. |
TexSmart Text Matching API | TexSmart | polyglot | HTTP Online Service Interface | This API supports text content matching, accessed via HTTP-POST, providing efficient text matching. | (1) High accuracy to ensure precise text matching. (2) Provides flexible API interfaces and configuration options to meet specific needs. (3) Fast response time, suitable for real-time or high concurrency scenarios. | (1) Relatively low visibility and small user base. (2) May lack thorough documentation and community support. |
Baidu AI Text Similarity API | Baidu | polyglot | RESTful API interface | An API provided by Baidu for calculating the similarity of two texts, supporting multiple languages and domains. | (1) Strong Chinese-processing capability, especially good at processing Chinese text. (2) High stability and reliability to ensure service quality. (3) Provides rich API interfaces and SDKs for easy integration. | (1) May be subject to geographical restrictions, and use outside China may be limited or delayed. (2) Further customized development may be required for specific needs.
AliCloud NLP Text Similarity API | Alibaba | polyglot | API Interface | AliCloud’s natural language processing service also includes an API for text similarity matching. | (1) Strong technical strength and support for high scalability and flexibility. (2) Synergizes with other AliCloud services for easy integration and expansion. (3) Provides powerful data-processing and analysis capabilities. | The cost of use may be high and require some investment of resources. |
Jingdong AI Text Matching API | JD.com | polyglot | API Interface | The Jingdong AI platform provides text-matching services for a variety of text-matching needs. | (1) Applicable to text-matching needs in the field of e-commerce, with industry advantages. (2) Rich in data resources, providing accurate matching potential. (3) Can help merchants achieve accurate marketing and promotion. | (1) There may be domain limitations that apply to specific scenarios. (2) Openness and usability may be limited and need to be evaluated. |
Tencent Cloud NLP Text Matching API | Tencent | polyglot | RESTful API interfaces and SDKs | Tencent Cloud’s natural language processing service includes a text-matching function that can support a variety of application scenarios. | (1) Leading technology, providing advanced NLP technology and algorithms. (2) Supports multi-language processing to meet global needs. (3) Provides customized solutions to meet specific industry and scenario needs. | (1) Cost of use may be high and budgetary constraints need to be considered. (2) Integration into existing systems may present some technical challenges. |
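As referenced in the first row above, most of these services are consumed through a RESTful, JSON-over-HTTP interface. The sketch below illustrates only that generic calling pattern: the endpoint URL, authentication header, request fields, and response field are hypothetical placeholders, not the documented schema of any listed provider.

```python
# Hypothetical RESTful text-similarity call; endpoint, credential, and field
# names are placeholders, not any provider's documented API.
import requests

ENDPOINT = "https://api.example.com/v1/text-similarity"  # placeholder URL
API_KEY = "YOUR_API_KEY"                                  # placeholder credential

payload = {
    "text_1": "How can I track my order?",
    "text_2": "Where do I check my delivery status?",
    "language": "en",
}
response = requests.post(
    ENDPOINT,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
response.raise_for_status()
print(response.json().get("similarity"))  # response field assumed for illustration
```

In practice, the request and response schemas, authentication scheme, and rate limits of each provider's documentation take precedence over this generic pattern.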
Classification of datasets: the benchmarks used by the surveyed models fall into natural language inference and implication (NLI) datasets, Q&A and matching datasets, sentiment analysis datasets, and other specific datasets. The evaluation method used by each model is summarized first, followed by the size and source of each dataset; the datasets on which each model is evaluated are listed in the results table above (a loading sketch for the GLUE-hosted QQP dataset appears after these tables).

Models | Evaluation Method
---|---|
MaLSTM [9] | Acc/Pearson correlation (r)/Spearman’s ρ/MSE
DBDIN [55] | Acc
MKPM [11] | Acc
Enhanced-RCNN [57] | Acc/F1
DRCN [13] | Acc/MAP/MRR
BiMPM [15] | Acc/MAP/MRR
GMN [60] | Acc/F1
DRr-Net [14] | Acc
DC-Match [28] | Acc/Macro-F1
MCP-SM [29] | Acc/Macro-F1
R2-Net [26] | Acc
BERT [17] | GLUE
SpanBERT [18] | GLUE
XLNet [19] | GLUE
Roberta [20] | GLUE
GPT [21] | Acc/MC/F1/PC
GPT3 [23] | Acc
PANGU-Σ [25] | Acc

Dataset | Size | Source
---|---|---|
MNLI | 392 k | Transcribed speech, novels, government reports
QNLI | 108 k | Converted from SQuAD
SNLI | 570,152 | Flickr30 k corpus (Young et al., 2014, [95])
XNLI | - | Developed by the Facebook AI Institute in collaboration with a team of NYU researchers
RTE | 2.5 k | -
SciTail | 27,026 | Multiple-choice science exams and web sentences (Khot et al., 2018, [96])
CMNLI | - | CLUE (Liang Xu et al., 2020, [97])
OCNLI | 5.6 k | OCNLI (Hai Hu et al., 2020, [98])
TREC-QA | - | -
WikiQA | - | -
SelQA | - | -
LCQMC | 120,000 | LCQMC (Liu et al., 2018, [99])
BQ | 260,068 | BQ (Chen et al., 2018a, [100])
QQP | 404,276 | Quora website (Iyer et al., 2017, [101])
MQ2Q | - | -
SST-2 | 67 k | -
MRPC | 5801 | Online news
SICK | 10,000 | 8 K image Flickr dataset and STS MSR video description dataset (Marelli et al., 2014, [102])
Ant Financial | 492 k | Alibaba
MultiNLI | 433,000 | Built by researchers at New York University, Princeton University, and Georgetown University
Medical-SM | 48,008 | Chinese search engine
CoLA | 10,657 | Books and journals on language theory
STS-B | 5.7 k | -
AFQMC | - | CLUE (Liang Xu et al., 2020, [97])
CSL | 5801 | CLUE (Liang Xu et al., 2020, [97])
MSRP | - | News clusters from the web (Dolan and Brockett, 2005, [103])
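As referenced above, several of the matching datasets (QQP, MRPC, QNLI, MNLI, etc.) are distributed through the GLUE benchmark. The sketch below shows one way to load QQP with the Hugging Face datasets library; it assumes network access to the Hugging Face Hub, the field names follow the QQP configuration, and the snippet is a minimal illustration rather than a prescribed pipeline.

```python
# Minimal sketch: loading the GLUE QQP sentence-pair matching dataset.
from datasets import load_dataset

qqp = load_dataset("glue", "qqp")   # returns train / validation / test splits
example = qqp["train"][0]
# Each example is a question pair with a binary duplicate label.
print(example["question1"], "||", example["question2"], "||", example["label"])
```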
Domains | Time/Author | Main Contributions | Limitation | Future Research Directions |
---|---|---|---|---|
education | (2024, Alkhidir T et al.) [106] | (1) An AI-based curriculum-planning methodology is proposed to improve planning efficiency by automatically aligning learning objectives within and outside of disciplines through semantic matching techniques. (2) A data science pipeline is designed to analyze the similarity of learning objectives and identify subject clustering and dependencies. (3) Provides a visualization tool for course similarity, which facilitates course cross-mapping analysis by experts. (4) Reveals new similarities between courses and provides a basis for optimizing course structure. | (1) Current research focuses mainly on semantic aspects and does not comprehensively consider the cognitive complexity of learning objectives. (2) Although prerequisite inferences based on student performance have been proposed, the specific method of implementation has yet to be investigated. | (1) Using AI technology to automatically extract learning objectives and keywords from textbooks and link courses across disciplines. (2) Combine learning objectives with cognitive skills to comprehensively assess the complexity of learning objectives. (3) Integrate student performance to study in depth the relationship between prerequisites and learning objectives and optimize curriculum planning. (4) Explore the application of AI technology in personalized course recommendation and adjustment. |
(2021, Hayden K A et al.) [107] | This study provides a comprehensive review of the research methodology of text-matching software in education, reveals its multiple application roles in higher education, and provides important evidence support and methodological insights for academic integrity research. | The study does not reach definitive conclusions about the effectiveness of text-matching software, and the limitations of the software and of the available evidence remain poorly understood, which reflects the limitations and challenges of current research. | Future research should further explore the practical effects of text-matching software, focus on new developments in the field of academic integrity, and delve into the limitations of the software and its evidence to provide a more comprehensive reference for educational decision-making. |
medicine | (2024, Jeong J et al.) [108] | (1) The X-REM method is proposed for radiology report retrieval generation using image and text matching, and its superiority is verified by quantitative analysis and manual evaluation. (2) X-REM better captures image–report interactions, with multimodal encoders and similarity metrics to enhance performance. | (1) There are still gaps in X-REM compared to reports prepared by radiologists. (2) The study is based on a single dataset, which may introduce bias, and performance on external datasets is not empirically validated. (3) The evaluation process, although optimized for clinical and natural language metrics, may have overlooked other important factors. | (1) Expand dataset diversity and size to validate model generalization capability. (2) Deepen the research on multimodal coding and similarity metrics to improve the accuracy of report generation. (3) Incorporate more medical knowledge and contextual information to enhance model applicability. (4) Study more influencing factors and develop more comprehensive evaluation criteria. |
(2019, Luo Y F et al.) [109] | A novel approach combining dictionary lookups and deep learning models is proposed for medical text entity-linking and shows advantages in capturing semantic similarities. | The method is limited in handling data with diverse semantics and no uniform representation, and it is affected by the small dataset size, resulting in limited performance improvement. | Data inconsistencies and ambiguities need to be resolved, more efficient methods of handling semantically diverse data need to be explored, and datasets need to be expanded to improve model performance. |
(2022, Wang L et al.) [110] | This paper proposes the BMA integrated depth architecture, which combines BERT, MSBiLSTM, and a dual-attention mechanism, to solve the semantic similarity matching problem in automatic medical Q&A systems and significantly improve the performance. | BMA model may have limitations in processing specific types of medical texts, its generalization capabilities need to be strengthened, and it may require high computational resources. In addition, its performance is affected by the quality and quantity of training data. | In the future, the application of the model on more diverse and large-scale medical datasets can be explored to optimize the model architecture and improve its generalization capability and efficiency. At the same time, the combination of more natural language processing techniques with BMA can be investigated to further improve performance, and new ways to deal with non-canonical text and expressive diversity can be investigated. | |
economics | (2024, Ajaj S H) [111] | (1) Highlights the importance of job-posting quality in recruitment and proposes the concept of an AI-optimized assistant to improve recruitment efficiency. (2) Proposes an AI assistant based on a knowledge base and semantic matching, aimed at providing accurate job postings and industry-specific recommendations. | Although AI optimization assistants can theoretically improve the quality of job postings, their effectiveness may be affected by a variety of factors, such as the type of job, industry characteristics, and the search habits of job seekers. | Developing and testing AI optimization assistants, studying their applicability across different industries and positions, exploring in-depth candidate search behaviors, and considering more recruitment-related factors. |
(2024, Ren H. et al.) [112] | (1) Fills the research gap of policy tourism in public management cooperation, and puts forward a new perspective on policy tourism for economic cooperation. (2) Using a non-parametric matching method, the study empirically analyzes the impact of policy tourism on inter-provincial trade flows in China, and finds that its effect decreases over time and has a more significant impact on less developed jurisdictions. | (1) The study is mainly based on the Chinese context and may have geographical and cultural limitations. (2) Failure to fully control for all factors that may affect interprovincial trade flows, so results may be biased. (3) Lack of in-depth description and classification of specific forms and mechanisms of policy tourism. | (1) Expand the scope of the study to other countries and regions to explore the universality and specificity of policy tourism. (2) Explore in depth the specific forms and mechanisms of policy tourism, and improve understanding of the path and extent of its impact. (3) Combine more data and methods to improve the accuracy and reliability of the impact estimates of policy tourism. (4) Study the interaction between policy tourism and other public management network efforts. |
(2012, Gopalakrishnan V et al.) [113] | An end-to-end unsupervised system is proposed for matching product titles in different formats. Highly accurate matching is achieved by enriching titles with search engine results, calculating word importance, and optimizing the number of search engine calls. Experiments verify that the system outperforms the IDF-based cosine similarity method in terms of F1 score and is efficient and accurate. | The current approach is limited when dealing with product titles with no distinguishing features, and an over-reliance on search engines can lead to performance fluctuations. | (1) Enhancing domain adaptation: improving algorithms to adapt to different product domains. (2) Combining multiple learning methods: fusing unsupervised and semi-supervised learning to improve performance. (3) Reducing reliance on external data sources: developing internal data sources or alternatives. (4) Fusing contextual information: utilizing more contextual information to improve matching accuracy. (5) Handling multimodal data: investigating methods to fuse information from images, videos, etc., to optimize matching. |
(2020, Akritidis L K et al.) [114] | UPM enables unsupervised product matching through clustering, where combinations of title terms are constructed and scored, with the highest-scoring combination representing the product identity. Morphological analysis is performed to identify useful tokens and assign them to virtual fields. Virtual fields are scored and used to evaluate the combinations, considering the position and frequency of the combination in the title. Once clusters are formed, post-processing validation is performed by blocking conditions to avoid entities coexisting in the same cluster. UPM avoids pairwise comparisons between products, reduces computational complexity, does not rely on additional blocking strategies, and has a small dependence on hyperparameters. | The main limitation of the UPM algorithm is its restricted generality: it is designed and optimized primarily for the e-commerce domain and may not be applicable to other domains. In addition, the algorithm relies on blocking conditions that require domain expertise and in-depth knowledge of the data, increasing the complexity of implementation. The choice of hyperparameters may also affect the matching results and needs to be tuned for specific applications. Finally, UPM currently supports mainly English titles, and its adaptability to multilingual environments needs to be improved. | Future research can extend and optimize the UPM algorithm in the following areas: exploring cross-domain applications, automated blocking condition formulation, automatic tuning of hyperparameters, support for multi-language environments, and integration with other text-processing techniques. At the same time, attention should be paid to the real-time performance and scalability of the algorithm to accommodate large-scale datasets and rapid updates. |
(2013, De Bakker M et al.) [115] | This research proposes and validates a new hybrid similarity method that combines product title and attribute information to significantly improve the accuracy of online product duplicate detection. By constructing a weighted similarity based on attribute key-values and extracting model words from key-value pairs, the method in this paper outperforms the existing techniques in several metrics and provides a new and effective solution in the field of product duplicate detection. | Although the hybrid similarity method performs well, its performance depends on the accuracy and completeness of the dataset, its processing efficiency for large-scale datasets needs to be improved, and it is mainly applicable to products with explicit attribute information. | (1) Introducing more metrics and algorithm optimization: future research could explore more string distance metrics and optimize algorithms to improve processing efficiency. (2) Application of ontology and domain knowledge: research on how to use ontology and domain background knowledge to assist duplicate detection and improve accuracy and efficiency. (3) Research on method extension and adaptability: explore the extension of the method to more types of products, and study how to improve its adaptability and practicality in different domains and scenarios. | |
industry | (2024, Zheng K et al.) [116] | (1) Propose an image–text matching method based on a priori knowledge graph, which improves the model’s ability to understand the real scene. (2) Transformer and BERT are utilized in the matching module to improve the model accuracy. (3) Introducing relative entropy to optimize the loss function and reduce inter-modal feature differences. (4) Experimentally verify the contribution of each module to prove the effectiveness of the method. | (1) Challenges of knowledge graph construction and maintenance of updates are not mentioned. (2) High complexity may affect real-time performance. (3) Other potential problems of cross-modal matching are not explored in depth. | (1) Explore more advanced fusion mechanisms to improve matching performance. (2) Study the effective use and dynamic update of knowledge graph. (3) Optimize the model structure to reduce complexity and improve real-time performance. (4) Study other issues of cross-modal matching in depth. |
agriculture | (2024, Song Y et al.) [117] | Proposes a local matching-based retrieval method for digital agricultural text information to improve retrieval efficiency and provide an effective and scientific retrieval approach for such information. | (1) The text notes that the data collected were not comprehensive, which may affect the method’s widespread use and accuracy. (2) Although the experimental results show high recall and precision, the text does not discuss in detail how the method performs on different types or sizes of agricultural text data. | (1) Research in this area needs to be further strengthened to collect more comprehensive data to improve the broad applicability and accuracy of the method. (2) Explore the performance of the method on different types and sizes of agricultural text data, as well as the possibility of its application in other related fields. (3) Study how to integrate more advanced natural language processing techniques into the method to further enhance the retrieval of digital agricultural text information. |
natural language processing (NLP) | (2022, Xu B et al.) [118] | The multimodal named entity recognition framework proposed in this paper overcomes the strict-matching limitation of traditional approaches through a cross-modal matching and alignment mechanism, and effectively fuses the semantic information of text and image. The specially designed cross-modal matching (CM) module dynamically adjusts the utilization of visual information according to the similarity between text and image, while the cross-modal alignment (CA) module narrows the semantic gap between the two modalities and improves the recognition accuracy. | The limitations of the framework are mainly in terms of dataset dependency, high model complexity, and limited interpretability. Specific datasets may not fully reflect the complexity of real social media data, model complexity may affect the efficiency of handling large-scale data, and the opacity of the model decision-making process increases the difficulty of user understanding and trust. | Future research can explore the directions of multimodal data extension, dynamic parameter adjustment, contextual information fusion, and model interpretability enhancement. By combining research in these directions with existing frameworks, the performance of multimodal named entity recognition can be further improved and its applications in areas such as social media can be expanded. |
(2018, Zhu G et al.) [105] | This paper proposes two innovative approaches, SCSNED and Category2Vec, to improve the performance of entity disambiguation by utilizing semantic similarity in knowledge graphs and jointly learned embedding models, respectively. These two approaches are validated in real-world datasets. | Although the proposed method is effective, the aspects of universality, accuracy of semantic similarity computation, and real-time performance still need to be further verified and optimized. | Future research could explore more advanced semantic representations, cross-domain cross-language disambiguation techniques, optimization of real-time performance, and interactive disambiguation methods incorporating user feedback. | |
(2024, Gong J et al.) [119] | A unified framework named MORE is proposed to effectively solve the over-segmentation and over-merging problems in the author name disambiguation (AND) task. By combining LightGBM, HAC, and improved HGAT (iHGAT) techniques, and introducing the SimCSE algorithm to optimize the paper representation, MORE significantly improves the performance on the AND task. | (1) Complexity of academic domains: existing author name disambiguation methods have limitations in dealing with the complexity and diversity of academic domains, especially when dealing with special or complex author names. (2) Maintenance of knowledge graph: with the growth and update of academic data, it is a challenge to effectively maintain and update the knowledge graph in terms of timeliness and accuracy. (3) Granularity of disambiguation: current research focuses on disambiguation at the paper level, with less attention paid to disambiguation at finer granularity, such as at the paragraph or sentence level. | (1) Technological innovation: explore and apply more advanced natural language processing and machine learning techniques to improve the accuracy and efficiency of author name disambiguation. (2) Dynamic updating of knowledge graph: to study how to construct and maintain a knowledge graph that can be dynamically updated and expanded to adapt to changes in academic data. (3) Fine-grained disambiguation: in-depth study of fine-grained disambiguation methods to improve the ability to identify and handle author name conflicts in fine-grained text analysis tasks. (4) Practical activities and competitions: promote the development of related technologies and encourage researchers to explore new methods and ideas by conducting practical activities such as online knowledge alignment competitions. | |
(2011, Arifoğlu D et al.) [120] | Two innovative techniques are proposed for solving difficult problems in historical document analysis. One is a cross-document word-matching-based approach, which realizes the automatic segmentation of words in damaged and deformed historical documents; the other is an image analysis method based on line and sub-pattern detection, which has been successfully applied to pattern matching and duplicate pattern detection in images of Islamic calligraphy (Kufic script). These two techniques not only provide new tools for the study of historical documents, but also expand the technical boundaries of the document analysis field. | Although certain results have been achieved in word segmentation of historical documents and pattern matching of Islamic calligraphy images, these techniques still face problems such as limited recognition accuracy, strong dependence on data sources, difficulties in processing complex images, and algorithmic performance bottlenecks, which are challenges that limit the wide application and in-depth development of the existing technologies. | Future research will be dedicated to improving the accuracy and efficiency of historical document recognition and Islamic calligraphy image analysis through technological innovation and algorithm optimization. At the same time, it is possible to explore multimodal data fusion, develop user-friendly interactive tools, and expand the application of the research results to the digitization of ancient books and the preservation of cultural heritage to promote the progress and development of related disciplines. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).