Sentence Embedding Generation Framework Based on Kullback–Leibler Divergence Optimization and RoBERTa Knowledge Distillation
Figure 1. Construction of word vector space and text similarity calculation.
Figure 2. The similarity between two relevant sentences. (a) Before KLD optimization; (b) after KLD optimization.
Figure 3. The similarity between two irrelevant sentences. (a) Before KLD optimization; (b) after KLD optimization.
Figure 4. Knowledge distillation model.
Figure 5. Comprehensive analysis of loss and metrics over training rounds.
Abstract
1. Introduction
- We propose a novel KLD-enhanced word vector method that uses KLD as a metric tool and iterative optimization to generate vectors with better semantic differentiation. This approach improves semantic feature representation in practical contexts. However, it has limitations in handling context dependency and polysemy, which are better addressed by deep learning methods. This motivated the development of a second method to overcome these challenges.
- We present a RoBERTa-based knowledge distillation [3] framework enhanced with Dynamic Principal Component Smoothing (DPCS). This framework combines RoBERTa’s deep semantic insights with traditional embedding techniques, transferring contextual knowledge to improve semantic fidelity and computational efficiency. DPCS further refines sentence representations by adapting principal components to contextual nuances, enhancing adaptability and precision across multiple semantic layers. This method excels in textual similarity tasks, demonstrating superior robustness and generalization in complex NLP applications.
2. Related Work
2.1. Word Vector Models
2.2. Methods for Semantic Text Similarity Calculation
3. Introduction of the Algorithm
3.1. Sentence Embedding Generation Method Based on KLD Optimization
Algorithm 1. KL divergence matrix computation.
Require: stop words, word vectors, sentences, TF-IDF weights
Ensure: updated word vectors saved to a file
Objective: minimize Loss
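The KL-divergence matrix at the heart of Algorithm 1 is not spelled out above; the following is a minimal numpy sketch of the pairwise computation, assuming each word is represented by a discrete distribution (the function names `kl_divergence` and `kl_matrix` are illustrative, not from the paper):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) between two discrete distributions; a small epsilon
    smooths zero entries so the logarithm stays finite."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def kl_matrix(dists):
    """Pairwise KL-divergence matrix over a list of distributions."""
    n = len(dists)
    m = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            m[i, j] = kl_divergence(dists[i], dists[j])
    return m
```

Note that KL divergence is asymmetric, so the resulting matrix is not symmetric in general.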
3.2. Sentence Embedding Generation Framework Based on RoBERTa Knowledge Distillation
Algorithm 2. Model to train and save embeddings.
Input: sentences S, soft labels L, save path p
Output: trained embedding model saved to p
- Obtaining Soft Labels for Sentences from the Pre-trained Language Model: Use the pre-trained language model RoBERTa to generate hidden-layer representations of sentences and compute the soft labels. Given an input sentence S, tokenize it into word IDs, and then process it through the pre-trained model to obtain the hidden-layer outputs H. In Equation (2), h_i represents the hidden-layer representation of the i-th word. To obtain the sentence's embedding representation, compute the average of the hidden-layer outputs: in Equation (3), t_S is the soft label of the sentence, representing its vector representation.
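In practice the hidden states H come from RoBERTa's last layer (e.g., via the Hugging Face `transformers` library); the sketch below shows only the pooling step of Equation (3), with a random array standing in for the model output so it stays self-contained:

```python
import numpy as np

def soft_label(hidden_states, attention_mask):
    """Average hidden-layer outputs over real (non-padding) tokens,
    yielding the sentence's soft label (Equation (3)).

    hidden_states: (seq_len, dim) array, e.g. RoBERTa's last hidden layer.
    attention_mask: (seq_len,) array of 0/1 flags marking real tokens.
    """
    mask = np.asarray(attention_mask, dtype=float)[:, None]
    return (hidden_states * mask).sum(axis=0) / mask.sum()

# Stand-in for RoBERTa output: 6 tokens (last 2 are padding), 768 dims.
H = np.random.randn(6, 768)
t = soft_label(H, [1, 1, 1, 1, 0, 0])
```

Masked pooling matters because averaging over padding tokens would dilute the sentence representation.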
- Preprocessing Sentences to Match the Input Requirements of the Above Model: Preprocess input sentences by removing stop words and tokenizing. Then average (or weighted-average) the word vector representations. For a preprocessed sentence S′, its vector is calculated as shown in Equation (4), where n is the number of words in the sentence and w_i is the word vector of the i-th word generated by the KLD expansion algorithm.
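A minimal sketch of this preprocessing step, using a toy stop-word list and toy two-dimensional vectors in place of the real KLD-optimized embeddings:

```python
import numpy as np

STOP_WORDS = {"the", "a", "is", "of"}          # toy stop-word list

# Toy stand-ins for the KLD-optimized word vectors assumed by Equation (4).
word_vectors = {"cat": np.array([1.0, 0.0]),
                "sat": np.array([0.0, 1.0]),
                "mat": np.array([1.0, 1.0])}

def sentence_vector(sentence):
    """Remove stop words, tokenize, and average the word vectors."""
    tokens = [w for w in sentence.lower().split()
              if w not in STOP_WORDS and w in word_vectors]
    return np.mean([word_vectors[w] for w in tokens], axis=0)

v = sentence_vector("the cat sat")   # averages the "cat" and "sat" vectors
```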
- Training the Above Embedding Model Using Soft Labels: Match the preprocessed representation of each sentence with its soft label and train the embedding model with the mean-squared-error loss function. The goal is to minimize the mean squared error between the outputs of the embedding model and the soft labels, as shown in Equation (5), where N is the number of training samples, f(s_i) is the student model's predicted output for sentence s_i, and t_i is the soft label generated by the teacher model.
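The paper's student is the KLD-based embedding model; purely for illustration, the sketch below trains a linear student on random stand-in data by plain gradient descent on the MSE objective of Equation (5):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_out = 64, 16, 8
X = rng.normal(size=(N, d_in))          # preprocessed sentence vectors s_i
T = rng.normal(size=(N, d_out))         # teacher soft labels t_i

W = np.zeros((d_in, d_out))             # linear student f(s) = s @ W
lr = 0.1
for _ in range(500):
    pred = X @ W
    grad = 2.0 / N * X.T @ (pred - T)   # gradient of the MSE in Equation (5)
    W -= lr * grad

mse = float(np.mean((X @ W - T) ** 2))
```

Starting from W = 0, the loss equals the mean squared norm of the soft labels and decreases monotonically toward the least-squares optimum.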
- Generating New Sentence Embeddings Using the Trained Embedding Model: In Equation (6), v_new is the vector representation obtained after preprocessing the new sentence S_new, and f is the trained embedding model.

Finally, the Dynamic Principal Component Smoothing (DPCS) algorithm is employed to compute the vector representation of the text. Combining inverse word frequencies with weighted averaging, DPCS first calculates the weight of each word and generates a weighted average vector for the sentence. Unlike the traditional approach of simply averaging the word embeddings in a sentence, DPCS dynamically selects and retains the principal components with higher contribution rates, effectively removing the low-variance components that are more likely to be noise. As a result, DPCS preserves the main semantic information of the sentence while reducing interference from irrelevant or noisy words, thereby enhancing the quality of the text representation. The similarity between sentences is then computed with cosine similarity.

Compared to the traditional smoothing inverse frequency (SIF) method, DPCS offers significant advantages when handling high-dimensional word embeddings, particularly through its dynamic removal of low-variance components. This enables a more precise focus on the core semantic structure of the sentence and avoids the over-weighting of irrelevant words seen with simple averaging. The main steps of DPCS are: calculating word weights, computing the weighted average vector of the sentence, dynamic principal component removal, and calculating cosine similarity.
- Calculating Word Weights: The weight of each word is calculated from its word frequency P and a smoothing parameter a, as shown in Equation (7). Here, a is a small adjustment parameter, usually set to 0.001.
- Calculating the Weighted Average Vector of the Sentence: Compute the weighted average of the word vectors in the sentence to obtain the representation vector v_S, as shown in Equation (8), where w_i is the word vector of each word and n is the number of words in the sentence S.
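Assuming the SIF-style weight w(t) = a / (a + P(t)) described above, the two steps (Equations (7)–(8)) can be sketched as follows; the corpus and vectors are toy stand-ins:

```python
import numpy as np
from collections import Counter

a = 0.001                                        # smoothing parameter

def word_weights(corpus_tokens):
    """w(t) = a / (a + P(t)), with P(t) the unigram frequency (Equation (7)).
    Frequent words receive smaller weights."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {t: a / (a + c / total) for t, c in counts.items()}

def weighted_sentence_vector(tokens, vectors, weights):
    """Weighted average of word vectors (Equation (8))."""
    vs = [weights[t] * vectors[t] for t in tokens if t in vectors]
    return np.sum(vs, axis=0) / len(vs)

corpus = "the cat sat on the mat the end".split()
weights = word_weights(corpus)
vectors = {"cat": np.array([1.0, 0.0]), "mat": np.array([0.0, 1.0])}
v = weighted_sentence_vector(["cat", "mat"], vectors, weights)
```

Down-weighting frequent words is what keeps function words from dominating the sentence vector.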
- Dynamic Principal Component Removal: To eliminate common directions in the text representation, Principal Component Analysis (PCA) is applied. In PCA, we calculate the eigenvalues of the principal components to measure the variance explained by each; the variance contribution rate of the j-th principal component is given by Equation (9). To select an appropriate number of principal components, a threshold δ (e.g., 95%) is set, and the smallest k is chosen such that the cumulative variance contribution reaches this threshold, as shown in Equation (11). The first k principal components are retained, while the remaining components (which account for low variance or noise) are discarded. After removing the low-variance components, the adjusted sentence vector is obtained as shown in Equation (12).
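A minimal numpy sketch of this selection step, assuming DPCS projects the sentence vectors onto the retained top-k subspace (the function name `dpcs_adjust` is illustrative):

```python
import numpy as np

def dpcs_adjust(vectors, delta=0.95):
    """Project sentence vectors onto the principal components whose
    cumulative variance contribution first reaches the threshold delta,
    discarding the low-variance (noisy) directions (Equations (9)-(12))."""
    mean = vectors.mean(axis=0)
    X = vectors - mean
    cov = np.cov(X, rowvar=False)                # covariance of the vectors
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]            # sort by variance, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = eigvals / eigvals.sum()             # contribution rates (Eq. (9))
    k = int(np.searchsorted(np.cumsum(ratios), delta) + 1)
    U = eigvecs[:, :k]                           # retained components
    return X @ U @ U.T + mean                    # low-variance parts removed
```

With delta close to 1 almost nothing is removed; lowering delta discards progressively more of the low-variance tail.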
- Calculating Cosine Similarity: As shown in Equation (13), for two sentences S_1 and S_2, compute the cosine similarity of their vector representations v_1 and v_2. Here, v_1·v_2 is the dot product of the two vectors, and ||v_1|| and ||v_2|| are their magnitudes (i.e., the Euclidean norms of the vectors).
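Equation (13) translates directly into numpy:

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Equation (13): dot product over the product of Euclidean norms."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```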
4. Experiments
4.1. Experiment Preparation
4.2. Related Algorithms
4.3. Comparative Experiments and Results
- Precision measures the proportion of correctly identified positive instances out of all instances predicted as positive. It is particularly useful when the cost of false positives is high, as it ensures that only relevant positive predictions are made.
- Recall, on the other hand, evaluates the proportion of correctly identified positive instances out of all actual positive instances. It is crucial when the cost of false negatives is high, as it ensures that most of the true positives are correctly identified.
- F1 score is the harmonic mean of precision and recall, offering a balanced measure that accounts for both false positives and false negatives. This metric is especially useful when the dataset is imbalanced, as it provides a more comprehensive view of model performance by combining the strengths of precision and recall into a single value.
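The three classification metrics above can be computed directly from the true/false positive and negative counts; a small self-contained sketch:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Guarding the divisions avoids crashes on degenerate cases (e.g., a model that predicts no positives at all).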
- Pearson correlation coefficient: This statistic measures the strength and direction of the linear relationship between two continuous variables. Its range is between −1 and 1, where 1 indicates a perfect positive correlation, −1 indicates a perfect negative correlation, and 0 indicates no linear correlation. It is computed as shown in Equation (14), where x_i and y_i are the observed values of variables X and Y, and x̄ and ȳ are the means of X and Y, respectively.
- Spearman correlation coefficient: This coefficient measures the strength and direction of the monotonic relationship between two variables, without requiring a linear relationship. It is calculated from the ranks of the variables, and its range is also between −1 and 1. Values close to 1 or −1 indicate a strong monotonic positive or negative correlation, respectively, and 0 indicates no monotonic relationship, as shown in Equation (15), where d_i is the difference in ranks of the i-th pair of observations, and n is the number of observations.
- Mean absolute error (MAE): This metric measures the difference between predicted and actual values in a regression model. It calculates the average of the absolute differences between predicted and actual values.
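A self-contained numpy sketch of the three regression-style metrics (Equations (14)–(15) plus MAE); the Spearman version assumes no tied ranks:

```python
import numpy as np

def pearson(x, y):
    """Equation (14): covariance of x and y over the product of their
    standard deviations, computed from deviations about the means."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

def spearman(x, y):
    """Equation (15): 1 - 6 * sum(d_i^2) / (n (n^2 - 1)), with d_i the
    rank differences (valid when there are no ties)."""
    rx = np.argsort(np.argsort(x))      # ranks of x
    ry = np.argsort(np.argsort(y))      # ranks of y
    d = rx - ry
    n = len(x)
    return float(1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1)))

def mae(y_true, y_pred):
    """Mean absolute error between predicted and actual values."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))
```

For data with ties, a rank-averaging implementation such as `scipy.stats.spearmanr` would be the more robust choice.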
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Wang, Y.; Afzal, N.; Fu, S.; Wang, L.; Shen, F.; Rastegar-Mojarad, M.; Liu, H. MedSTS: A Resource for Clinical Semantic Textual Similarity. Lang Resour. Eval. 2020, 54, 57–72. [Google Scholar] [CrossRef]
- Kang, Y.; Cai, Z.; Tan, C.-W.; Huang, Q.; Liu, H. Natural Language Processing (NLP) in Management Research: A Literature Review. J. Manag. Anal. 2020, 7, 139–172. [Google Scholar] [CrossRef]
- Wang, L.; Yoon, K.-J. Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3048–3068. [Google Scholar] [CrossRef] [PubMed]
- Yao, T.; Zhai, Z.; Gao, B. Text Classification Model Based on fastText. In Proceedings of the 2020 IEEE International Conference on Artificial Intelligence and Information Systems (ICAIIS), Dalian, China, 20–22 March 2020; pp. 154–157. [Google Scholar]
- Bekamiri, H.; Hain, D.S.; Jurowetzki, R. PatentSBERTa: A Deep NLP Based Hybrid Model for Patent Distance and Classification Using Augmented SBERT. Technol. Forecast. Soc. Chang. 2024, 206, 123536. [Google Scholar] [CrossRef]
- Mokoatle, M.; Marivate, V.; Mapiye, D.; Bornman, R.; Hayes, V.M. A Review and Comparative Study of Cancer Detection Using Machine Learning: SBERT and SimCSE Application. BMC Bioinform. 2023, 24, 112. [Google Scholar] [CrossRef]
- Alammary, A.S. Arabic Questions Classification Using Modified TF-IDF. IEEE Access 2021, 9, 95109–95122. [Google Scholar] [CrossRef]
- Kirişci, M. New Cosine Similarity and Distance Measures for Fermatean Fuzzy Sets and TOPSIS Approach. Knowl. Inf. Syst. 2023, 65, 855–868. [Google Scholar] [CrossRef]
- Li, Z.; Yang, Y.; Li, L.; Wang, D. A Weighted Pearson Correlation Coefficient Based Multi-Fault Comprehensive Diagnosis for Battery Circuits. J. Energy Storage 2023, 60, 106584. [Google Scholar] [CrossRef]
- Chatterjee, S. A New Coefficient of Correlation. J. Am. Stat. Assoc. 2021, 116, 2009–2022. [Google Scholar] [CrossRef]
- Maas, A.L.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; Lin, D., Matsumoto, Y., Mihalcea, R., Eds.; Association for Computational Linguistics: Portland, OR, USA, 2011; pp. 142–150. [Google Scholar]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
- Maiese, A.; Santoro, P.; La Russa, R.; De Matteis, A.; Turillazzi, E.; Frati, P.; Fineschi, V. Crossbow Injuries: A Case Report with Experimental Reconstruction Study and a Systematic Review of Literature. J. Forensic Leg. Med. 2021, 79, 102147. [Google Scholar] [CrossRef] [PubMed]
- Du, X.; Yan, J.; Zhang, R.; Zha, H. Cross-Network Skip-Gram Embedding for Joint Network Alignment and Link Prediction. IEEE Trans. Knowl. Data Eng. 2022, 34, 1080–1095. [Google Scholar] [CrossRef]
- Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Moschitti, A., Pang, B., Daelemans, W., Eds.; Association for Computational Linguistics: Doha, Qatar, 2014; pp. 1532–1543. [Google Scholar]
- Joulin, A.; Grave, E.; Bojanowski, P.; Douze, M.; Jégou, H.; Mikolov, T. FastText.Zip: Compressing Text Classification Models. arXiv 2016, arXiv:1612.03651. [Google Scholar]
- Sarzynska-Wawer, J.; Wawer, A.; Pawlak, A.; Szymanowska, J.; Stefaniak, I.; Jarkiewicz, M.; Okruszek, L. Detecting Formal Thought Disorder by Deep Contextualized Word Representations. Psychiatry Res. 2021, 304, 114135. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers); Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Suwanda, R.; Syahputra, Z.; Zamzami, E.M. Analysis of Euclidean Distance and Manhattan Distance in the K-Means Algorithm for Variations Number of Centroid K. J. Phys. Conf. Ser. 2020, 1566, 012058. [Google Scholar] [CrossRef]
- Fernando, B.; Herath, S. Anticipating Human Actions by Correlating Past with the Future with Jaccard Similarity Measures. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Nashville, TN, USA, 2021; pp. 13219–13228. [Google Scholar]
- Chakraborty, S.; Ding, Y.; Allamanis, M.; Ray, B. CODIT: Code Editing with Tree-Based Neural Models. IEEE Trans. Softw. Eng. 2022, 48, 1385–1399. [Google Scholar] [CrossRef]
- Chen, W.; Li, Y.; Tsangaratos, P.; Shahabi, H.; Ilia, I.; Xue, W.; Bian, H. Groundwater Spring Potential Mapping Using Artificial Intelligence Approach Based on Kernel Logistic Regression, Random Forest, and Alternating Decision Tree Models. Appl. Sci. 2020, 10, 425. [Google Scholar] [CrossRef]
- Huang, G.; Guo, C.; Kusner, M.J.; Sun, Y.; Weinberger, K.Q.; Sha, F. Supervised Word Mover’s Distance. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 4869–4877. [Google Scholar]
- Arora, S.; Liang, Y.; Ma, T. A Simple But Tough-To-Beat Baseline for Sentence Embeddings. 6 February 2017. Available online: https://dblp.org/rec/conf/iclr/AroraLM17.html (accessed on 11 October 2024).
- Bu, Y.; Zou, S.; Liang, Y.; Veeravalli, V.V. Estimation of KL Divergence: Optimal Minimax Rate. arXiv 2018, arXiv:1607.02653. [Google Scholar] [CrossRef]
- Namoun, A.; Alshanqiti, A. Predicting Student Performance Using Data Mining and Learning Analytics Techniques: A Systematic Literature Review. Appl. Sci. 2021, 11, 237. [Google Scholar] [CrossRef]
- Deng, J.; Li, W.; Chen, Y.; Duan, L. Unbiased Mean Teacher for Cross-Domain Object Detection. arXiv 2021, arXiv:2003.00707. [Google Scholar]
- Agirre, E.; Banea, C.; Cer, D.; Diab, M.; Gonzalez-Agirre, A.; Mihalcea, R.; Rigau, G.; Wiebe, J. SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, CA, USA, 16–17 June 2016; Bethard, S., Carpuat, M., Cer, D., Jurgens, D., Nakov, P., Zesch, T., Eds.; Association for Computational Linguistics: San Diego, CA, USA, 2016; pp. 497–511. [Google Scholar]
| Sentence Pair | Before Optimization | After Optimization |
| --- | --- | --- |
| A man puts three pieces of meat into a pan. / A man is putting meat in a pan. | 0.13 | 0.72 |
| Yes, there is a rule against this. / There’s no rule against it. | 0.74 | 0.20 |
| Sentence Pair | Similarity | Baseline Evaluation | MyModel Evaluation |
| --- | --- | --- | --- |
| 1. Suicide attack kills eight in Baghdad 2. Suicide attacks kill 24 people in Baghdad | 2.40 | 2.12 | 2.41 |
| 1. Ukraine to implement unilateral ceasefire 2. Ukraine offers unilateral ceasefire | 4.80 | 4.56 | 4.78 |
| 1. Beaten Florida teen released in Israel 2. Palestinian teen dies of wounds sustained in Israeli shooting | 0.4 | 0.33 | 0.39 |
| 1. Southwest jet hit nose first 2. Southwest Jet’s Nose Gear Landed First | 3.6 | 3.45 | 3.56 |
| Embedding | Precision | Recall | F1 Score |
| --- | --- | --- | --- |
| Word2Vec | 0.66 | 0.02 | 0.04 |
| GloVe | 0.73 | 0.77 | 0.75 |
| FastText | 0.51 | 0.85 | 0.64 |
| BERT | 0.71 | 0.82 | 0.76 |
| SBERT | 0.67 | 0.54 | 0.60 |
| SimCSE | 0.65 | 0.81 | 0.72 |
| MyModel | 0.75 | 0.88 | 0.81 |
| Embedding | τ | ρ | MAE |
| --- | --- | --- | --- |
| Word2Vec | 0.044 | 0.379 | 2.313 |
| GloVe | 0.368 | 0.356 | 2.256 |
| FastText | 0.064 | 0.472 | 2.447 |
| BERT | 0.451 | 0.457 | 2.181 |
| SBERT | 0.462 | 0.482 | 2.093 |
| SimCSE | 0.458 | 0.490 | 2.104 |
| MyModel | 0.470 | 0.481 | 2.100 |
| Embedding | τ | ρ | MAE |
| --- | --- | --- | --- |
| Word2Vec | 0.045 | 0.387 | 2.315 |
| GloVe | 0.473 | 0.460 | 2.046 |
| FastText | 0.085 | 0.364 | 2.340 |
| BERT | 0.486 | 0.492 | 2.057 |
| SBERT | 0.516 | 0.511 | 1.672 |
| SimCSE | 0.507 | 0.503 | 1.947 |
| MyModel | 0.528 | 0.518 | 1.343 |
| Embedding | τ | ρ | MAE |
| --- | --- | --- | --- |
| Word2Vec | 0.444 | 0.432 | 1.380 |
| GloVe | 0.436 | 0.421 | 1.423 |
| FastText | 0.474 | 0.467 | 1.376 |
| BERT | 0.451 | 0.457 | 2.081 |
| SBERT | 0.511 | 0.501 | 1.390 |
| SimCSE | 0.525 | 0.514 | 1.434 |
| MyModel | 0.530 | 0.518 | 1.320 |
Share and Cite
Han, J.; Yang, L. Sentence Embedding Generation Framework Based on Kullback–Leibler Divergence Optimization and RoBERTa Knowledge Distillation. Mathematics 2024, 12, 3990. https://doi.org/10.3390/math12243990