Search Results (596)

Search Parameters:
Keywords = text vectorization

18 pages, 1651 KiB  
Article
Sentiment Analysis of Product Reviews Using Machine Learning and Pre-Trained LLM
by Pawanjit Singh Ghatora, Seyed Ebrahim Hosseini, Shahbaz Pervez, Muhammad Javed Iqbal and Nabil Shaukat
Big Data Cogn. Comput. 2024, 8(12), 199; https://doi.org/10.3390/bdcc8120199 - 23 Dec 2024
Viewed by 502
Abstract
Sentiment analysis via artificial intelligence, i.e., machine learning and large language models (LLMs), is a pivotal tool that classifies sentiments within texts as positive, negative, or neutral. It enables computers to automatically detect and interpret emotions from textual data, covering a spectrum of feelings without direct human intervention. Sentiment analysis is integral to marketing research, helping to gauge consumer emotions and opinions across various sectors. Its applications span analyzing movie reviews, monitoring social media, evaluating product feedback, assessing employee sentiments, and identifying hate speech. This study explores the application of both traditional machine learning and pre-trained LLMs for automated sentiment analysis of customer product reviews. The motivation behind this work lies in the demand for a more nuanced understanding of consumer sentiments that can drive data-informed business decisions. In this research, we applied machine learning-based classifiers, i.e., Random Forest, Naive Bayes, and Support Vector Machine, alongside the GPT-4 model to benchmark their effectiveness for sentiment analysis. The traditional models showed better results and efficiency in processing short, concise text, with SVM performing best at classifying the sentiment of short comments. However, GPT-4 showed better results on more detailed texts, capturing subtle sentiments with higher precision, recall, and F1 scores and uniquely identifying mixed sentiments that the simpler models missed. Conclusively, this study shows that LLMs outperform traditional models in context-rich sentiment analysis by providing not only accurate sentiment classification but also insightful explanations. These results establish LLMs as a superior tool for customer-centric businesses, helping actionable insights to be derived from any textual data.
Figure 1: High level strategy.
Figure 2: Proposed methodology.
Figure 3: Dataset record counts split by sentiment.
Figure 4: Dataset record counts split by sentiment without null records.
Figure 5: Dataset record counts after addressing bias.
Figure 6: LLM classification on Summary feature.
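The traditional side of the benchmark above (bag-of-words features with a classical classifier such as Naive Bayes) can be sketched in a few lines. The toy reviews and Laplace smoothing below are illustrative choices, not the study's dataset or exact setup:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes sentiment classifier.

    docs: list of (text, label) pairs. Returns, per class, the log prior,
    Laplace-smoothed log likelihoods, and an unknown-word log likelihood.
    """
    class_words = defaultdict(list)
    for text, label in docs:
        class_words[label].extend(text.lower().split())
    vocab = {w for words in class_words.values() for w in words}
    n_docs = len(docs)
    model = {}
    for label, words in class_words.items():
        counts, total = Counter(words), len(words)
        prior = math.log(sum(1 for _, l in docs if l == label) / n_docs)
        likelihood = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                      for w in vocab}
        unk = math.log(1 / (total + len(vocab)))
        model[label] = (prior, likelihood, unk)
    return model

def classify(model, text):
    """Pick the class with the highest posterior log probability."""
    scores = {}
    for label, (prior, likelihood, unk) in model.items():
        scores[label] = prior + sum(likelihood.get(w, unk)
                                    for w in text.lower().split())
    return max(scores, key=scores.get)

reviews = [
    ("great product works perfectly", "positive"),
    ("love it excellent quality", "positive"),
    ("terrible broke after one day", "negative"),
    ("awful waste of money", "negative"),
]
model = train_nb(reviews)
print(classify(model, "excellent quality great value"))  # positive
```

In practice the study's classifiers would be trained on vectorized review text at a much larger scale; this sketch only shows the shape of the classical pipeline the LLM is compared against.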
15 pages, 12297 KiB  
Article
Enhancing Accessibility: Automated Tactile Graphics Generation for Individuals with Visual Impairments
by Yehor Dzhurynskyi, Volodymyr Mayik and Lyudmyla Mayik
Computation 2024, 12(12), 251; https://doi.org/10.3390/computation12120251 - 23 Dec 2024
Viewed by 288
Abstract
This study addresses the accessibility challenges faced by individuals with visual impairments due to limited access to graphic information, which significantly impacts their educational and social integration. Traditional methods for producing tactile graphics are labor-intensive and require specialized expertise, limiting their availability. Recent advancements in generative models, such as GANs, diffusion models, and VAEs, offer potential solutions to automate the creation of tactile images. In this work, we propose a novel generative model conditioned on text prompts, integrating a Bidirectional and Auto-Regressive Transformer (BART) and Vector Quantized Variational Auto-Encoder (VQ-VAE). This model transforms textual descriptions into tactile graphics, addressing key requirements for legibility and accessibility. The model’s performance was evaluated using cross-entropy, perplexity, mean square error, and CLIP Score metrics, demonstrating its ability to generate high-quality, customizable tactile images. Testing with educational and rehabilitation institutions confirmed the practicality and efficiency of the system, which significantly reduces production time and requires minimal operator expertise. The proposed approach enhances the production of inclusive educational materials, enabling improved access to quality education and fostering greater independence for individuals with visual impairments. Future research will focus on expanding the training dataset and refining the model for complex scenarios.
(This article belongs to the Special Issue Artificial Intelligence Applications in Public Health)
Figure 1: Structural and functional diagram of the text-conditioned tactile graphics generative model.
Figure 2: Samples of generated images, each determined by the text prompt provided below the corresponding image (in practice, text prompts were provided in Ukrainian and were translated for convenience).
Figure 3: Reproduction of image samples synthesized using the developed software on specialized heat-sensitive capsule paper (scan).
16 pages, 2833 KiB  
Article
MGKGR: Multimodal Semantic Fusion for Geographic Knowledge Graph Representation
by Jianqiang Zhang, Renyao Chen, Shengwen Li, Tailong Li and Hong Yao
Algorithms 2024, 17(12), 593; https://doi.org/10.3390/a17120593 - 23 Dec 2024
Viewed by 313
Abstract
Geographic knowledge graph representation learning embeds entities and relationships in geographic knowledge graphs into a low-dimensional continuous vector space, which serves as a basic method that bridges geographic knowledge graphs and geographic applications. Previous geographic knowledge graph representation methods primarily learn the vectors of entities and their relationships from their spatial attributes and relationships, which ignores various semantics of entities, resulting in poor embeddings on geographic knowledge graphs. This study proposes a two-stage multimodal geographic knowledge graph representation (MGKGR) model that integrates multiple kinds of semantics to improve the embedding learning of geographic knowledge graph representation. Specifically, in the first stage, a spatial feature fusion method for modality enhancement is proposed to combine the structural features of geographic knowledge graphs with two modal semantic features. In the second stage, a multi-level modality feature fusion method is proposed to integrate heterogeneous features from different modalities. By fusing the semantics of text and images, the performance of geographic knowledge graph representation is improved, providing accurate representations for downstream geographic intelligence tasks. Extensive experiments on two datasets show that the proposed MGKGR model outperforms the baselines. Moreover, the results demonstrate that integrating textual and image data into geographic knowledge graphs can effectively enhance the model’s performance.
(This article belongs to the Section Algorithms for Multidisciplinary Applications)
Figure 1: Multimodal data in the geographic knowledge graph provides semantic information for geographic attribute prediction.
Figure 2: The framework of the proposed MGKGR. (A) The Multimodal GeoKG Encoding module processes the multimodal data of the multimodal GeoKG for effective encoding. (B) The Two-Stage Multimodal Feature Fusion module integrates features from multiple modalities to generate the multimodal features of the multimodal GeoKG.
Figure 3: Model performance on attribute relations, adjacency relations, and mixed relations.
21 pages, 1728 KiB  
Article
Sentence Embedding Generation Framework Based on Kullback–Leibler Divergence Optimization and RoBERTa Knowledge Distillation
by Jin Han and Liang Yang
Mathematics 2024, 12(24), 3990; https://doi.org/10.3390/math12243990 - 18 Dec 2024
Viewed by 535
Abstract
In natural language processing (NLP) tasks, computing semantic textual similarity (STS) is crucial for capturing nuanced semantic differences in text. Traditional word vector methods, such as Word2Vec and GloVe, as well as deep learning models like BERT, face limitations in handling context dependency and polysemy and present challenges in computational resources and real-time processing. To address these issues, this paper introduces two novel methods. First, a sentence embedding generation method based on Kullback–Leibler Divergence (KLD) optimization is proposed, which enhances semantic differentiation between sentence vectors, thereby improving the accuracy of textual similarity computation. Second, this study proposes a framework incorporating RoBERTa knowledge distillation, which integrates the deep semantic insights of the RoBERTa model with prior methodologies to enhance sentence embeddings while preserving computational efficiency. Additionally, the study extends its contributions to sentiment analysis tasks by leveraging the enhanced embeddings for classification. The sentiment analysis experiments, conducted using a Stochastic Gradient Descent (SGD) classifier on the ACL IMDB dataset, demonstrate the effectiveness of the proposed methods, achieving high precision, recall, and F1 score metrics. To further augment model accuracy and efficacy, a feature selection approach is introduced, specifically through the Dynamic Principal Component Selection (DPCS) algorithm. The DPCS method autonomously identifies and prioritizes critical features, thus enriching the expressive capacity of sentence vectors and significantly advancing the accuracy of similarity computations. Experimental results demonstrate that our method outperforms existing methods in semantic similarity computation on the SemEval-2016 dataset. When evaluated using cosine similarity of average vectors, our model achieved a Pearson correlation coefficient (τ) of 0.470, a Spearman correlation coefficient (ρ) of 0.481, and a mean absolute error (MAE) of 2.100. Compared to traditional methods such as Word2Vec, GloVe, and FastText, our method significantly enhances similarity computation accuracy. Using TF-IDF-weighted cosine similarity evaluation, our model achieved a τ of 0.528, ρ of 0.518, and an MAE of 1.343. Additionally, in the cosine similarity assessment leveraging the Dynamic Principal Component Smoothing (DPCS) algorithm, our model achieved a τ of 0.530, ρ of 0.518, and an MAE of 1.320, further demonstrating the method’s effectiveness and precision in handling semantic similarity. These results indicate that our proposed method has high relevance and low error in semantic textual similarity tasks, thereby better capturing subtle semantic differences between texts.
Figure 1: Construction of word vector space and text similarity calculation.
Figure 2: The similarity between two relevant sentences: (a) before KLD optimization; (b) after KLD optimization.
Figure 3: The similarity between two irrelevant sentences: (a) before KLD optimization; (b) after KLD optimization.
Figure 4: Knowledge distillation model.
Figure 5: Comprehensive analysis of loss and metrics over training rounds.
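The evaluation protocol named above, cosine similarity between averaged word vectors, reduces to a few lines. The 3-dimensional toy embeddings below are invented for illustration, not taken from the paper:

```python
import math

def average_vector(vectors):
    """Element-wise mean of the word vectors in a sentence."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 3-d word embeddings (illustrative values only).
emb = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
s1 = average_vector([emb["cat"], emb["dog"]])  # "cat dog"
s2 = average_vector([emb["dog"], emb["car"]])  # "dog car"
print(cosine(s1, s2))
```

The paper's contribution sits upstream of this step: the KLD optimization and RoBERTa distillation change which vectors get averaged, while the similarity computation itself stays this simple.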
18 pages, 7233 KiB  
Article
An Enhanced Retrieval Scheme for a Large Language Model with a Joint Strategy of Probabilistic Relevance and Semantic Association in the Vertical Domain
by Qi Chen, Weifeng Zhou, Jian Cheng and Ji Yang
Appl. Sci. 2024, 14(24), 11529; https://doi.org/10.3390/app142411529 - 11 Dec 2024
Viewed by 537
Abstract
Large language model (LLM) processing, with natural language as its core, carries out information retrieval through intelligent Q&A. It has a wide range of application scenarios and is commonly considered a kind of generative AI. However, when LLMs handle generation tasks, the results generated by foundation LLMs with insufficient overall performance, specifically in vertical domains, are often inaccurate due to poor generalization ability, resulting in the so-called “hallucination” phenomenon. To solve these problems, in this study, an enhanced retrieval scheme for LLM processing, named the BM-RAGAM (BM25 retrieval-augmented generation attention mechanism), was developed by constructing a vectorized knowledge base and utilizing a hybrid joint retrieval strategy that combines keyword matching through search with semantic-enhanced association via an attention mechanism, taking ocean-front- and eddy-related knowledge in oceanography as an example. This scheme realized accurate word-based matching with the BM25 algorithm and text generation through semantic-enhanced association using RAG, and it was used to construct a vector database of text knowledge on ocean fronts and eddies. The output of the proposed scheme was compared and analyzed against the foundation LLM Qwen2-72B, and an ablation experiment was conducted. The results show that the proposed scheme greatly reduced hallucination generation in the process of text generation, making its outputs more interpretable.
(This article belongs to the Section Computing and Artificial Intelligence)
Figure 1: The development trend of LLMs.
Figure 2: BM-RAGAM retrieval scheme.
Figure 3: RAG enhancement process.
Figure 4: The retrieval process of BM25.
Figure 5: Cloud deployment of the BM-RAGAM for the knowledge retrieval and generation system.
Figure A1: Comparison of the Qwen2-72B basic model and the BM-RAGAM for the same problem, presented in Chinese and English (the blue box is the Q&A based on Qwen2-72B, and the orange box is the Q&A based on the BM-RAGAM).
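The keyword-matching half of the hybrid strategy uses the standard Okapi BM25 formula; a minimal self-contained scorer is sketched below, with a toy oceanography corpus standing in for the paper's knowledge base (the k1 and b defaults are the conventional ones, not values from the paper):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenized if term in d)
            if df == 0:
                continue
            # Smoothed inverse document frequency.
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            # Saturating term frequency with length normalization.
            numer = tf[term] * (k1 + 1)
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * numer / denom
        scores.append(score)
    return scores

docs = [
    "ocean fronts separate water masses with distinct temperature",
    "mesoscale eddies transport heat across the ocean",
    "large language models generate text from prompts",
]
scores = bm25_scores("ocean eddies", docs)
print(scores.index(max(scores)))  # document 1 matches both query terms
```

In the BM-RAGAM, these lexical scores would be fused with semantic retrieval over the vector database rather than used alone.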
18 pages, 343 KiB  
Article
Comparative Investigation of Traditional Machine-Learning Models and Transformer Models for Phishing Email Detection
by René Meléndez, Michal Ptaszynski and Fumito Masui
Electronics 2024, 13(24), 4877; https://doi.org/10.3390/electronics13244877 - 11 Dec 2024
Viewed by 645
Abstract
Phishing emails pose a significant threat to cybersecurity worldwide. Tools already exist that mitigate the impact of these emails by filtering them, but these tools are only as reliable as their ability to detect new formats and techniques for creating phishing emails. In this paper, we investigated how traditional models and transformer models perform on the classification task of identifying whether an email is phishing or not. We found that transformer models, in particular DistilBERT, BERT, and RoBERTa, had significantly higher performance compared to traditional models like Logistic Regression, Random Forest, Support Vector Machine, and Naive Bayes. The process consisted of using a large and robust dataset of emails and applying preprocessing and optimization techniques to maximize performance. RoBERTa showed an outstanding capacity to identify phishing emails, achieving a maximum accuracy of 0.9943. Even though they were still successful, traditional models performed marginally worse; SVM performed best among them, with an accuracy of 0.9876. The results emphasize the value of sophisticated text-processing methods and the potential of transformer models to improve email security by thwarting phishing attempts.
(This article belongs to the Special Issue Advanced Natural Language Processing Technology and Applications)
Figure 1: Distribution of the dataset; 57.0% of the emails are flagged as phishing.
Figure 2: Distribution of emails with their sources.
Figure 3: ROC curve for traditional models.
Figure 4: ROC curve for transformer models.
Figure 5: Number of misclassifications per model on phishing email classification.
19 pages, 528 KiB  
Article
Enhancing Word Embeddings for Improved Semantic Alignment
by Julian Szymański, Maksymilian Operlejn and Paweł Weichbroth
Appl. Sci. 2024, 14(24), 11519; https://doi.org/10.3390/app142411519 - 10 Dec 2024
Viewed by 539
Abstract
This study introduces a method for improving word vectors that addresses the limitations of traditional approaches like Word2Vec or GloVe by introducing richer semantic properties into the embeddings. Our approach leverages supervised learning methods, with shifts of vectors in the representation space enhancing the quality of word embeddings. This ensures better alignment with semantic reference resources, such as WordNet. The effectiveness of the method has been demonstrated through the application of the modified embeddings to text classification and clustering. We also show how our method influences document class distributions, visualized through PCA projections. By comparing our results with state-of-the-art approaches and achieving better accuracy, we confirm the effectiveness of the proposed method. The results underscore the potential of adaptive embeddings to improve both the accuracy and efficiency of semantic analysis across a range of NLP tasks.
Figure 1: Classification results for the nearest neighbor method.
Figure 2: Classification results for the random forest classifier with mean sentence embedding.
Figure 3: PCA visualization of original embeddings.
Figure 4: PCA visualization of neural embeddings.
Figure 5: PCA visualization of fine-tuned embeddings.
Figure 6: PCA visualization of geometrical embeddings.
Figure 7: Results for usage of mean category distance.
Figure 8: Results for mean category distance by category.
Figure 9: Results for category density.
Figure 10: Results for category density by category.
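The abstract describes supervised shifts of vectors in the representation space toward semantic reference resources such as WordNet. As a rough illustration only (the paper's actual training objective is not given in the abstract), one can nudge each word vector a fraction of the way toward the centroid of its synonym set; the 2-d vectors, synonym groups, and step size below are all invented:

```python
def shift_toward_synonyms(emb, synsets, alpha=0.3):
    """Move each word vector a fraction alpha toward the centroid of its
    synonym group -- a simplified stand-in for supervised embedding shifts."""
    shifted = {}
    for word, vec in emb.items():
        group = [emb[w] for w in synsets.get(word, [word]) if w in emb]
        centroid = [sum(v[i] for v in group) / len(group)
                    for i in range(len(vec))]
        shifted[word] = [(1 - alpha) * x + alpha * c
                         for x, c in zip(vec, centroid)]
    return shifted

# Toy 2-d embeddings and a WordNet-style synonym group (illustrative).
emb = {"happy": [1.0, 0.0], "glad": [0.6, 0.4], "sad": [-1.0, 0.0]}
synsets = {"happy": ["happy", "glad"], "glad": ["happy", "glad"]}
new = shift_toward_synonyms(emb, synsets)
```

After the shift, synonyms sit closer together in the space while words without a synonym group stay put, which is the qualitative effect the method aims for.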
15 pages, 1457 KiB  
Article
A Chinese Short Text Similarity Method Integrating Sentence-Level and Phrase-Level Semantics
by Zhenji Shen and Zhiyong Xiao
Electronics 2024, 13(24), 4868; https://doi.org/10.3390/electronics13244868 - 10 Dec 2024
Viewed by 412
Abstract
Short text similarity, as a pivotal research domain within Natural Language Processing (NLP), has been extensively utilized in intelligent search, recommendation systems, and question-answering systems. Most existing short-text similarity models focus on aligning the overall semantic content of an entire sentence, often ignoring the semantic associations between individual phrases in the sentence. This is particularly problematic in the Chinese context, where synonyms and near-synonyms can seriously interfere with the computation of text similarity. To overcome these limitations, a novel short text similarity computation method integrating both sentence-level and phrase-level semantics was proposed. By harnessing vector representations of Chinese words/phrases as external knowledge, this approach amalgamates global sentence characteristics with local phrase features to compute short text similarity from diverse perspectives, spanning from the global to the local level. Experimental results demonstrate that the proposed model outperforms previous methods on the Chinese short text similarity task. Specifically, the model achieves an accuracy of 90.16% on LCQMC, which is 2.23% and 1.46% better than ERNIE and Glyce + BERT, respectively.
Figure 1: Overall architecture diagram.
Figure 2: External knowledge processing.
Figure 3: Feature fusion and similarity computation diagram.
22 pages, 487 KiB  
Article
From Fact Drafts to Operational Systems: Semantic Search in Legal Decisions Using Fact Drafts
by Gergely Márk Csányi, Dorina Lakatos, István Üveges, Andrea Megyeri, János Pál Vadász, Dániel Nagy and Renátó Vági
Big Data Cogn. Comput. 2024, 8(12), 185; https://doi.org/10.3390/bdcc8120185 - 10 Dec 2024
Viewed by 570
Abstract
This research paper presents findings from an investigation into the semantic similarity search task within the legal domain, using a corpus of 1172 Hungarian court decisions. The study establishes the groundwork for an operational semantic similarity search system designed to identify cases with comparable facts using preliminary legal fact drafts. Evaluating such systems often poses significant challenges, given the need for thorough document checks, which can be costly and limit evaluation reusability. To address this, the study employs manually created fact drafts for legal cases, enabling reliable ranking of original cases within retrieved documents and quantitative comparison of various vectorization methods. The study compares twelve different text embedding solutions (the most recent of which became available just a few weeks before the manuscript was written), identifying Cohere’s embed-multilingual-v3.0, Beijing Academy of Artificial Intelligence’s bge-m3, Jina AI’s jina-embeddings-v3, OpenAI’s text-embedding-3-large, and Microsoft’s multilingual-e5-large models as top performers. To overcome the transformer-based models’ context window limitation, we investigated chunking, striding, and last chunk scaling techniques, with last chunk scaling significantly improving embedding quality. The results suggest that the effectiveness of striding varies with token count. Notably, employing striding with 16 tokens yielded optimal results, representing 3.125% of the context window size for the best-performing models. The results also suggested that, among the models with an 8192-token context window, the bge-m3 model is superior to the jina-embeddings-v3 and text-embedding-3-large models at capturing the relevant parts of a document when the text contains a significant amount of noise. The validity of the approach was evaluated and confirmed by legal experts. These insights led to an operational semantic search system for a prominent legal content provider.
Figure 1: Effect of last chunk scaling. Chunk 2 is the last chunk, having only 128 tokens in a 512-token-wide context window. Notice how much the direction of the average (document) vector alters when LCS is applied (from the light green to the green vector). This mitigates overweighting of the shorter sequence.
Figure 2: Architecture diagram of vectorization.
Figure 3: Average MRR improvement of the different chunking strategies compared to the truncated vectors (multiplied by 100) on Facts.
Figure 4: 100-R@n values for the best 5 approaches.
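One plausible reading of last chunk scaling, consistent with the 128-token final chunk described in the entry's Figure 1 caption, is a token-count-weighted average of chunk embeddings: the short final chunk is down-weighted by its length instead of counting as a full window. This is a hedged sketch of that interpretation, not the paper's exact formula; the 2-d chunk vectors are placeholders for real embeddings:

```python
def lcs_average(chunk_vectors, chunk_lengths):
    """Token-count-weighted average of chunk embeddings.

    Full chunks carry the full window length; the final, shorter chunk
    contributes in proportion to its actual token count, so it no longer
    dominates the document vector the way a plain mean would let it.
    """
    total = sum(chunk_lengths)
    dim = len(chunk_vectors[0])
    return [sum(n * v[i] for n, v in zip(chunk_lengths, chunk_vectors)) / total
            for i in range(dim)]

# Two chunks: a full 512-token chunk and a 128-token tail (toy vectors).
full_chunk = [1.0, 0.0]
tail_chunk = [0.0, 1.0]
doc_vec = lcs_average([full_chunk, tail_chunk], [512, 128])
print(doc_vec)  # tail contributes 128/640 = 0.2 of the document vector
```

A plain mean would weight the tail at 0.5; the scaled average keeps the document vector pointed toward the content that actually fills most of the tokens.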
17 pages, 460 KiB  
Article
ML-Based Pain Recognition Model Using Mixup Data Augmentation
by Raghu M. Shantharam and Friedhelm Schwenker
Appl. Syst. Innov. 2024, 7(6), 124; https://doi.org/10.3390/asi7060124 - 9 Dec 2024
Viewed by 508
Abstract
Machine learning (ML) has revolutionized healthcare by enhancing diagnostic capabilities because of its ability to analyze large datasets and detect minor patterns often overlooked by humans. This is beneficial, especially in pain recognition, where patient communication may be limited. However, ML models often face challenges such as memorization and sensitivity to adversarial examples. Regularization techniques like mixup, which trains models on convex combinations of data pairs, address these issues by enhancing model generalization. While mixup has proven effective in image, speech, and text datasets, its application to time-series signals like electrodermal activity (EDA) is less explored. This research uses ML for pain recognition with EDA signals from the BioVid Heat Pain Database to distinguish pain by applying mixup regularization to manually extracted EDA features and using a support vector machine (SVM) for classification. The results show that this approach achieves an average accuracy of 75.87% using leave-one-subject-out cross-validation (LOSOCV) compared to 74.61% without mixup. This demonstrates mixup’s efficacy in improving ML model accuracy for pain recognition from EDA signals. This study highlights the potential of mixup in ML as a promising approach to enhance pain assessment in healthcare.
Figure 1: Experimental setting with heat stimulation and recording of biopotentials and video.
Figure 2: Elicited pain levels.
Figure 3: Heat stimulus and pauses between stimuli.
Figure 4: Data preprocessing: signal segmentation. The classification experiments were performed on windows of a 4.5 s length, with a temporal shift of 4 s from the onset of elicitation.
Figure 5: EDA processed signal.
Figure 6: Machine learning model for pain recognition.
Figure 7: ROC curve of an ML pain recognition model with and without mixup.
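Mixup, as described above, trains on convex combinations of data pairs; since the study applies it to manually extracted EDA features, the augmentation operates on feature vectors rather than raw signals. A minimal sketch follows, where the toy feature values and the Beta parameter α are illustrative, not the study's settings:

```python
import random

def mixup(features, labels, alpha=0.2, rng=None):
    """Return one mixup example: a convex combination of two random
    training examples, with labels mixed by the same coefficient."""
    rng = rng or random.Random(0)
    i = rng.randrange(len(features))
    j = rng.randrange(len(features))
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in (0, 1)
    x = [lam * a + (1 - lam) * b for a, b in zip(features[i], features[j])]
    y = lam * labels[i] + (1 - lam) * labels[j]
    return x, y

# Toy EDA feature vectors: [mean amplitude, peak count] (illustrative).
X = [[0.2, 3.0], [0.8, 7.0]]
y = [0.0, 1.0]  # no pain / pain
x_mix, y_mix = mixup(X, y)
```

The soft target `y_mix` is what pushes the downstream classifier (here an SVM) toward smoother decision boundaries between real examples.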
23 pages, 4893 KiB  
Article
Enhancing Software Effort Estimation with Pre-Trained Word Embeddings: A Small-Dataset Solution for Accurate Story Point Prediction
by Issa Atoum and Ahmed Ali Otoom
Electronics 2024, 13(23), 4843; https://doi.org/10.3390/electronics13234843 - 8 Dec 2024
Viewed by 780
Abstract
Traditional software effort estimation methods, such as term frequency–inverse document frequency (TF-IDF), are widely used due to their simplicity and interpretability. However, they struggle with limited datasets, fail to capture intricate semantics, and suffer from high dimensionality, sparsity, and computational inefficiency. This study used pre-trained word embeddings, including FastText and GPT-2, to improve estimation accuracy in such cases. Seven pre-trained models were evaluated for their ability to effectively represent textual data, addressing the fundamental limitations of TF-IDF through contextualized embeddings. The results show that combining FastText embeddings with support vector machines (SVMs) consistently outperforms traditional approaches, reducing the mean absolute error (MAE) by 5–18% while achieving accuracy comparable to deep learning models like GPT-2. This approach demonstrated the adaptability of pre-trained embeddings for small datasets, balancing semantic richness with computational efficiency. The proposed method optimized project planning and resource allocation and enhanced software development through accurate story point prediction, while safeguarding privacy and security through data anonymization. Future research will explore task-specific embeddings tailored to software engineering domains and investigate how dataset characteristics, such as cultural variations, influence model performance, ensuring the development of adaptable, robust, and secure machine learning models for diverse contexts.
Show Figures

Figure 1: Proposed methodology.
Figure 2: Average MAE for pre-trained models across vector lengths (50–700). Lower values indicate better model performance, with FastText and SBERT achieving the lowest MAE.
Figure 3: Average RMSE for pre-trained models across vector lengths. Lower values indicate better model performance, with FastText and SBERT achieving the lowest RMSE.
Figure 4: Average MMRE for pre-trained models across vector lengths (50–700). Lower values indicate better model performance, with FastText and SBERT achieving the lowest MMRE.
Figure 5: Average PRED(25) scores for pre-trained models. SBERT and USE achieved the highest percentage of predictions.
Figure 6: Percentage improvement in MAE for pre-trained models compared to TF-IDF. FastText showed the highest improvement, particularly at a vector length of 700.
Figure 7: Percentage improvement in MMRE for pre-trained models compared to TF-IDF. FastText consistently outperformed TF-IDF, demonstrating significant improvements.
Figure 8: Percentage improvement in RMSE for pre-trained models compared to TF-IDF. FastText demonstrated the highest improvement, underscoring its stability and reliability.
22 pages, 1599 KiB  
Article
Single-Stage Entity–Relation Joint Extraction of Pesticide Registration Information Based on HT-BES Multi-Dimensional Labeling Strategy
by Chenyang Dong, Shiyu Xi, Yinchao Che, Shufeng Xiong, Xinming Ma, Lei Xi and Shuping Xiong
Algorithms 2024, 17(12), 559; https://doi.org/10.3390/a17120559 - 6 Dec 2024
Viewed by 358
Abstract
Pesticide registration information is an essential part of the pesticide knowledge base. However, the large amount of unstructured text data it contains poses significant challenges for knowledge storage, retrieval, and utilization. To address the characteristics of pesticide registration text, such as high information density, complex logical structures, large spans between entities, and heterogeneous entity lengths, and to overcome the challenges faced by traditional joint extraction methods, including triplet overlap, exposure bias, and redundant computation, we propose a single-stage entity–relation joint extraction model based on HT-BES multi-dimensional labeling (MD-SERel). First, in the encoding layer, to address the complex structural characteristics of pesticide registration texts, we employ RoBERTa combined with a multi-head self-attention mechanism to capture the deep semantic features of the text. Simultaneously, syntactic features are extracted using a syntactic dependency tree and graph neural networks to enhance the model’s understanding of text structure. We then integrate semantic and syntactic features, enriching the character vector representations and thus improving the model’s ability to represent complex textual data. Second, in the multi-dimensional labeling framework layer, we use HT-BES multi-dimensional labeling, in which the model assigns multiple labels to each character. These labels encode entity boundaries, positions, and head–tail entity association information, which naturally resolves overlapping triplets. Using a parallel scoring function and fine-grained classification components, the joint extraction of entities and relations is transformed into a multi-label sequence labeling task over relation dimensions. This process involves no interdependent steps, enabling single-stage parallel labeling, preventing exposure bias, and reducing computational redundancy. Finally, in the decoding layer, entity–relation triplets are decoded from the labels predicted by the fine-grained classification. The experimental results demonstrate that the MD-SERel model performs well on both the Pesticide Registration Dataset (PRD) and the general DuIE dataset. On the PRD, compared to the best baseline model, training is 1.2 times faster, inference is 1.2 times faster, and the F1 score is improved by 1.5%, demonstrating the model’s knowledge extraction capabilities on pesticide registration documents. On the DuIE dataset, MD-SERel also outperformed the baseline, demonstrating strong generalization ability. These findings will provide technical support for the construction of pesticide knowledge bases. Full article
(This article belongs to the Special Issue Algorithms for Feature Selection (3rd Edition))
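The boundary labels in a scheme like HT-BES can be illustrated with a toy decoder. This is a hypothetical simplification, not the authors' exact scheme: "B" marks an entity's first character, "E" its last, "S" a single-character entity, and "O" everything else.

```python
# Toy decoder recovering (start, end) entity spans from B/E/S character tags.
def decode_spans(tags):
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "S":                    # single-character entity
            spans.append((i, i))
        elif tag == "B":                  # entity opens here
            start = i
        elif tag == "E" and start is not None:
            spans.append((start, i))      # entity closes here
            start = None
    return spans

print(decode_spans(["B", "O", "E", "O", "S"]))  # [(0, 2), (4, 4)]
```

The full model additionally predicts head–tail association labels per relation so that overlapping triplets can be recovered in one pass; this sketch covers only the span-boundary dimension.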
Show Figures

Figure 1: MD-SERel model.
Figure 2: Self-attention mechanism architecture diagram.
Figure 3: Syntactic dependency matrix. (a) The result of syntactic analysis of example sentences; (b) the adjacency matrix constructed from (a).
Figure 4: HT-BES interactive annotation strategy.
Figure 5: The type and quantity distribution of entities and relations.
Figure 6: Entity lengths.
Figure 7: The results for different overlapping patterns of triples.
Figure 8: The results for different numbers of self-attention heads.
16 pages, 1602 KiB  
Article
Customer Churn Prediction Approach Based on LLM Embeddings and Logistic Regression
by Meryem Chajia and El Habib Nfaoui
Future Internet 2024, 16(12), 453; https://doi.org/10.3390/fi16120453 - 3 Dec 2024
Viewed by 894
Abstract
Nowadays, predicting customer churn is essential for the success of any company. Loyal customers generate continuous revenue streams, resulting in long-term success and growth. Moreover, companies are increasingly prioritizing the retention of existing customers due to the higher costs associated with attracting new ones. Consequently, there has been a growing demand for advanced methods aimed at enhancing customer loyalty and satisfaction, as well as predicting churners. In this work, we focused on building a robust churn prediction model for the telecommunications industry that combines embeddings from large language models with logistic regression to accurately identify churners. We conducted extensive experiments with a range of embedding techniques, including OpenAI Text-embedding, Google Gemini Text Embedding, bidirectional encoder representations from transformers (BERT), Sentence-Transformers, Sent2vec, and Doc2vec, to extract meaningful features. Additionally, we tested various classifiers, including logistic regression, support vector machine, random forest, K-nearest neighbors, multilayer perceptron, naive Bayes, decision tree, and zero-shot classification, to build a robust model capable of making accurate predictions. The best-performing model in our experiments is the logistic regression classifier trained on features extracted with the OpenAI Text-embedding-ada-002 model, achieving an accuracy of 89%. The proposed model demonstrates high discriminative ability between churning and loyal customers. Full article
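The winning setup, LLM embeddings fed into logistic regression, can be sketched with scikit-learn. In this minimal sketch, random Gaussian vectors stand in for the OpenAI Text-embedding-ada-002 outputs (1536-dimensional in the real API), with churners drawn from a slightly shifted distribution so the two classes are partly separable.

```python
# Synthetic "embeddings" + logistic regression churn classifier (sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n, dim = 400, 32  # illustrative; ada-002 embeddings are 1536-dimensional

X_loyal = rng.normal(0.0, 1.0, size=(n, dim))   # loyal customers
X_churn = rng.normal(0.6, 1.0, size=(n, dim))   # churners, shifted mean
X = np.vstack([X_loyal, X_churn])
y = np.array([0] * n + [1] * n)                 # 1 = churner

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(round(acc, 2))
```

In the real pipeline, each customer's textual profile would be sent to the embeddings API and the resulting vectors used in place of the synthetic ones.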
Show Figures

Figure 1: Zero-Shot Classifier Methodology.
Figure 2: Multilayer perceptron (MLP) architecture.
Figure 3: Churn Predictive Model Building Methodology.
Figure 4: Final Model Deployment.
16 pages, 837 KiB  
Article
LIPT: Improving Prompt Tuning with Late Inception Reparameterization
by Yawen He, Ao Feng, Zhengjie Gao and Xinyu Song
Electronics 2024, 13(23), 4741; https://doi.org/10.3390/electronics13234741 - 29 Nov 2024
Viewed by 582
Abstract
Prompt tuning is a mainstream technique for fine-tuning large language models (LLMs), offering minimal parameter adjustments by learning task-specific prompt vectors. However, it suffers from high training costs due to network-wide backpropagation and weaker performance than methods like adapters and LoRA, likely due to the limited capacity of soft prompts to encode task-specific information. This study introduces Late Inception Prompt Tuning (LIPT), a novel approach to soft prompt learning that enhances performance and efficiency by shortening backpropagation paths and employing a multidimensional bottleneck network with greater capacity. LIPT surpasses existing prompt tuning techniques on various benchmark tasks, delivering a 1.3% gain over LPT and a 5% improvement over standard prompt tuning when applied to RoBERTa-large, while converging more rapidly. It achieves an average accuracy of 90% across ten benchmark datasets. Notably, in certain scenarios, LIPT’s performance approaches that of full-parameter fine-tuning. To evaluate parameter-efficient fine-tuning (PEFT) methods comprehensively, we propose an Efficiency Indicator (EI) that balances accuracy and cost. LIPT is well suited to natural language understanding tasks, such as sentiment analysis and text classification, with potential extensions to larger-scale models and tasks such as text generation. This framework advances the scalability and practicality of fine-tuning methods for diverse applications. Full article
(This article belongs to the Special Issue Emerging Theory and Applications in Natural Language Processing)
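The prompt generator described in the Figure 1 caption, in which the soft prompt passes through bottleneck branches of varying sizes and is added back to the initial prompt before being prepended to frozen token embeddings, can be sketched framework-agnostically in NumPy. All dimensions below are illustrative, not the paper's actual configuration.

```python
# Inception-style soft-prompt reparameterization (conceptual sketch).
import numpy as np

rng = np.random.default_rng(0)
d_model, prompt_len, seq_len = 16, 4, 10

token_emb = rng.normal(size=(seq_len, d_model))              # frozen (from the LLM)
soft_prompt = rng.normal(size=(prompt_len, d_model)) * 0.01  # trainable

def bottleneck(p, w_down, w_up):
    """Down-project, apply ReLU, up-project (one branch)."""
    return np.maximum(p @ w_down, 0.0) @ w_up

widths = [2, 4, 8]  # three branches of varying bottleneck width
branches = [(rng.normal(size=(d_model, w)) * 0.1,
             rng.normal(size=(w, d_model)) * 0.1) for w in widths]

# Sum the branch outputs residually onto the initial prompt.
reparam_prompt = soft_prompt + sum(bottleneck(soft_prompt, wd, wu)
                                   for wd, wu in branches)
x = np.concatenate([reparam_prompt, token_emb], axis=0)  # prepend the prompt
print(x.shape)  # (14, 16)
```

Only the prompt and branch weights would receive gradients during training; the token embeddings and the rest of the backbone stay frozen.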
Show Figures

Figure 1: Overview of the LIPT framework. Yellow represents trainable (tunable) modules; blue represents frozen (non-trainable) modules. Left: a prompt generator with three bottleneck branches. The initialized prompt passes through three bottleneck branches of varying sizes and is then added back to the initial prompt (a connection pattern similar to the Inception architecture). Right: the architecture of the transformer-based base model, with the forward and backward propagation paths. During forward propagation, the prompt generator on the left is inserted in parallel into the specified "Prompt Layer" of the model on the right; the generated prompt is concatenated with the output of the Prompt Layer and passed to the next layer. During backpropagation, only the prompt generator network is fine-tuned.
Figure 2: Evaluation on a one-sentence task and a two-sentence task. Left: comparison of five initialization methods. Right: effect of adding a self-connection.
Figure 3: LIPT performance on different tasks. Left: comparison of bottleneck counts. Right: different module sizes.
Figure 4: Performance trends of two single-sentence tasks and two two-sentence tasks at different insertion layers. The RoBERTa-large backbone model is used, selecting even-numbered layers between the 10th and 24th. The shaded area shows the mean and standard deviation of three different random runs.
22 pages, 7770 KiB  
Article
Advancing Arabic Word Embeddings: A Multi-Corpora Approach with Optimized Hyperparameters and Custom Evaluation
by Azzah Allahim and Asma Cherif
Appl. Sci. 2024, 14(23), 11104; https://doi.org/10.3390/app142311104 - 28 Nov 2024
Viewed by 454
Abstract
The expanding Arabic user base presents a unique opportunity for researchers to tap into vast online Arabic resources. However, the lack of reliable Arabic word embedding models and the limited availability of Arabic corpora pose significant challenges. This paper addresses these gaps by developing and evaluating Arabic word embedding models trained on diverse Arabic corpora, investigating how varying hyperparameter values impact model performance across different NLP tasks. To train our models, we collected data from three distinct sources: Wikipedia, newspapers, and 32 Arabic books, each selected to capture specific linguistic and contextual features of Arabic. Using advanced techniques such as Word2Vec and FastText, we experimented with different hyperparameter configurations, such as vector size, window size, and training algorithm (CBOW and skip-gram), to analyze their impact on model quality. Our models were evaluated using a range of NLP tasks, including sentiment analysis, similarity tests, and an adapted analogy test designed specifically for Arabic. The findings revealed that both corpus size and hyperparameter settings had notable effects on performance. For instance, in the analogy test, a larger vocabulary significantly improved outcomes, with the FastText skip-gram models excelling at accurately solving analogy questions. For sentiment analysis, vocabulary size was critical, while in similarity scoring, the FastText models achieved the highest scores, particularly with smaller window and vector sizes. Overall, our models demonstrated strong performance, achieving 99% and 90% accuracies in sentiment analysis and the analogy test, respectively, along with a similarity score of 8 out of 10. These results underscore the value of our models as a robust tool for Arabic NLP research, addressing a pressing need for high-quality Arabic word embeddings. Full article
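The analogy test the authors adapted follows the standard 3CosAdd formulation: solve a : b :: c : ? by taking the word whose vector is closest to b − a + c, excluding the query words. The toy English vectors below are hypothetical stand-ins for the trained Arabic embeddings.

```python
# 3CosAdd analogy solver over a toy vocabulary (sketch).
import numpy as np

vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.1, 0.8, 0.9]),
}

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c):
    """Solve a : b :: c : ?, excluding the three query words."""
    target = vocab[b] - vocab[a] + vocab[c]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(target, candidates[w]))

print(analogy("man", "king", "woman"))  # queen
```

With real Word2Vec or FastText models, `vocab` would be replaced by the trained keyed vectors, and accuracy is the fraction of analogy questions answered correctly.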
Show Figures

Figure 1: Illustration of the skip-gram and continuous bag-of-words (CBOW) models.
Figure 2: Building Arabic Word Embeddings Methodology.
Figure 3: Illustration of the corpus preprocessing steps.
Figure 4: A subset of the Google Analogy Test.
Figure 5: Word2Vec models analogy test results.
Figure 6: Model analogy test performance based on window sizes of 5, 7, and 10 against vector sizes of 200, 300, and 400 for the CBOW and skip-gram approaches: (a) Word2Vec, Watan corpus; (b) FastText, Watan corpus; (c) Word2Vec, Books corpus; (d) FastText, Books corpus; (e) Word2Vec, Wiki corpus; (f) FastText, Wiki corpus.
Figure 7: FastText models analogy test results.
Figure 8: FastText models sentiment analysis results.
Figure 9: Word2Vec models sentiment analysis results.
Figure 10: Sentiment analysis accuracy of the models based on window sizes of 5, 7, and 10 against vector sizes of 200, 300, and 400 for the CBOW and skip-gram approaches: (a) Word2Vec, Watan corpus; (b) FastText, Watan corpus; (c) Word2Vec, Books corpus; (d) FastText, Books corpus; (e) Word2Vec, Wiki corpus; (f) FastText, Wiki corpus.
Figure 11: FastText models similarity test results.
Figure 12: Word2Vec models similarity test results.