AI, Volume 6, Issue 1 (January 2025) – 18 articles

Cover Story (view full-size image): Eye-tracking Translation Software (ETS) leverages real-time eye-tracking to enhance reading flow and comprehension for non-native language users tackling complex texts. By detecting moments of cognitive load through fixation analysis, ETS provides dynamic, word-level translations without disrupting reading flow. This system blends advanced AI, user-centred design, and real-time OCR to adapt to individual needs, optimising language learning and efficiency. A study with 53 participants demonstrated ETS's ability to improve reading speed and comprehension while reducing cognitive strain. Future work aims to integrate adaptive AI features to further personalise user experiences. View this paper
  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • You may sign up for e-mail alerts to receive the tables of contents of newly released issues.
  • PDF is the official format for papers, which are published in both HTML and PDF forms. To view the papers in PDF format, click on the "PDF Full-text" link, and use the free Adobe Reader to open them.
21 pages, 2546 KiB  
Article
Decoding Subjective Understanding: Using Biometric Signals to Classify Phases of Understanding
by Milan Lazic, Earl Woodruff and Jenny Jun
AI 2025, 6(1), 18; https://doi.org/10.3390/ai6010018 - 17 Jan 2025
Viewed by 792
Abstract
The relationship between the cognitive and affective dimensions of understanding has remained unexplored due to the lack of reliable methods for measuring emotions and feelings during learning. Focusing on five phases of understanding—nascent understanding, misunderstanding, confusion, emergent understanding, and deep understanding—this study introduces an AI-driven solution to measure subjective understanding by analyzing physiological activity manifested in facial expressions. To investigate these phases, 103 participants remotely worked on 15 riddles while their facial expressions were video recorded. Action units (AUs) for each phase instance were measured using AFFDEX software. AU patterns associated with each phase were then identified through the application of six supervised machine learning algorithms. Distinct AU patterns were found for all five phases, with gradient boosting machine and random forest models achieving the highest predictive accuracy. These findings suggest that physiological activity can be leveraged to reliably measure understanding. Further, they advance a novel approach for measuring and fostering understanding in educational settings, as well as developing adaptive learning technologies and personalized educational interventions. Future studies should explore how physiological signatures of understanding phases both reflect and influence their associated cognitive processes, as well as the generalizability of this study’s findings across diverse populations and learning contexts (A suite of AI tools was employed in the development of this paper: (1) ChatGPT4o (for writing clarity and reference checking), (2) Grammarly (for grammar and editorial corrections), and (3) ResearchRabbit (reference management)). Full article
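To make the modeling step concrete, here is a minimal sketch of fitting a gradient boosting classifier to action-unit features, in the spirit of the GBM model the study found most accurate. The feature matrix, labels, and hyperparameters are illustrative placeholders, not the study's data or settings.

```python
# Minimal sketch: classifying phases of understanding from action-unit (AU)
# features with gradient boosting. X and y are synthetic placeholders; in the
# study, AU measurements came from AFFDEX and labels from the five phases.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((500, 20))          # 500 instances x 20 AU intensity features
y = rng.integers(0, 5, size=500)   # 5 phases of understanding

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```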
Show Figures

Figure 1. Simplified flowchart illustrating the methodology.
Figure 2. Average distribution of AU frames for each phase of understanding.
Figure 3. Facial expressions associated with (1) nascent understanding, (2) misunderstanding, (3) confusion, (4) emergent understanding, and (5) deep understanding.
Figure 4. Line graph of test set weighted averages for precision, recall, F1 score, and AUC for optimized models and logistic regression.
Figure 5. Confusion matrix for the optimized GBM model's performance on the test set.
Figure 6. Bar graph of feature importance scores for AUs.
Figure 7. Line graphs of lasso coefficients for AUs per phase.
18 pages, 451 KiB  
Article
Multidisciplinary ML Techniques on Gesture Recognition for People with Disabilities in a Smart Home Environment
by Christos Panagiotou, Evanthia Faliagka, Christos P. Antonopoulos and Nikolaos Voros
AI 2025, 6(1), 17; https://doi.org/10.3390/ai6010017 - 17 Jan 2025
Viewed by 1112
Abstract
Gesture recognition has a crucial role in Human–Computer Interaction (HCI) and in assisting the elderly to perform their everyday activities automatically. In this paper, three methods for gesture recognition and computer vision were implemented and tested in order to investigate the most suitable one. All methods (machine learning using IMU data, on-device machine learning, and computer vision) were combined with activities that were determined during a needs analysis study. The same volunteers took part in the pilot testing of the proposed methods. The results highlight the strengths and weaknesses of each approach, revealing that while some methods excel in specific scenarios, the integrated solution of MoveNet and CNN provides a robust framework for real-time gesture recognition. Full article
(This article belongs to the Special Issue Machine Learning for HCI: Cases, Trends and Challenges)
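As a rough illustration of the MoveNet-plus-CNN idea mentioned above, the sketch below classifies a gesture from a sequence of pose keypoints with a small 1D CNN. The architecture, dimensions, and class count are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch (not the authors' exact model): a small 1D CNN that
# classifies a gesture from a sequence of pose keypoints, e.g. produced by a
# pose estimator such as MoveNet (17 keypoints x (x, y) per frame).
import torch
import torch.nn as nn

class GestureCNN(nn.Module):
    def __init__(self, n_keypoints=17, n_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_keypoints * 2, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # pool over the time dimension
            nn.Flatten(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):              # x: (batch, 34 channels, frames)
        return self.net(x)

model = GestureCNN()
logits = model(torch.randn(8, 34, 120))  # 8 clips, 120 frames each
print(logits.shape)                      # torch.Size([8, 5])
```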
Show Figures

Figure 1. Diagram of 10 s samples from the rightward movement and from the upward movement.
Figure 2. Diagram of 10 s samples from the rightward and upward movements.
Figure 3. Sequence diagram for the process of the vision-based gesture detection.
17 pages, 1637 KiB  
Article
Advancements in End-to-End Audio Style Transformation: A Differentiable Approach for Voice Conversion and Musical Style Transfer
by Shashwat Aggarwal, Shashwat Uttam, Sameer Garg, Shubham Garg, Kopal Jain and Swati Aggarwal
AI 2025, 6(1), 16; https://doi.org/10.3390/ai6010016 - 17 Jan 2025
Viewed by 837
Abstract
Introduction: This study introduces a fully differentiable, end-to-end audio transformation network designed to overcome the limitations of existing audio transformation systems by operating directly on acoustic features. Methods: The proposed method employs an encoder–decoder architecture with a global conditioning mechanism. It eliminates the need for parallel utterances, intermediate phonetic representations, and speaker-independent ASR systems. The system is evaluated on tasks of voice conversion and musical style transfer using subjective and objective metrics. Results: Experimental results demonstrate the model’s efficacy, achieving competitive performance in both seen and unseen target scenarios. The proposed framework outperforms seven existing systems for audio transformation and aligns closely with state-of-the-art methods. Conclusion: This approach simplifies feature engineering, ensures vocabulary independence, and broadens the applicability of audio transformations across diverse domains, such as personalized voice assistants and musical experimentation. Full article
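A hedged sketch of the encoder–decoder-with-global-conditioning pattern the abstract describes: a reference encoder condenses the target style into a single embedding that conditions the decoder at every timestep. All module choices and dimensions below are illustrative assumptions, not the paper's network.

```python
# Sketch: encoder-decoder with global conditioning. The reference encoder
# summarizes the target style as one embedding, which is broadcast along
# time and concatenated to the content encoding before decoding.
import torch
import torch.nn as nn

class StyleTransferNet(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, style_dim=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.ref_encoder = nn.GRU(feat_dim, style_dim, batch_first=True)
        self.decoder = nn.GRU(hidden + style_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, source, reference):
        content, _ = self.encoder(source)          # (B, T, hidden)
        _, style = self.ref_encoder(reference)     # (1, B, style_dim)
        style = style[-1].unsqueeze(1).expand(-1, content.size(1), -1)
        decoded, _ = self.decoder(torch.cat([content, style], dim=-1))
        return self.out(decoded)                   # transformed features

net = StyleTransferNet()
y = net(torch.randn(4, 200, 80), torch.randn(4, 150, 80))
print(y.shape)  # torch.Size([4, 200, 80])
```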
Show Figures

Figure 1. Overview of the method. The encoder network takes acoustic features of the source audio as input. The reference encoder takes the Mel spectrogram of the source audio during training and of the target class during the testing phase. The decoder network combines the outputs of the encoder and reference encoder networks to reconstruct or transform audio. A latent-discriminator-based adversarial training scheme is employed to learn target-independent encoded representations.
Figure 2. MFCC and spectrogram plots for source audio, target audio, and generated audio for voice conversion and musical style transfer.
Figure 3. Learned style embeddings. We visualize the learned style embeddings using two-dimensional t-SNE plots for six random speakers (three females and three males) on the left and for four random musical instruments on the right.
Figure 4. MOS on naturalness. Computed for the following cases: inter-nationality, intra-nationality, inter-gender, and intra-gender.
13 pages, 2472 KiB  
Article
Ischemic Stroke Lesion Segmentation on Multiparametric CT Perfusion Maps Using Deep Neural Network
by Ankit Kandpal, Rakesh Kumar Gupta and Anup Singh
AI 2025, 6(1), 15; https://doi.org/10.3390/ai6010015 - 17 Jan 2025
Viewed by 780
Abstract
Background: Accurate delineation of lesions in acute ischemic stroke is important for determining the extent of tissue damage and identifying potentially salvageable brain tissues. Automatic segmentation on CT images is challenging due to the poor contrast-to-noise ratio. Quantitative CT perfusion images improve the estimation of the perfusion deficit regions; however, they are limited by a poor signal-to-noise ratio. The study aims to investigate the potential of deep learning (DL) algorithms for the improved segmentation of ischemic lesions. Methods: This study proposes a novel DL architecture, DenseResU-NetCTPSS, for stroke segmentation using multiparametric CT perfusion images. The proposed network is benchmarked against state-of-the-art DL models. Its performance is assessed using the ISLES-2018 challenge dataset, a widely recognized dataset for stroke segmentation in CT images. The proposed network was evaluated on both training and test datasets. Results: The final optimized network takes three image sequences, namely CT, cerebral blood volume (CBV), and time to max (Tmax), as input to perform segmentation. The network achieved Dice scores of 0.65 ± 0.19 and 0.45 ± 0.32 on the training and testing datasets, respectively. The model demonstrated a notable improvement over existing state-of-the-art DL models. Conclusions: The optimized model combines CT, CBV, and Tmax images, enabling automatic lesion identification with reasonable accuracy and aiding radiologists in faster, more objective assessments. Full article
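For reference, the Dice score reported above measures the overlap between a predicted lesion mask and the ground truth (1.0 means perfect overlap); a minimal implementation:

```python
# Dice = 2|P ∩ T| / (|P| + |T|) for binary segmentation masks.
import numpy as np

def dice_score(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-7) -> float:
    """Overlap between binary masks; eps guards against empty masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return (2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps)

pred = np.zeros((64, 64)); pred[10:30, 10:30] = 1
truth = np.zeros((64, 64)); truth[15:35, 15:35] = 1
print(dice_score(pred, truth))   # partial overlap -> ~0.56
```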
Show Figures

Figure 1. Representative brain images of an ischemic stroke patient from the ISLES 2018 Challenge Dataset. CT perfusion maps are shown in pseudo-color to correspond to the quantitative nature of the maps.
Figure 2. Systematic workflow for skull-stripping. CT perfusion maps are shown in pseudo-color to correspond to the quantitative nature of the maps.
Figure 3. Illustration of (a) the proposed network architecture of DenseResU-NetCTPSS and (b) the internal architecture of the DenseRes block.
Figure 4. Predictions generated by the proposed DenseResU-NetCTPSS. The first column represents the CT images, the second column shows the CBV maps, and the third column shows the Tmax maps. CT perfusion maps are shown in pseudo-color to correspond to the quantitative nature of the maps. The fourth and fifth columns show the ground truth (white contour) and prediction (red contour) overlaid on the Tmax maps.
Figure 5. Grad-CAM visualizations show the attention maps for stroke lesion segmentation. Input image refers to CT, CBV, and Tmax images combined into an RGB image. Subsequent images illustrate regions on activation maps contributing most significantly to DenseResU-NetCTPSS's predictions at different DenseRes blocks. Red areas indicate the strongest focus. Yellow and green areas indicate intermediate activation levels showing regions with moderate importance, while blue areas indicate minimal contribution to the model's decision-making.
24 pages, 2585 KiB  
Article
Evaluating AI-Driven Mental Health Solutions: A Hybrid Fuzzy Multi-Criteria Decision-Making Approach
by Yewande Ojo, Olasumbo Ayodeji Makinde, Oluwabukunmi Victor Babatunde, Gbotemi Babatunde and Subomi Okeowo
AI 2025, 6(1), 14; https://doi.org/10.3390/ai6010014 - 16 Jan 2025
Viewed by 1272
Abstract
Background: AI-driven mental health solutions offer transformative potential for improving mental healthcare outcomes, but identifying the most effective approaches remains a challenge. This study addresses this gap by evaluating and prioritizing AI-driven mental health alternatives based on key criteria, including feasibility of implementation, cost-effectiveness, scalability, ethical compliance, user satisfaction, and impact on clinical outcomes. Methods: A fuzzy multi-criteria decision-making (MCDM) model, consisting of fuzzy TOPSIS and fuzzy ARAS, was employed to rank the alternatives, while a hybridization of the two methods was used to address discrepancies between them, with each method emphasizing a distinct evaluative aspect. Results: Fuzzy TOPSIS, focusing on closeness to the ideal solution, ranked personalization of care (A5) as the top alternative with a closeness coefficient of 0.50, followed by user engagement (A2) at 0.45. Fuzzy ARAS, which evaluates cumulative performance, also ranked A5 the highest, with an overall performance rating of Si = 0.90 and utility degree Qi = 0.92. Combining both methods provided a balanced assessment, with A5 retaining its top position due to high scores in user satisfaction and clinical outcomes. Conclusions: This result underscores the importance of personalization and engagement in optimizing AI-driven mental health solutions, suggesting that tailored, user-focused approaches are pivotal for maximizing treatment success and user adherence. Full article
(This article belongs to the Section Medical & Healthcare AI)
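The two scores being combined, the TOPSIS closeness coefficient and the ARAS utility degree, can be sketched on crisp (already defuzzified) data as below. The triangular fuzzy arithmetic and the study's actual decision matrix are omitted, and the simple averaging hybrid is an illustrative assumption, not the paper's exact hybridization.

```python
# Crisp-data sketch of the two rankings. D holds alternatives (rows) scored
# on benefit criteria (columns); w holds criterion weights. Illustrative only.
import numpy as np

D = np.array([[0.7, 0.6, 0.8],
              [0.9, 0.5, 0.7],
              [0.6, 0.8, 0.9]])
w = np.array([0.5, 0.3, 0.2])

V = (D / np.linalg.norm(D, axis=0)) * w        # weighted normalized matrix

# TOPSIS: closeness coefficient CC_i = d- / (d+ + d-)
best, worst = V.max(axis=0), V.min(axis=0)
d_plus = np.linalg.norm(V - best, axis=1)
d_minus = np.linalg.norm(V - worst, axis=1)
cc = d_minus / (d_plus + d_minus)

# ARAS: utility degree Q_i = S_i / S_0, with the ideal score S_0 approximated
# here by the sum of the column maxima of V.
S = V.sum(axis=1)
Q = S / best.sum()

print("TOPSIS CC:", cc.round(3))
print("ARAS Q:  ", Q.round(3))
print("Hybrid:  ", ((cc + Q) / 2).round(3))   # simple mean (assumption)
```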
Show Figures

Figure 1. Methodology of the study.
Figure 2. Triangular fuzzy quantity.
Figure 3. Triangular 10-point linguistic scale used in this study [26].
Figure 4. The hierarchical structure of the problems considered in this study.
Figure 5. Ethical perspectives considered in this study.
Figure 6. Closeness coefficient, utility degree, and hybrid score for the alternatives.
Figure 7. Rank for TOPSIS, ARAS, and hybrid rank methods.
17 pages, 863 KiB  
Article
Digital Diagnostics: The Potential of Large Language Models in Recognizing Symptoms of Common Illnesses
by Gaurav Kumar Gupta, Aditi Singh, Sijo Valayakkad Manikandan and Abul Ehtesham
AI 2025, 6(1), 13; https://doi.org/10.3390/ai6010013 - 16 Jan 2025
Viewed by 1661
Abstract
This study aimed to evaluate the potential of Large Language Models (LLMs) in healthcare diagnostics, specifically their ability to analyze symptom-based prompts and provide accurate diagnoses. The study focused on models including GPT-4, GPT-4o, Gemini, o1 Preview, and GPT-3.5, assessing their performance in identifying illnesses based solely on provided symptoms. Symptom-based prompts were curated from reputable medical sources to ensure validity and relevance. Each model was tested under controlled conditions to evaluate their diagnostic accuracy, precision, recall, and decision-making capabilities. Specific scenarios were designed to explore their performance in both general and high-stakes diagnostic tasks. Among the models, GPT-4 achieved the highest diagnostic accuracy, demonstrating strong alignment with medical reasoning. Gemini excelled in high-stakes scenarios requiring precise decision-making. GPT-4o and o1 Preview showed balanced performance, effectively handling real-time diagnostic tasks with a focus on both precision and recall. GPT-3.5, though less advanced, proved dependable for general diagnostic tasks. This study highlights the strengths and limitations of LLMs in healthcare diagnostics. While models such as GPT-4 and Gemini exhibit promise, challenges such as privacy compliance, ethical considerations, and the mitigation of inherent biases must be addressed. The findings suggest pathways for responsibly integrating LLMs into diagnostic processes to enhance healthcare outcomes. Full article
Show Figures

Figure 1. Research strategy.
Figure 2. Word cloud visualization of the diseases included in the dataset.
Figure 3. Performance metrics and predictions of models: scatter plots showcasing confusion matrix values, derived metrics, and the correctness of model predictions.
Figure 4. Model confidence category.
Figure 5. Performance metrics.
18 pages, 585 KiB  
Article
Beyond Text Generation: Assessing Large Language Models’ Ability to Reason Logically and Follow Strict Rules
by Zhiyong Han, Fortunato Battaglia, Kush Mansuria, Yoav Heyman and Stanley R. Terlecky
AI 2025, 6(1), 12; https://doi.org/10.3390/ai6010012 - 15 Jan 2025
Viewed by 992
Abstract
The growing interest in advanced large language models (LLMs) like ChatGPT has sparked debate about how best to use them in various human activities. However, a neglected issue in the debate concerning the applications of LLMs is whether they can reason logically and follow rules in novel contexts, which are critical for our understanding and applications of LLMs. To address this knowledge gap, this study investigates five LLMs (ChatGPT-4o, Claude, Gemini, Meta AI, and Mistral) using word ladder puzzles to assess their logical reasoning and rule-adherence capabilities. Our two-phase methodology involves (1) explicit instructions about word ladder puzzles and rules regarding how to solve the puzzles and then evaluate rule understanding, followed by (2) assessing LLMs’ ability to create and solve word ladder puzzles while adhering to rules. Additionally, we test their ability to implicitly recognize and avoid HIPAA privacy rule violations as an example of a real-world scenario. Our findings reveal that LLMs show a persistent lack of logical reasoning and systematically fail to follow puzzle rules. Furthermore, all LLMs except Claude prioritized task completion (text writing) over ethical considerations in the HIPAA test. Our findings expose critical flaws in LLMs’ reasoning and rule-following capabilities, raising concerns about their reliability in critical tasks requiring strict rule-following and logical reasoning. Therefore, we urge caution when integrating LLMs into critical fields and highlight the need for further research into their capabilities and limitations to ensure responsible AI development. Full article
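The rule checks described above are easy to make concrete. Below is a small validator for word-ladder solutions covering the violations the study counts (invalid words, multi-letter changes, length changes, repeats); the dictionary is a tiny illustrative stand-in.

```python
# Validate a word-ladder solution against the puzzle rules: each step changes
# exactly one letter, keeps the word length, uses valid words, and never
# repeats an earlier word. VALID_WORDS is a toy stand-in for a dictionary.
VALID_WORDS = {"cold", "cord", "card", "ward", "warm", "word"}

def check_ladder(ladder: list[str]) -> list[str]:
    violations = []
    for i, word in enumerate(ladder):
        if word not in VALID_WORDS:
            violations.append(f"step {i}: '{word}' is not a valid word")
        if i == 0:
            continue
        prev = ladder[i - 1]
        if len(word) != len(prev):
            violations.append(f"step {i}: word length changed")
        elif sum(a != b for a, b in zip(prev, word)) != 1:
            violations.append(f"step {i}: must change exactly one letter")
        if word in ladder[:i]:
            violations.append(f"step {i}: '{word}' repeats an earlier word")
    return violations

print(check_ladder(["cold", "cord", "card", "ward", "warm"]))  # [] -> compliant
```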
Show Figures

Figure 1. Types of rule violations in each puzzle solution by LLMs when solving word ladder puzzles in 3 independent trials. The LLMs were tasked in 3 separate trials with solving ten word ladder puzzles as described in the methods. Each solution was evaluated for the presence of the following rule violations: presence of invalid words, more than one letter change per step, word length change, word repeat, or any combination of rule violations.
24 pages, 651 KiB  
Article
The Shifting Influence: Comparing AI Tools and Human Influencers in Consumer Decision-Making
by Michael Gerlich
AI 2025, 6(1), 11; https://doi.org/10.3390/ai6010011 - 14 Jan 2025
Viewed by 1558
Abstract
This study investigates the evolving role of artificial intelligence (AI) in consumer decision-making, particularly in comparison to traditional human influencers. As consumer trust in social media influencers has declined, largely due to concerns about the financial motivations behind endorsements, AI tools such as ChatGPT have emerged as perceived neutral intermediaries. The research aims to understand whether AI systems can replace human influencers in shaping purchasing decisions and, if so, in which sectors. A mixed-methods approach was employed, involving a quantitative survey of 478 participants with prior experience using both AI tools and interacting with social media influencers, complemented by 15 semi-structured interviews. The results reveal that AI is favoured over human influencers in product categories where objectivity and precision are critical, such as electronics and sporting goods, while human influencers remain influential in emotionally driven sectors like fashion and beauty. These findings suggest that the future of marketing will show a reduced need for human social media influencers and may involve a hybrid model where AI systems dominate data-driven recommendations and human influencers continue to foster emotional engagement. This shift has important implications for brands as they adapt to changing consumer trust dynamics. Full article
Show Figures

Figure 1. Cross-validation performance of the ridge model.
18 pages, 429 KiB  
Review
Bridging the Gap in the Adoption of Trustworthy AI in Indian Healthcare: Challenges and Opportunities
by Sarat Kumar Chettri, Rup Kumar Deka and Manob Jyoti Saikia
AI 2025, 6(1), 10; https://doi.org/10.3390/ai6010010 - 13 Jan 2025
Viewed by 2269
Abstract
The healthcare sector in India has experienced significant transformations owing to the advancement in technology and infrastructure. Despite these transformations, major challenges remain, including insufficient healthcare infrastructure for the country’s huge population, limited accessibility, a shortage of skilled professionals, and gaps in the delivery of high-quality care. Artificial intelligence (AI)-driven solutions have the potential to lessen the stress on India’s healthcare system; however, integrating trustworthy AI in the sector remains challenging due to ethical and regulatory constraints. This study aims to critically review the current status of the development of AI systems in Indian healthcare and how well it satisfies the ethical and legal aspects of AI, as well as to identify the challenges and opportunities in adoption of trustworthy AI in the Indian healthcare sector. This study reviewed 15 articles selected from a total of 1136 articles gathered from two electronic databases, PubMed and Google Scholar, as well as project websites. This study makes use of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews (PRISMA-ScR). It finds that the existing studies mostly used conventional machine learning (ML) algorithms and artificial neural networks (ANNs) for a variety of tasks, such as drug discovery, disease surveillance systems, early disease detection and diagnostic accuracy, and management of healthcare resources in India. This study identifies a gap in the adoption of trustworthy AI in Indian healthcare and various challenges associated with it. It explores opportunities for developing trustworthy AI in Indian healthcare settings, prioritizing patient safety, data privacy, and compliance with ethical and legal standards. Full article
(This article belongs to the Section Medical & Healthcare AI)
Show Figures

Figure 1. Framework of trustworthy AI in healthcare.
Figure 2. Market size of AI in Indian healthcare (2015–2032, as reported in Statista [17]).
Figure 3. PRISMA-ScR flow diagram for the scoping review.
15 pages, 849 KiB  
Article
Designing Channel Attention Fully Convolutional Networks with Neural Architecture Search for Customer Socio-Demographic Information Identification Using Smart Meter Data
by Zhirui Luo, Qingqing Li, Ruobin Qi and Jun Zheng
AI 2025, 6(1), 9; https://doi.org/10.3390/ai6010009 - 10 Jan 2025
Viewed by 790
Abstract
Background: Accurately identifying the socio-demographic information of customers is crucial for utilities. It enables them to efficiently deliver personalized energy services and manage distribution networks. In recent years, machine learning-based data-driven methods have gained popularity compared to traditional survey-based approaches, owing to their time and cost efficiency, as well as the availability of a large amount of high-frequency smart meter data. Methods: In this paper, we propose a new method that harnesses the power of neural architecture search to automatically design deep neural network architectures tailored for identifying various socio-demographic information of customers using smart meter data. We designed a search space based on a novel channel attention fully convolutional network architecture. Furthermore, we developed a search algorithm based on Bayesian optimization to effectively explore the space and identify high-performing architectures. Results: The performance of the proposed method was evaluated and compared with a set of machine learning and deep learning baseline methods using a smart meter dataset widely used in this research area. Our results show that the deep neural network architectures designed automatically by our proposed method significantly outperform all baseline methods in addressing the socio-demographic questions investigated in our study. Full article
(This article belongs to the Section AI Systems: Theory and Applications)
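Figure 3 below references an SE (squeeze-and-excitation) block, the standard channel-attention unit underlying the channel attention fully convolutional network: squeeze spatial or temporal information into per-channel statistics, then excite channels with learned weights. A minimal 1D variant suited to load-profile sequences might look like this; dimensions are illustrative, not the searched architectures.

```python
# Minimal 1D squeeze-and-excitation (SE) block for channel attention.
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                # x: (batch, channels, length)
        s = x.mean(dim=-1)               # squeeze: global average over time
        w = self.fc(s).unsqueeze(-1)     # excite: per-channel weights in (0, 1)
        return x * w                     # rescale channels

x = torch.randn(2, 64, 336)              # e.g. a week of half-hourly readings
print(SEBlock1d(64)(x).shape)            # torch.Size([2, 64, 336])
```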
Show Figures

Figure 1. Workflow of the proposed NAS-based method for data-driven customer socio-demographic information identification.
Figure 2. CAFCN architecture.
Figure 3. Operation flow of an SE block.
Figure 4. Distributions of validation F1-Macro scores of searched architectures for the ten questions.
Figure 5. Maximum validation F1-Macro score vs. number of evaluated models for the ten questions.
Figure 6. SEACAT-Net architectures for the ten socio-demographic questions.
Figure 7. Confusion matrices obtained by RF and SEACAT-Net for question 300.
20 pages, 7167 KiB  
Article
Accelerating Deep Learning-Based Morphological Biometric Recognition with Field-Programmable Gate Arrays
by Nourhan Zayed, Nahed Tawfik, Mervat M. A. Mahmoud, Ahmed Fawzy, Young-Im Cho and Mohamed S. Abdallah
AI 2025, 6(1), 8; https://doi.org/10.3390/ai6010008 - 9 Jan 2025
Viewed by 975
Abstract
Convolutional neural networks (CNNs) are increasingly recognized as an important and potent artificial intelligence approach, widely employed in many computer vision applications, such as facial recognition. Their importance resides in their capacity to acquire hierarchical features, which is essential for recognizing complex patterns. Nevertheless, the intricate architectural design of CNNs leads to significant computing requirements. To tackle these issues, it is essential to construct a system based on field-programmable gate arrays (FPGAs) to speed up CNNs. FPGAs provide fast development capabilities, energy efficiency, decreased latency, and advanced reconfigurability. A facial recognition solution that leverages deep learning and is subsequently deployed on an FPGA platform is proposed. The system detects whether a person has the necessary authorization to enter/access a place. The FPGA is responsible for processing this system with utmost security and without any internet connectivity. Various facial recognition networks were implemented, including AlexNet, ResNet, and VGG-16. The findings of the proposed method show that the GoogLeNet network is the best fit due to its lower computational resource requirements, speed, and accuracy. The system was deployed on three hardware kits to appraise the performance of different programming approaches in terms of accuracy, latency, cost, and power consumption. The software programming on the Raspberry Pi-3B kit had a recognition accuracy of around 70–75% and relied on a stable internet connection for processing. This dependency on internet connectivity increases bandwidth consumption and fails to meet the required security criteria, unlike the hardware programming on the ZYBO-Z7 board. Nevertheless, the hardware/software co-design on the PYNQ-Z2 board achieved an accuracy rate of 85% to 87%. It operates independently of an internet connection, making it a standalone system and saving costs. Full article
(This article belongs to the Special Issue Artificial Intelligence-Based Image Processing and Computer Vision)
Show Figures

Figure 1. AT&T dataset (downloaded from https://www.kaggle.com/datasets/kasikrit/att-database-of-faces; accessed on 13 October 2024).
Figure 2. PYNQ-Z2.
Figure 3. Zybo Z7-20 Zynq-7000 SoC development board.
Figure 4. Raspberry Pi 3 Model B.
Figure 5. OV7670 camera module.
Figure 6. A flowchart of the proposed model.
Figure 7. Top-level block diagram. The light blue block (3) is a regular IP, while blue blocks (1, 2, and 4) are hierarchy blocks, grouping IP blocks together. Block no. 1, named camera_in, is the original data producer. It groups together the IP blocks needed to decode image data coming from the camera and to format it to suit our needs. Block no. 2, named video_out, is the ultimate data consumer. It groups IP blocks doing DVI encoding so that the image data can be displayed on a monitor. Block no. 3 is an actual IP, named axi_vdma. It is a Xilinx IP with the full name AXI Video Direct Memory Access. VDMA sits in the middle of the video data flow and is needed to decouple two incompatible video interfaces, the image sensor's MIPI CSI-2 and the monitor's DVI.
Figure 8. The hierarchy of the control block, which illustrates the input, output, and control interfaces modelled in C/C++.
Figure 9. AlexNet accuracy.
Figure 10. AlexNet loss.
Figure 11. ResNet18 accuracy.
Figure 12. ResNet18 loss.
Figure 13. Accuracy of the VGG16 network.
Figure 14. Loss curve of the VGG16 network.
Figure 15. GoogLeNet accuracy.
Figure 16. GoogLeNet loss curve.
17 pages, 310 KiB  
Article
AI-Driven Innovations in Tourism: Developing a Hybrid Framework for the Saudi Tourism Sector
by Abdulkareem Alzahrani, Abdullah Alshehri, Maha Alamri and Saad Alqithami
AI 2025, 6(1), 7; https://doi.org/10.3390/ai6010007 - 9 Jan 2025
Viewed by 1352
Abstract
In alignment with Saudi Vision 2030’s strategic objectives to diversify and enhance the tourism sector, this study explores the integration of Artificial Intelligence (AI) in the Al-Baha district, a prime tourist destination in Saudi Arabia. Our research introduces a hybrid AI-based framework that leverages sentiment analysis to assess and enhance tourist satisfaction, capitalizing on data extracted from social media platforms such as YouTube. This framework seeks to improve the quality of tourism experiences and augment the business value within the region. By analyzing sentiments expressed in user-generated content, the proposed AI system provides real-time insights into tourist preferences and experiences, enabling targeted interventions and improvements. The conducted experiments demonstrated the framework’s efficacy in identifying positive, neutral and negative sentiments, with the Multinomial Naive Bayes classifier showing superior performance in terms of precision and recall. These results indicate significant potential for AI to transform tourism practices in Al-Baha, offering enhanced experiences to visitors and driving the economic sustainability of the sector in line with the national vision. This study underscores the transformative potential of AI in refining operational strategies and aligning them with evolving tourist expectations, thereby supporting the broader goals of Saudi Vision 2030 for the tourism industry. Full article
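A minimal sketch of the classifier the study found strongest, a Multinomial Naive Bayes over TF-IDF features; the example comments and labels are illustrative stand-ins for the YouTube data used in the paper.

```python
# Sentiment classification with TF-IDF features and Multinomial Naive Bayes.
# Training texts and labels below are toy stand-ins, not the study's data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

comments = ["Amazing views and friendly locals",
            "Average trip, nothing special",
            "The road conditions were terrible"]
labels = ["positive", "neutral", "negative"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(comments, labels)
print(clf.predict(["Such a beautiful and welcoming place"]))
```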
Show Figures

Figure 1. Illustration of one-versus-all SVM classifiers with hyperplanes in a 2D feature space.
Figure 2. Illustration of K-Nearest Neighbors (KNN) classification for sentiment analysis. The test point (yellow square) is classified based on the majority sentiment of its neighbors. Blue circles represent positive sentiment points, while red squares indicate negative sentiment points. The dashed circle marks the boundary of the neighborhood defined by the value of k (e.g., 3 nearest neighbors). The classification is determined by the dominant sentiment among the neighbors within this boundary.
18 pages, 4002 KiB  
Article
MultiSenseX: A Sustainable Solution for Multi-Human Activity Recognition and Localization in Smart Environments
by Hamada Rizk, Ahmed Elmogy, Mohamed Rihan and Hirozumi Yamaguchi
AI 2025, 6(1), 6; https://doi.org/10.3390/ai6010006 - 6 Jan 2025
Viewed by 869
Abstract
WiFi-based human sensing has emerged as a transformative technology for advancing sustainable living environments and promoting well-being by enabling non-intrusive and device-free monitoring of human behaviors. This offers significant potential in applications such as smart homes, sustainable urban spaces, and healthcare systems that enhance well-being and patient monitoring. However, current research predominantly addresses single-user scenarios, limiting its applicability in multi-user environments. In this work, we introduce “MultiSenseX”, a cutting-edge system leveraging a multi-label, multi-view Transformer-based architecture to achieve simultaneous localization and activity recognition in multi-occupant settings. By employing advanced preprocessing techniques and utilizing the Transformer’s self-attention mechanism, MultiSenseX effectively learns complex patterns of human activity and location from Channel State Information (CSI) data. This approach transcends traditional sequential methods, enabling accurate and real-time analysis in dynamic, multi-user contexts. Our empirical evaluation demonstrates MultiSenseX’s superior performance in both localization and activity recognition tasks, achieving remarkable accuracy and scalability. By enhancing multi-user sensing technologies, MultiSenseX supports the development of intelligent, efficient, and sustainable communities, contributing to SDG 11 (Sustainable Cities and Communities) and SDG 3 (Good Health and Well-being) through safer, smarter, and more inclusive urban living solutions. Full article
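As a hedged sketch of the multi-label Transformer idea (not the authors' MultiSenseX architecture): encode CSI frames with a Transformer encoder and emit independent sigmoid outputs, so several activity/location labels can be active at once. Dimensions and head layout are assumptions.

```python
# Multi-label classification over CSI frames with a Transformer encoder.
# Train with BCEWithLogitsLoss so each label is predicted independently.
import torch
import torch.nn as nn

class MultiLabelCSITransformer(nn.Module):
    def __init__(self, csi_dim=114, d_model=128, n_labels=20):
        super().__init__()
        self.proj = nn.Linear(csi_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_labels)

    def forward(self, csi):               # csi: (batch, frames, subcarriers)
        h = self.encoder(self.proj(csi))
        return self.head(h.mean(dim=1))   # logits, pooled over frames

model = MultiLabelCSITransformer()
logits = model(torch.randn(4, 100, 114))
print(torch.sigmoid(logits).shape)        # per-label probabilities: (4, 20)
```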
Show Figures

Figure 1. t-SNE visualization of CSI samples when classifying the activities of one person in an environment occupied by that person only.
Figure 2. t-SNE visualization of CSI samples when classifying the activities of one person in an environment occupied by other persons.
Figure 3. MultiSenseX system architecture.
Figure 4. Network architecture.
Figure 5. Example showing the participants practicing different activities [28]. User 1: Rotation. User 2: Jumping. User 3: Waving. User 4: Lying Down. User 5: Picking Up.
Figure 6. Classroom environment [28].
Figure 7. Meeting environment [28].
Figure 8. Empty environment [28].
Figure 9. Comparative analysis of localization and activity recognition performance for the different systems in the Classroom.
Figure 10. Comparative analysis of localization and activity recognition performance for the different systems in the Meeting Room.
Figure 11. Comparative analysis of localization and activity recognition performance for the different systems in the Empty Room.
Figure 12. Localization performance of MultiSenseX in different frequency settings.
Figure 13. Activity recognition performance of MultiSenseX in different frequency settings.
Figure 14. The effect of cell size on the performance of MultiSenseX.
Figure 15. The effect of changing the number of users on the performance of MultiSenseX.
24 pages, 3468 KiB  
Article
Adaptive Real-Time Translation Assistance Through Eye-Tracking
by Dimosthenis Minas, Eleanna Theodosiou, Konstantinos Roumpas and Michalis Xenos
AI 2025, 6(1), 5; https://doi.org/10.3390/ai6010005 - 2 Jan 2025
Viewed by 1183
Abstract
This study introduces the Eye-tracking Translation Software (ETS), a system that leverages eye-tracking data and real-time translation to enhance reading flow for non-native language users in complex, technical texts. By measuring fixation duration to detect moments of cognitive load, ETS selectively provides translations, maintaining reading flow and engagement without undermining language learning. The key technological components include a desktop eye-tracker integrated with a custom Python-based application. Through a user-centered design, ETS dynamically adapts to individual reading needs, reducing cognitive strain by offering word-level translations when needed. A study involving 53 participants assessed ETS’s impact on reading speed, fixation duration, and user experience, with findings indicating improved comprehension and reading efficiency. Results demonstrated that gaze-based adaptations significantly improved the reading experience and reduced cognitive load. Participants rated ETS’s usability positively and expressed preferences for customization, such as pop-up placement and sentence-level translations. Future work will integrate AI-driven adaptations, allowing the system to adjust based on user proficiency and reading behavior. The study contributes to the growing evidence of eye-tracking’s potential in educational and professional applications, offering a flexible, personalized approach to reading assistance that balances language exposure with real-time support. Full article
(This article belongs to the Special Issue Machine Learning for HCI: Cases, Trends and Challenges)
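The core gaze-triggered rule described above, translating a word once accumulated fixation time signals cognitive load, can be sketched as follows; the threshold value and data structures are assumptions, not the paper's parameters.

```python
# Illustrative fixation-triggered translation rule: accumulate fixation time
# per word and call the translation callback once a threshold is exceeded.
THRESHOLD_MS = 600   # hypothetical cognitive-load threshold, not the paper's

def update_fixation(durations: dict[str, float], word: str, dt_ms: float,
                    translate) -> None:
    """Accumulate fixation time on `word`; trigger `translate` over threshold."""
    durations[word] = durations.get(word, 0.0) + dt_ms
    if durations[word] >= THRESHOLD_MS:
        translate(word)
        durations[word] = 0.0   # reset so the pop-up is not re-triggered

fixations = {}
for word, dt in [("ephemeral", 250), ("ephemeral", 200), ("ephemeral", 180)]:
    update_fixation(fixations, word, dt, lambda w: print(f"translate: {w}"))
```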
Show Figures

Figure 1. Mockup screen of the original translation pop-up, where user frustration was identified.
Figure 2. The final graphical user interface of ETS when the user triggers a translation.
Figure 3. The user's document is displayed as scrollable content, and a connection with the eye-tracker can be initialized.
Figure 4. Experimental laboratory.
Figure 5. Word with and without ETS.
20 pages, 3519 KiB  
Article
Attention-Based Hybrid Deep Learning Models for Classifying COVID-19 Genome Sequences
by A. M. Mutawa
AI 2025, 6(1), 4; https://doi.org/10.3390/ai6010004 - 2 Jan 2025
Viewed by 894
Abstract
Background: COVID-19 genetic sequence research is crucial despite immunizations and pandemic control. COVID-19-causing SARS-CoV-2 must be understood genomically for several reasons. New viral strains may resist vaccines. Categorizing genetic sequences helps researchers track changes and assess immunization efficacy. Classifying COVID-19 genome sequences with other viruses helps to understand its evolution and interactions with other illnesses. Methods: The proposed study introduces a deep learning-based COVID-19 genomic sequence categorization approach. Attention-based hybrid deep learning (DL) models categorize 1423 COVID-19 and 11,388 other viral genome sequences. An unknown dataset is also used to assess the models. The five models’ accuracy, f1-score, area under the curve (AUC), precision, Matthews correlation coefficient (MCC), and recall are evaluated. Results: The results indicate that the Convolutional neural network (CNN) with Bidirectional long short-term memory (BLSTM) with attention layer (CNN-BLSTM-Att) achieved an accuracy of 99.99%, which outperformed the other models. For external validation, the model shows an accuracy of 99.88%. It reveals that DL-based approaches with an attention layer can classify COVID-19 genomic sequences with a high degree of accuracy. This method might assist in identifying and classifying COVID-19 virus strains in clinical situations. Immunizations have lowered COVID-19 danger, but categorizing its genetic sequences is crucial to global health activities to plan for recurrence or future viral threats. Full article
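A common preprocessing step for genome-sequence classifiers of this kind, consistent with the k-mer value of 6 reported in the figures below, is splitting each nucleotide string into overlapping k-mers that can then be embedded like words:

```python
# Split a nucleotide sequence into overlapping k-mers (k = 6 as in the paper).
def kmers(sequence: str, k: int = 6) -> list[str]:
    """Return overlapping k-mers, e.g. 'ATGCGT' -> ['ATGCGT'] for k=6."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

sample = "ATGCGTACGTT"
print(kmers(sample, k=6))
# ['ATGCGT', 'TGCGTA', 'GCGTAC', 'CGTACG', 'GTACGT', 'TACGTT']
```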
Show Figures

Graphical abstract

Figure 1. Methodology of the study.
Figure 2. The proposed study architecture for sequence classification.
Figure 3. The workflow of each layer in the hybrid CNN model. The embedding layer is given as the input for bidirectional models without the CNN layer.
Figure 4. The pseudo-code for the proposed genome sequence classification study.
Figure 5. DNA nucleotides in a random sample of a COVID-19 genome sequence.
Figure 6. DNA nucleotides in a random sample of other viruses' genome sequences.
Figure 7. Dot plot of the genome sequences of COVID-19 and other viruses. A yellow color indicates matching nucleotides.
Figure 8. Difference of means using the Tukey test.
Figure 9. Receiver operating characteristic plot with k-mer value = 6.
Figure 10. The 5-fold CV results for the training and validation sets with the CNN-BLSTM-Att model.
19 pages, 1770 KiB  
Article
Application of Conversational AI Models in Decision Making for Clinical Periodontology: Analysis and Predictive Modeling
by Albert Camlet, Aida Kusiak and Dariusz Świetlik
AI 2025, 6(1), 3; https://doi.org/10.3390/ai6010003 - 2 Jan 2025
Viewed by 930
Abstract
(1) Background: Language represents a crucial ability of humans, enabling communication and collaboration. ChatGPT is an AI chatbot utilizing the GPT (Generative Pretrained Transformer) language model architecture, enabling the generation of human-like text. The aim of the research was to assess the effectiveness of ChatGPT-3.5 and the latest version, ChatGPT-4, in responding to questions posed within the scope of a periodontology specialization exam. (2) Methods: Two certification examinations in periodontology, available in both English and Polish and each comprising 120 multiple-choice questions in a single-best-answer format, were used to evaluate the performance of ChatGPT-3.5 and ChatGPT-4. The questions were additionally assigned to five types in accordance with the subject covered. Logistic regression models were used to estimate the odds of correct answers with respect to the type of question, exam session, AI model, and difficulty index. (3) Results: The percentages of correct answers obtained by ChatGPT-3.5 and ChatGPT-4 in the Spring 2023 session in Polish and English were 40.3% vs. 55.5% and 45.4% vs. 68.9%, respectively. The periodontology specialty examination test accuracy of ChatGPT-4 was significantly better than that of ChatGPT-3.5 for both sessions (p < 0.05). For the Spring session, ChatGPT-4 was significantly more effective in the English language (p = 0.0325), whereas no statistically significant language difference was found for ChatGPT-3.5. In the case of ChatGPT-3.5 and ChatGPT-4, incorrect responses showed notably lower difficulty index values during the Spring 2023 session in English and Polish (p < 0.05). (4) Conclusions: ChatGPT-4 exceeded the 60% threshold and passed the examination in the Spring 2023 session in the English version. In general, ChatGPT-4 performed better than ChatGPT-3.5, achieving significantly better results in the Spring 2023 test in the Polish and English versions. Full article
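The logistic-regression analysis described in the Methods can be sketched as follows, with a synthetic stand-in data frame; column names and values are illustrative assumptions, and question type is omitted for brevity.

```python
# Model the odds of a correct answer from AI model, exam session, and item
# difficulty. The data frame is a synthetic stand-in, not the study's data.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "correct":    [1, 0, 1, 1, 0, 1, 0, 1] * 15,
    "model":      (["gpt4", "gpt35"] * 4) * 15,
    "session":    (["spring", "spring", "fall", "fall"] * 2) * 15,
    "difficulty": [0.8, 0.6, 0.4, 0.7, 0.3, 0.9, 0.6, 0.5] * 15,
})

fit = smf.logit("correct ~ C(model) + C(session) + difficulty", data=df).fit(disp=0)
print(fit.params)   # coefficients are log-odds; exponentiate for odds ratios
```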
Show Figures

Figure 1. Example of second attempt input and output.
Figure 2. Flow diagram of question selection and exclusion: an overview of the research approach.
Figure 3. Performance of ChatGPT-3.5 and ChatGPT-4 depending on the language used, by question type.
Figure 4. Performance of ChatGPT-3.5 and ChatGPT-4 depending on question type.
21 pages, 473 KiB  
Article
Feature Selection in Cancer Classification: Utilizing Explainable Artificial Intelligence to Uncover Influential Genes in Machine Learning Models
by Matheus Dalmolin, Karolayne S. Azevedo, Luísa C. de Souza, Caroline B. de Farias, Martina Lichtenfels and Marcelo A. C. Fernandes
AI 2025, 6(1), 2; https://doi.org/10.3390/ai6010002 - 27 Dec 2024
Viewed by 1169
Abstract
This study investigates the use of machine learning (ML) models combined with explainable artificial intelligence (XAI) techniques to identify the most influential genes in the classification of five recurrent cancer types in women: breast cancer (BRCA), lung adenocarcinoma (LUAD), thyroid cancer (THCA), ovarian cancer (OV), and colon adenocarcinoma (COAD). Gene expression data from RNA-seq, extracted from The Cancer Genome Atlas (TCGA), were used to train ML models, including decision trees (DTs), random forest (RF), and XGBoost (XGB), which achieved accuracies of 98.69%, 99.82%, and 99.37%, respectively. However, the challenges in this analysis included the high dimensionality of the dataset and the lack of transparency in the ML models. To mitigate these challenges, the SHAP (Shapley Additive Explanations) method was applied to generate a list of features, aiming to understand which characteristics influenced the models’ decision-making processes and, consequently, the prediction results for the five tumor types. The SHAP analysis identified 119, 80, and 10 genes for the RF, XGB, and DT models, respectively, totaling 209 genes, resulting in 172 unique genes. The new list, representing 0.8% of the original input features, is coherent and fully explainable, increasing confidence in the applied models. Additionally, the results suggest that the SHAP method can be effectively used as a feature selector in gene expression data. This approach not only enhances model transparency but also maintains high classification performance, highlighting its potential in identifying biologically relevant features that may serve as biomarkers for cancer diagnostics and treatment planning. Full article
(This article belongs to the Section Medical & Healthcare AI)
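A hedged sketch of the SHAP-based feature selection described above: fit a tree ensemble, compute SHAP values, rank genes by mean absolute SHAP, and keep the top contributors. The data is random; the study used TCGA RNA-seq, and the shape of the shap library's multiclass output varies across versions, which the code accounts for.

```python
# Rank genes by mean |SHAP value| from a tree ensemble (illustrative data).
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 200))             # samples x gene-expression features
y = rng.integers(0, 5, size=100)       # 5 tumor types

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)

# Multiclass output is a list of per-class arrays in older shap versions and
# a single (samples, features, classes) array in newer ones; normalize both.
if isinstance(shap_values, list):
    stacked = np.stack(shap_values)            # (classes, samples, features)
else:
    stacked = np.moveaxis(shap_values, -1, 0)  # (classes, samples, features)

importance = np.abs(stacked).mean(axis=(0, 1))  # mean |SHAP| per gene
top_genes = np.argsort(importance)[::-1][:20]   # 20 most influential genes
print(top_genes)
```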
Show Figures

Figure 1. Flowchart of activities to obtain the most important characteristics in the classification.
Figure 2. Number of samples existing in the database before applying the undersampling balancing technique.
Figure 3. Training curves for different models: (a) training curve for the DT model; (b) training curve for the RF model; and (c) training curve for the XGB model.
Figure 4. SHAP summary plot for the decision tree multiclass classification model. The plot displays the contributions of features (genes) to the prediction of cancer types: breast cancer (BRCA), lung adenocarcinoma (LUAD), thyroid cancer (THCA), ovarian cancer (OV), and colon adenocarcinoma (COAD). Features are ranked by maximum average SHAP values, highlighting the most important genes for distinguishing between the classes.
Figure 5. SHAP summary plot for the random forest multiclass classification model. The plot displays the contributions of features (genes) to the prediction of cancer types: breast cancer (BRCA), lung adenocarcinoma (LUAD), thyroid cancer (THCA), ovarian cancer (OV), and colon adenocarcinoma (COAD). Features are ranked by maximum average SHAP values, highlighting the most important genes for distinguishing between the classes.
Figure 6. SHAP summary plot for the XGBoost multiclass classification model. The plot displays the contributions of features (genes) to the prediction of cancer types: breast cancer (BRCA), lung adenocarcinoma (LUAD), thyroid cancer (THCA), ovarian cancer (OV), and colon adenocarcinoma (COAD). Features are ranked by maximum average SHAP values, highlighting the most important genes for distinguishing between the classes.
33 pages, 3678 KiB  
Article
A Step Towards Neuroplasticity: Capsule Networks with Self-Building Skip Connections
by Nikolai A. K. Steur and Friedhelm Schwenker
AI 2025, 6(1), 1; https://doi.org/10.3390/ai6010001 - 24 Dec 2024
Viewed by 1003
Abstract
Background: Integrating nonlinear behavior into the architecture of artificial neural networks is regarded as an essential requirement for constituting their effectual learning capacity when solving complex tasks. This claim seems to be true for moderate-sized networks, i.e., with a lower double-digit number of layers. However, going deeper with neural networks regularly turns into destructive tendencies of gradual performance degeneration during training. To circumvent this degradation problem, the prominent neural architectures Residual Network and Highway Network establish skip connections with additive identity mappings between layers. Methods: In this work, we unify the mechanics of both architectures into Capsule Networks (CapsNet)s by showing their inherent ability to learn skip connections. As a necessary precondition, we introduce the concept of Adaptive Nonlinearity Gates (ANG)s which dynamically steer and limit the usage of nonlinear processing. We propose practical methods for the realization of ANGs including biased batch normalization, the Doubly-Parametric ReLU (D-PReLU) activation function, and Gated Routing (GR) dedicated to extremely deep CapsNets. Results: Our comprehensive empirical study using MNIST substantiates the effectiveness of our developed methods and delivers valuable insights for the training of very deep nets of any kind. The final experiments on Fashion-MNIST and SVHN demonstrate the potential of pure capsule-driven networks with GR. Full article
Figure 1
Visualization of the degradation problem in relation to network depth, based on (a) plain networks and (b) CapsNets with distinct activation functions, using the MNIST classification dataset. A plain network contains 32 neurons per layer, while a CapsNet consists of eight capsules with four neurons each. Network depth is stated as the number of intermediate blocks, including an introducing convolutional layer and a closing classification head. Each block consists of a fully connected layer followed by BN and the application of the activation function. In the case of CapsNets, signal flow between consecutive capsule layers is controlled by a specific routing procedure. The final loss (as cross-entropy) and accuracy, both based on the training set, are reported as an average over five runs with random network initialization. Each run comprises 2n training epochs, where n equals the number of intermediate blocks.
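To make the block structure concrete, here is a minimal PyTorch sketch of such a plain network, substituting a linear stem for the introducing convolutional layer mentioned in the caption; the input dimension and class count are illustrative assumptions.

```python
# Minimal sketch of the plain-network blocks described in the caption:
# each intermediate block is a fully connected layer, batch norm, then
# the activation. Width (32 neurons) follows the caption; the linear
# stem and MNIST-like dimensions are illustrative simplifications.
import torch.nn as nn

def plain_network(depth: int, width: int = 32,
                  in_dim: int = 784, num_classes: int = 10) -> nn.Sequential:
    layers = [nn.Flatten(), nn.Linear(in_dim, width)]
    for _ in range(depth):
        layers += [nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU()]
    layers.append(nn.Linear(width, num_classes))  # classification head
    return nn.Sequential(*layers)
```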
Figure 2
Shortcut and skip connections (highlighted in red) in residual learning. (a) Original definition of a shortcut connection with projection matrix, based on [5]. (b) Pattern for self-building skip connections in a CapsNet with SR and an activation function with a suitable linear interval.
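Subfigure (a) corresponds to the familiar residual form y = F(x) + W_s x; below is a minimal sketch, with the projection W_s applied only when input and output dimensions differ. The two-layer body is an illustrative choice, not the paper's exact block.

```python
# Sketch of a shortcut connection with an optional projection matrix,
# following the residual-learning formulation the caption cites as [5].
import torch
import torch.nn as nn

class ShortcutBlock(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))
        # Projection W_s only when dimensions differ, else identity skip.
        self.proj = (nn.Identity() if in_dim == out_dim
                     else nn.Linear(in_dim, out_dim, bias=False))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x) + self.proj(x)
```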
Figure 3
Replacement of the static signal propagation in a CapsNet with a nonlinear routing procedure to form parametric information flow gates. (a) Basic pattern with a single routing gate. (b) Exemplary skip path (highlighted in red) crossing multiple layers and routing gates.
Figure 4
Customizing the initialization scheme for BN(β, γ) allows the training of deeper networks by constraining the input distribution (in blue) of an activation function to lie in a mostly linear section. Exemplary initializations are shown for (a) sigmoid with BN(0, 0.5) and (b) Leaky ReLU with BN(−2, 1).
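This scheme amounts to initializing BatchNorm's affine parameters (weight = γ, bias = β) to the stated constants; a short PyTorch sketch follows, where the layer width is an arbitrary example.

```python
# Sketch of the BN(beta, gamma) initialization idea from the caption:
# set BatchNorm's affine parameters so pre-activations land in a mostly
# linear region of the nonlinearity.
import torch.nn as nn

def init_bn(bn: nn.BatchNorm1d, beta: float, gamma: float) -> nn.BatchNorm1d:
    nn.init.constant_(bn.bias, beta)     # shift: beta
    nn.init.constant_(bn.weight, gamma)  # scale: gamma
    return bn

bn_sigmoid = init_bn(nn.BatchNorm1d(32), beta=0.0, gamma=0.5)   # BN(0, 0.5)
bn_leaky   = init_bn(nn.BatchNorm1d(32), beta=-2.0, gamma=1.0)  # BN(-2, 1)
```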
Figure 5
Parametric versions of ReLU with (a) one and (b) four degrees of freedom, using an exemplary parameter range of ρi ∈ [0, 1]. (a) PReLU learns a nonlinearity specification ρ for input values below zero and directly passes signals above zero. (b) SReLU applies the identity function within the interval [t_min, t_max], and learns two individual nonlinearity specifications ρ1 and ρ2 outside of the centered interval.
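For concreteness, here is a sketch of the SReLU behavior described in (b): identity inside [t_min, t_max] and learnable slopes ρ1, ρ2 outside. The fixed thresholds and initial slope values are assumptions for illustration; the paper's exact treatment may differ.

```python
# Hedged sketch of SReLU: identity inside [t_min, t_max], learnable
# slopes rho1/rho2 outside. Thresholds are fixed here for brevity.
import torch
import torch.nn as nn

class SReLU(nn.Module):
    def __init__(self, t_min: float = -1.0, t_max: float = 1.0):
        super().__init__()
        self.t_min, self.t_max = t_min, t_max
        self.rho1 = nn.Parameter(torch.tensor(0.2))  # slope below t_min
        self.rho2 = nn.Parameter(torch.tensor(0.2))  # slope above t_max

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Piecewise-linear, continuous at both thresholds.
        below = self.t_min + self.rho1 * (x - self.t_min)
        above = self.t_max + self.rho2 * (x - self.t_max)
        return torch.where(x < self.t_min, below,
               torch.where(x > self.t_max, above, x))
```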
Figure 6
(a) Generic model architecture with (b) a one-layer Feature Extractor (FE), a classification head with z classes, and (c) intermediate blocks consisting of fully-connected layers. Dense blocks are specified via capsules or scalar neurons (plain) for the fully-connected units.
Figure 7
First two rows: mean (first row) and best (second row) training loss progressions over five runs for each BN(β, γ) initialization scheme per activation function. Last two rows: mean deviation per BN layer of the final βi and γi parameters from their initial values, using the identified superior BN initialization scheme for each activation function. Per plot, the model parameter deviations are shown for the best run and as an average over all five runs.
Figure 8
(a) Mean and (b) best training loss development over five runs using 90 intermediate blocks, AMSGrad, and the superior BN(β, γ) initialization strategy per activation function. Both subfigures provide an inset as a zoom-in on tight regions.
Figure 9
(a) Percentage gain in accuracy over the remaining epochs, measured relative to the final accuracy; gains below one percentage point (red line) are shown in gray. (b) Mean training loss development over five runs for varying network depths using ReLU, AMSGrad, and the BN(2, 1) initialization strategy.
Figure 10
Each row summarizes the experimental results for the parametric activation functions PReLU, SReLU/D-PReLU, and APLU, respectively. First two columns: mean (first column) and best (second column) training loss development over five runs using AMSGrad and varying initialization strategies for BN(β, γ) and the activation function parameters. Insets provide zoom-ins on tight regions. Middle two columns: mean parameter deviations per layer from their initial values with respect to BN and the parametric activation function, in each case using the identified superior configuration strategy. For APLU, the configuration with s = 1 is preferred over s = 5 for the benefit of proper visualization. Last column: mean training loss progress over five runs for varying network depths using the identified superior configuration strategy.
Figure 11
Mean training loss development over five runs using CapsNets with a depth of 500 intermediate blocks and varying routing procedures, activation functions, and BN initializations.
Figure 12
(a) Mean training (solid) and validation (dotted) loss progressions over five runs for the pure capsule-driven architecture. (b) Mean deviation of the GR bias parameters after training from their initial value of −3.
Figure A1
Row-wise 20 random samples for each dataset in Table A1.
Figure A2
Final training loss (left) and training accuracy (right), averaged over five runs, using CapsNets with increasing network depth and distinct configurations.
Figure A3
Convolutional capsule unit with GR between two layers of identical dimensionality, and image downsampling using grouped convolutions.