Search Results (415)

Search Parameters:
Keywords = speech emotions

18 pages, 1911 KiB  
Article
Enhancing Embedded Space with Low–Level Features for Speech Emotion Recognition
by Lukasz Smietanka and Tomasz Maka
Appl. Sci. 2025, 15(5), 2598; https://doi.org/10.3390/app15052598 - 27 Feb 2025
Viewed by 237
Abstract
This work proposes an approach that builds a feature space by combining the representation obtained in an unsupervised learning process with manually selected features defining the prosody of the utterances. In the experiments, we used two time-frequency representations (Mel and CQT spectrograms) and the EmoDB and RAVDESS databases. As the results show, the proposed system improved the classification accuracy of both representations: 1.29% for CQT and 3.75% for the Mel spectrogram compared to the typical CNN architecture for the EmoDB dataset, and 3.02% for CQT and 0.63% for the Mel spectrogram in the case of RAVDESS. Additionally, the results show a significant increase of around 14% in classification performance for the happiness and disgust emotions using Mel spectrograms and around 20% for happiness and disgust using CQT for the best models trained on EmoDB. On the other hand, in the case of models that achieved the highest result for the RAVDESS database, the most significant improvement was observed in the classification of the neutral state, around 16%, using the Mel spectrogram. For the CQT representation, the most significant improvement occurred for fear and surprise, around 9%. Additionally, the average results for all prepared models showed the positive impact of the method on the classification quality of most emotional states. For the EmoDB database, the highest average improvement was observed for happiness (14.6%); for the other emotions, it ranged from 1.2% to 8.7%. The only exception was sadness, for which the average classification quality decreased by 1% when using the Mel spectrogram. In turn, for the RAVDESS database, the most significant improvement also occurred for happiness (7.5%), while for the other emotions it ranged from 0.2% to 7.1%, except for disgust and calm, the classification of which deteriorated for the Mel spectrogram and the CQT representation, respectively.
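The fusion idea can be illustrated with a short sketch. This is a minimal illustration under assumed details, not the authors' implementation: it computes the two time-frequency representations named in the abstract, summarises prosody with a few F0 statistics as a stand-in for the paper's hand-picked low-level features, and concatenates them with a placeholder embedding (the paper's injection mechanism is learned, whereas plain concatenation here only illustrates the combined feature space).

# Minimal sketch (assumptions, not the authors' code): fuse an unsupervised-style
# embedding with hand-picked prosodic features. F0 statistics stand in for the
# paper's low-level features; the "embedding" is a placeholder for an encoder output.
import numpy as np
import librosa

def low_level_prosody(y, sr):
    """Summarise prosody with F0 statistics (mean, std, voiced ratio)."""
    f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                      fmax=librosa.note_to_hz("C7"), sr=sr)
    f0 = f0[~np.isnan(f0)]
    if f0.size == 0:
        return np.zeros(3, dtype=np.float32)
    return np.array([f0.mean(), f0.std(), voiced_flag.mean()], dtype=np.float32)

def time_frequency_inputs(y, sr):
    """The two representations used in the paper: Mel and CQT spectrograms."""
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64))
    cqt = librosa.amplitude_to_db(np.abs(librosa.cqt(y=y, sr=sr)))
    return mel, cqt

# Synthetic 1 s, 220 Hz tone as placeholder audio (any mono waveform works here).
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220 * t)

mel, cqt = time_frequency_inputs(y, sr)
embedding = mel.mean(axis=1)            # placeholder for a learned embedding
fused = np.concatenate([embedding, low_level_prosody(y, sr)])
print(fused.shape)                      # embedding dimensions + 3 prosodic features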
Figure 1. The proposed architecture of the injection mechanism.
Figure 2. Low-level features (γ₂) used in the injection process.
Figure 3. Audio representation for the selected recording (03a04Nc.wav) with a neutral emotional state. From top to bottom: CQT spectrogram, Mel spectrogram, fundamental frequency, and formants.
Figure 4. The distribution of recordings in the EmoDB database.
Figure 5. Accuracy distribution for models trained on CQT and Mel representations without the injection mechanism for EmoDB (a) and RAVDESS (b).
Figure 6. Accuracy distribution for models trained on CQT and Mel representations with the injection mechanism for EmoDB (a) and RAVDESS (b).
Figure 7. The classification accuracy relationship with the target feature space size and β coefficient for models without (a,c) and with (b,d) the injection mechanism. The top row is for the EmoDB database, and the bottom is for the RAVDESS database.
Figure 8. Confusion matrices for the best models trained on CQT representations without (a) and with (b) the injection mechanism, for EmoDB.
Figure 9. Confusion matrices for the best models trained on Mel spectrograms without (a) and with (b) the injection mechanism, for EmoDB.
Figure 10. Confusion matrices for the best models trained on CQT representations without (a) and with (b) the injection mechanism, for RAVDESS.
Figure 11. Confusion matrices for the best models trained on Mel representations without (a) and with (b) the injection mechanism, for RAVDESS.
Figure 12. Accuracy distribution of individual emotions for models with the M size for which the highest improvement occurred; models without the injection mechanism trained on CQT and Mel spectrograms for the EmoDB (a,b) and RAVDESS databases.
Figure 13. Accuracy distribution of individual emotions for models with the M size for which the highest improvement occurred; models with the injection mechanism trained on CQT and Mel spectrograms for the EmoDB (a,b) and RAVDESS databases.
19 pages, 3137 KiB  
Article
Investigating Neurophysiological, Perceptual, and Cognitive Mechanisms in Misophonia
by Chhayakanta Patro, Emma Wasko, Prashanth Prabhu and Nirmal Kumar Srinivasan
Biology 2025, 14(3), 238; https://doi.org/10.3390/biology14030238 - 26 Feb 2025
Viewed by 248
Abstract
Misophonia is a condition characterized by intense, involuntary distress or anger in response to specific sounds, often leading to irritation or aggression. While the condition is recognized for its emotional and behavioral impacts, little is known about its physiological and perceptual effects. The current study aimed to explore the physiological correlates and perceptual consequences of misophonia through a combination of electrophysiological, perceptual, and cognitive assessments. Seventeen individuals with misophonia and sixteen control participants without the condition were compared. Participants completed a comprehensive battery of tests, including (a) cortical event-related potentials (ERPs) to assess neural responses to standard and deviant auditory stimuli, (b) the spatial release from speech-on-speech masking (SRM) paradigm to evaluate speech segregation in background noise, and (c) the flanker task to measure selective attention and cognitive control. The results revealed that individuals with misophonia exhibited significantly smaller mean peak amplitudes of the N1 and N2 components in response to oddball tones compared to controls. This suggests a potential underlying neurobiological deficit in misophonia patients, as these components are associated with early auditory processing. However, no significant differences between the groups were observed in the P1 and P2 components for oddball tones or in any ERP components in response to standard tones. Despite these altered neural responses, the misophonia group did not show differences in hearing thresholds, speech perception abilities, or cognitive function compared to the controls. These findings suggest that while misophonia may involve distinct neurophysiological changes, particularly in early auditory processing, it does not necessarily lead to perceptual deficits in speech perception or cognitive function.
(This article belongs to the Special Issue Neural Correlates of Perception in Noise in the Auditory System)
Figure 1. Audiometric characteristics: (A) Individual and mean audiograms (averaged across ears) for the control group (A) and misophonia group (B). Thin colored lines represent individual thresholds, while bold lines indicate the group mean thresholds. (C) Comparison of individual and mean pure-tone average thresholds between the control and misophonia groups.
Figure 2. Misophonia severity: Individual data distribution (unfilled circles) and group mean A-MISO-S score (filled circle), highlighting the variability within the sample and the average severity. The error bars indicate ±1 standard error (SE). The degree of misophonia severity, as defined by Schröder et al. [8], is also included for reference.
Figure 3. ERP amplitude: Mean ERP amplitudes for the control and misophonia groups obtained using (A) standard stimuli and (B) deviant stimuli. Groups are color-coded, with the control group in black and the misophonia group in red. (C) Individual data distribution (unfilled circles) and mean ΔN1 amplitudes, and (D) individual data distribution and mean MMN amplitudes for the misophonia and control groups. Error bars represent ±1 SE.
Figure 4. ERP latency: Mean ERP latencies for the control and misophonia groups obtained using (A) standard stimuli and (B) deviant stimuli. (C) Individual data distribution (unfilled circles) and mean ΔN1 latencies, and (D) individual data distribution and mean MMN latencies for the misophonia and control groups. Error bars represent ±1 SE.
Figure 5. Spatial release from masking (SRM): Mean TMRs for colocated conditions (target and maskers at 0° azimuth) and spatially separated conditions (target at 0° and maskers at ±15° azimuths) for the control and misophonia groups. The amount of SRM for both groups is also presented. Error bars represent ±1 SE.
Figure 6. Flanker task: Mean completion times for the congruent and incongruent sections of the flanker task, along with the conflict cost (the difference in completion times between these two conditions). Error bars represent ±1 SE.
Figure 7. Multiple correlations: A heat map illustrating the strength of associations between the various measures used in this study. Significant correlations are denoted by asterisks (*** p < 0.001 and ** p < 0.01) at the significance level uncorrected for multiple comparisons; nonsignificant correlations are unmarked.
18 pages, 913 KiB  
Article
Improving Stuttering Through Augmented Multisensory Feedback Stimulation
by Giovanni Muscarà, Alessandra Vergallito, Valentina Letorio, Gaia Iannaccone, Martina Giardini, Elena Randaccio, Camilla Scaramuzza, Cristina Russo, Maria Giovanna Scarale and Jubin Abutalebi
Brain Sci. 2025, 15(3), 246; https://doi.org/10.3390/brainsci15030246 - 25 Feb 2025
Viewed by 328
Abstract
Background/Objectives: Stuttering is a speech disorder involving fluency disruptions like repetitions, prolongations, and blockages, often leading to emotional distress and social withdrawal. Here, we present Augmented Multisensory Feedback Stimulation (AMFS), a novel personalized intervention to improve speech fluency in people who stutter (PWS). AMFS includes a five-day intensive phase aiming at acquiring new skills, plus a reinforcement phase designed to facilitate the transfer of these skills across different contexts and their automatization into effortless behaviors. The concept of our intervention derives from the prediction of the neurocomputational model Directions into Velocities of Articulators (DIVA). The treatment applies dynamic multisensory stimulation to disrupt PWS’ maladaptive over-reliance on sensory feedback mechanisms, promoting the emergence of participants’ natural voices. Methods: Forty-six PWS and a control group, including twenty-four non-stuttering individuals, participated in this study. Stuttering severity and physiological measures, such as heart rate and electromyographic activity, were recorded before and after the intensive phase and during the reinforcement stage in the PWS but only once in the controls. Results: The results showed a significant reduction in stuttering severity at the end of the intensive phase, which was maintained during the reinforcement training. Crucially, worse performance was found in PWS than in the controls at baseline but not after the intervention. In the PWS, physiological signals showed a reduction in activity during the training phases compared to baseline. Conclusions: Our findings show that AMFS provides a promising approach to enhancing speech fluency. Future studies should clarify the mechanisms underlying such intervention and assess whether effects persist after the treatment conclusion. Full article
(This article belongs to the Special Issue Latest Research on the Treatments of Speech and Language Disorders)
Figure 1. Some of the equipment used for AMFS. The left image shows a participant with biofeedback sensors during a fluency assessment session. The right image shows a participant during a rehabilitation session in the post-intensive phase, which includes auditory (microphone and headphones), visual (monitor), and vibrotactile (vest) stimulation.
Figure 2. Performance levels in SSI-4 total scores. The y-axis represents the number of participants at each time point (x-axis) and their levels of performance, depicted through different colors. The color of each bar identifies a gradient of disease: green represents a disease score equal to 0, shades of blue represent scores between 1 and 15 (lighter blue indicates a lower score, darker blue a higher score within the 1–15 range), and red represents a disease score equal to 16.
29 pages, 3263 KiB  
Article
Gamified Engagement for Data Crowdsourcing and AI Literacy: An Investigation in Affective Communication Through Speech Emotion Recognition
by Eleni Siamtanidou, Lazaros Vrysis, Nikolaos Vryzas and Charalampos A. Dimoulas
Societies 2025, 15(3), 54; https://doi.org/10.3390/soc15030054 - 22 Feb 2025
Viewed by 296
Abstract
This research investigates the utilization of entertainment approaches, such as serious games and gamification technologies, to address various challenges and implement targeted tasks. Specifically, it details the design and development of an innovative gamified application named “J-Plus”, aimed at both professionals and non-professionals in journalism. This application facilitates the enjoyable, efficient, and high-quality collection of emotionally tagged speech samples, enhancing the performance and robustness of speech emotion recognition (SER) systems. Additionally, these approaches offer significant educational benefits, providing users with knowledge about emotional speech and artificial intelligence (AI) mechanisms while promoting digital skills. This project was evaluated by 48 participants, with 44 engaging in quantitative assessments and 4 forming an expert group for qualitative methodologies. This evaluation validated the research questions and hypotheses, demonstrating the application’s diverse benefits. Key findings indicate that gamified features can effectively support learning and attract users, with approximately 70% of participants agreeing that serious games and gamification could enhance their motivation to practice and improve their emotional speech. Additionally, 50% of participants identified social interaction features, such as collaboration, as most beneficial for fostering motivation and commitment. The integration of these elements supports reliable and extensive data collection and the advancement of AI algorithms while concurrently developing various skills, such as emotional speech articulation and digital literacy. This paper advocates for the creation of collaborative environments and digital communities through crowdsourcing, balancing technological innovation in the SER sector. Full article
Figure 1. The design of the gamified application incorporates SER and crowdsourcing systems with the objective of enhancing and augmenting the SER database. The blueprint illustrates the potential for the creation of digital societies around this design.
Figure 2. The combination of three categories of emotional speech data collection serves to qualitatively enrich the database and achieve the maximum performance of the SER systems.
Figure 3. The original user interfaces (a,b) and the modules designated as the "news anchor" and "emotional diary" ((c) and (d), respectively). The presented high-fidelity prototypes represent the initial version of the gamified application.
Figure 4. The main "J-Plus" interfaces: (a) start screen; (b) main menu screen (about; game; news anchor; emotional diary; profile).
Figure 5. The "J-Plus" interfaces accessible via the "about" section, which provides detailed information regarding the utilization and functionality of each discrete module.
Figure 6. The main screens of the "J-Plus" application. These two screens relate to the "news anchor" category, offering the options of "practice in journalistic speech" and "recognition of journalists' emotional speech".
Figure 7. Two screenshots from the "J-Plus" application, illustrating the "emotional diary" and "profile" sections.
20 pages, 10576 KiB  
Article
Clinical Research on Positron Emission Tomography Imaging of the Neuro-Stimulation System in Patients with Cochleo-Vestibular Implants: Is There a Response Beyond the Peripheral Organ?
by Joan Lorente-Piera, Elena Prieto, Ángel Ramos de Miguel, Manuel Manrique, Nicolás Pérez-Fernández, Ángel Ramos Macías, Jaime Monedero Afonso, Alina Sanfiel Delgado, Jorge Miranda Ramos, Paula Alonso Alonso, Javier Arbizu and Raquel Manrique-Huarte
J. Clin. Med. 2025, 14(5), 1445; https://doi.org/10.3390/jcm14051445 - 21 Feb 2025
Viewed by 226
Abstract
Introduction: In patients refractory to vestibular rehabilitation in the management of bilateral vestibulopathy, the cochleo-vestibular implant has emerged as a viable alternative to enhance both audiovestibular function and quality of life. The main objective of this study is to pioneer the use of PET to assess cortical modifications in patients with cochleo-vestibular implants, aiming to evaluate the safety and functional improvements in individuals with bilateral vestibulopathy and severe to profound hearing loss. Methods: A phase I pilot clinical trial was conducted with participants who received a BIONIC-VEST CI24RE cochleo-vestibular implant, with pre- and post-implantation assessments conducted for twelve months. Audiovestibular testing and two PET studies with 18F-FDG under baseline conditions and with active stimulus to observe cortical-level differences were performed. Results: Five patients were included in the study, all of them treated with a cochleo-vestibular implant, none of whom presented postoperative adverse effects. Audiologically, the mean post-implant gain was 56.63 ± 14.53 dB and 50.40 ± 35.54% in terms of speech intelligibility. From a vestibular perspective, the most remarkable findings were observed at the graviceptive pathway level, where a mean posturographic improvement was observed, with a sensory organization test score of 24.20 ± 13.74 and a subjective visual vertical of 1.57° ± 0.79°, achieving, in most cases, results within the normal range (<2.3°) by the end of the follow-up. PET images confirmed that with the electrical stimulus active (implant ON), there was a supratentorial activation pattern, particularly in areas related to somatosensory integration, emotional regulation, and autonomic control. Conclusions: The BIONIC-VEST implant significantly improved the vestibular system, particularly the graviceptive pathway, enhancing balance and SVV and reducing fall risk. PET revealed distinct uptake patterns in baseline and activated conditions, highlighting a cortical-level response with the use of the cochleo-vestibular implant. Full article
(This article belongs to the Special Issue Current Updates on the Inner Ear)
Figure 1. Summary of the postoperative follow-up conducted on the patients included in the study.
Figure 2. Summary of the PET subtraction algorithm applied in our study. Subtraction maps are presented over a standard MRI image.
Figure 3. Cortical representation and summary of the different areas studied in the clinical trial using PET-CT imaging.
Figure 4. Progression of auditory performance recorded in the PTA (left panel) and rate of discrimination (right panel) in the ipsilateral ears of each patient included in the study.
Figure 5. (A) Summary of the different gains recorded in the vHIT in patients from the trial, analyzing the three different canals separately. The blue colors represent the ear ipsilateral to the cochleo-vestibular implant, while the purple colors represent the contralateral ears. (B) Example of vHIT results for the lateral semicircular canals of one of the patients in the trial, showing two scenarios: the top image corresponds to the pre-implantation phase, and the bottom image shows the post-implantation phase with the cochleo-vestibular implant (CVI). Despite the lack of improvement in gain, clear refixation saccade phenomena can be observed. (C) Representation of the evolution of the different quotients included in the SOT. Dark blue corresponds to the pre-implantation moment and light blue to post-implantation. SOMATO: somatosensorial; VESTIB: vestibular; PREF: visual preference.
Figure 6. Representation of the evolution of SVV in the different subjects. Dark blue corresponds to the pre-implantation moment and light blue to post-implantation.
Figure 7. Subtraction PET images illustrating the metabolic changes between baseline conditions and electrical stimulation. The yellow circle in patient 3 represents notable uptake at the ipsilateral frontal operculum related to the implant, the blue circle in the fourth patient shows a decrease in uptake in the ipsilateral precentral gyrus, and the yellow circle in the fifth subject indicates an increase in uptake with the presence of electrical stimulation in the ipsilateral inferior frontal gyrus.
21 pages, 7761 KiB  
Article
Acoustic Feature Excitation-and-Aggregation Network Based on Multi-Task Learning for Speech Emotion Recognition
by Xin Qi, Qing Song, Guowei Chen, Pengzhou Zhang and Yao Fu
Electronics 2025, 14(5), 844; https://doi.org/10.3390/electronics14050844 - 21 Feb 2025
Viewed by 154
Abstract
In recent years, substantial research has focused on emotion recognition using multi-stream speech representations. In existing multi-stream speech emotion recognition (SER) approaches, effectively extracting and fusing speech features is crucial. To overcome the bottleneck in SER caused by the fusion of inter-feature information, including challenges like modeling complex feature relations and the inefficiency of fusion methods, this paper proposes an SER framework based on multi-task learning, named AFEA-Net. The framework consists of speech emotion alignment learning (SEAL), an acoustic feature excitation-and-aggregation (AFEA) mechanism, and a continuity learning strategy. First, SEAL aligns sentiment information between WavLM and Fbank features. Then, we design an acoustic feature excitation-and-aggregation mechanism to adaptively calibrate and merge the two features. Furthermore, we introduce a continuity learning strategy to explore the distinctiveness and complementarity of dual-stream features from intra- and inter-speech. Experimental results on the publicly available IEMOCAP and RAVDESS sentiment datasets show that our proposed approach outperforms state-of-the-art SER approaches. Specifically, we achieve 75.1% WA, 75.3% UAR, 76% precision, and 75.4% F1-score on IEMOCAP, and 80.3%, 80.6%, 80.8%, and 80.4% on RAVDESS, respectively.
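As a rough illustration of the excitation-and-aggregation idea, the sketch below gates each of two feature streams (e.g., WavLM and Fbank projections of equal dimension) with weights computed from both streams and then mixes the calibrated streams. It is an assumption-laden simplification, not the published AFEA module.

# Minimal sketch (assumptions, not the published AFEA-Net): gated excitation and
# aggregation of two feature streams projected to a common dimension d.
import torch
import torch.nn as nn

class ExciteAndAggregate(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.excite_wav = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())
        self.excite_fil = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())
        self.mix = nn.Linear(2 * d, 2)  # produces per-stream aggregation weights

    def forward(self, f_wav: torch.Tensor, f_fil: torch.Tensor) -> torch.Tensor:
        concat = torch.cat([f_wav, f_fil], dim=-1)
        # Excitation: each stream is re-scaled by gates computed from both streams.
        e_wav = self.excite_wav(concat) * f_wav
        e_fil = self.excite_fil(concat) * f_fil
        # Aggregation: convex combination of the two calibrated streams.
        w = torch.softmax(self.mix(torch.cat([e_wav, e_fil], dim=-1)), dim=-1)
        return w[..., :1] * e_wav + w[..., 1:] * e_fil

fused = ExciteAndAggregate(d=256)(torch.randn(8, 256), torch.randn(8, 256))
print(fused.shape)  # torch.Size([8, 256])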
Figure 1. The proposed AFEA-Net architecture, mainly comprising SEAL, multi-layer AFEA modules (l denotes the number of layers), and continuity learning (the dashed section). In this setup, F_wav^l and F_fil^l are derived by averaging the fusion feature with F_wav^(l-1) and F_fil^(l-1) from the preceding layer of AFEA modules.
Figure 2. Label value in SEAL. The horizontal axis represents WavLM, while the vertical axis represents Fbank. If the emotion labels are identical, C_{i,j} = 1; otherwise, C_{i,j} = 0.
Figure 3. Architecture of the AFEA mechanism, comprising the inter-speech excitation block (ISE) and the inter-speech aggregation block (ISA). S_concat and I_concat represent concatenation features, G_wav and G_fil are mapped features, A_wav and A_fil are weight matrices, and E_wav and E_fil are excitation matrices. I_wav and I_fil are output features of ISE. W_wav and W_fil are weight vectors. Finally, the two features are fused. C represents the concatenation operation, and σ represents the Sigmoid activation function.
Figure 4. Accuracy of different margin values on IEMOCAP and RAVDESS.
Figure 5. Accuracy of different α, β, and γ values on IEMOCAP and RAVDESS.
Figure 6. Confusion matrix on the IEMOCAP and RAVDESS datasets: (a) IEMOCAP; (b) RAVDESS (0-sad, 1-hap, 2-ang, 3-neu, 4-calm, 5-fea, 6-dis, 7-sur).
Figure 7. Feature distribution on IEMOCAP: (a) initial state; (b) after AFEA-Net.
Figure 8. Feature distribution on RAVDESS: (a) initial state; (b) after AFEA-Net.
15 pages, 587 KiB  
Systematic Review
AI Applications to Reduce Loneliness Among Older Adults: A Systematic Review of Effectiveness and Technologies
by Yuyi Yang, Chenyu Wang, Xiaoling Xiang and Ruopeng An
Healthcare 2025, 13(5), 446; https://doi.org/10.3390/healthcare13050446 - 20 Feb 2025
Viewed by 374
Abstract
Background/Objectives: Loneliness among older adults is a prevalent issue, significantly impacting their quality of life and increasing the risk of physical and mental health complications. The application of artificial intelligence (AI) technologies in behavioral interventions offers a promising avenue to overcome challenges in designing and implementing interventions to reduce loneliness by enabling personalized and scalable solutions. This study systematically reviews the AI-enabled interventions in addressing loneliness among older adults, focusing on the effectiveness and underlying technologies used. Methods: A systematic search was conducted across eight electronic databases, including PubMed and Web of Science, for studies published up to 31 January 2024. Inclusion criteria were experimental studies involving AI applications to mitigate loneliness among adults aged 55 and older. Data on participant demographics, intervention characteristics, AI methodologies, and effectiveness outcomes were extracted and synthesized. Results: Nine studies were included, comprising six randomized controlled trials and three pre–post designs. The most frequently implemented AI technologies included speech recognition (n = 6) and emotion recognition and simulation (n = 5). Intervention types varied, with six studies employing social robots, two utilizing personal voice assistants, and one using a digital human facilitator. Six studies reported significant reductions in loneliness, particularly those utilizing social robots, which demonstrated emotional engagement and personalized interactions. Three studies reported non-significant effects, often due to shorter intervention durations or limited interaction frequencies. Conclusions: AI-driven interventions show promise in reducing loneliness among older adults. Future research should focus on long-term, culturally competent solutions that integrate quantitative and qualitative findings to optimize intervention design and scalability. Full article
Figure 1. PRISMA flow diagram.
16 pages, 2242 KiB  
Article
Effective Data Augmentation Techniques for Arabic Speech Emotion Recognition Using Convolutional Neural Networks
by Wided Bouchelligua, Reham Al-Dayil and Areej Algaith
Appl. Sci. 2025, 15(4), 2114; https://doi.org/10.3390/app15042114 - 17 Feb 2025
Viewed by 337
Abstract
This paper investigates the effectiveness of various data augmentation techniques for enhancing Arabic speech emotion recognition (SER) using convolutional neural networks (CNNs). Utilizing the Saudi Dialect and BAVED datasets, we address the challenges of limited and imbalanced data commonly found in Arabic SER. To improve model performance, we apply augmentation techniques such as noise addition, time shifting, increasing volume, and reducing volume. Additionally, we examine the optimal number of augmentations required to achieve the best results. Our experiments reveal that these augmentations significantly enhance the CNN’s ability to recognize emotions, with certain techniques proving more effective than others. Furthermore, the number of augmentations plays a critical role in balancing model accuracy. The Saudi Dialect dataset achieved its best results with two augmentations (increasing volume and decreasing volume), reaching an accuracy of 96.81%. Similarly, the BAVED dataset demonstrated optimal performance with a combination of three augmentations (noise addition, increasing volume, and reducing volume), achieving an accuracy of 92.60%. These findings indicate that carefully selected augmentation strategies can greatly improve the performance of CNN-based SER systems, particularly in the context of Arabic speech. This research underscores the importance of tailored augmentation techniques to enhance SER performance and sets a foundation for future advancements in this field. Full article
(This article belongs to the Special Issue Natural Language Processing: Novel Methods and Applications)
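The augmentations examined in this paper (noise addition, time shifting, and volume changes) can be sketched in a few lines; the parameter values below are illustrative assumptions rather than the settings used in the study.

# Minimal sketch of the waveform augmentations discussed above; noise level,
# shift range, and gains are illustrative assumptions.
import numpy as np

def add_noise(y: np.ndarray, noise_factor: float = 0.005) -> np.ndarray:
    return y + noise_factor * np.random.randn(len(y))

def time_shift(y: np.ndarray, max_shift: int = 1600) -> np.ndarray:
    return np.roll(y, np.random.randint(-max_shift, max_shift))

def change_volume(y: np.ndarray, gain: float) -> np.ndarray:
    return y * gain  # gain > 1 increases volume, gain < 1 reduces it

audio = np.random.randn(16000).astype(np.float32)  # placeholder 1 s clip at 16 kHz
augmented = [add_noise(audio), time_shift(audio),
             change_volume(audio, 1.5), change_volume(audio, 0.7)]
print(len(augmented), "augmented variants per original clip")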
Figure 1. Original Saudi Dialect dataset distribution.
Figure 2. Original BAVED dataset distribution.
Figure 3. The flow of data preparation for SER.
Figure 4. Examples of audio files with data augmentation: (a) original audio for an angry emotion (01), (b) noise addition, (c) time shift, (d) increasing volume, and (e) reducing volume.
Figure 5. The block diagram of the MFCC computation.
Figure 6. The proposed SER architecture.
17 pages, 3001 KiB  
Article
Performance Improvement of Speech Emotion Recognition Using ResNet Model with Data Augmentation–Saturation
by Minjeong Lee and Miran Lee
Appl. Sci. 2025, 15(4), 2088; https://doi.org/10.3390/app15042088 - 17 Feb 2025
Viewed by 164
Abstract
Over the past five years, the proliferation of virtual reality platforms and the advancement of metahuman technologies have underscored the importance of natural interaction and emotional expression. As a result, there has been significant research activity focused on developing emotion recognition techniques based on speech data. Despite significant progress in emotion recognition research for the Korean language, a shortage of speech databases applicable to such research has been regarded as the most critical problem in this field, leading to overfitting issues in several models developed by previous studies. To address the issue of overfitting caused by limited data availability in the field of Korean speech emotion recognition (SER), this study focuses on integrating the data augmentation–saturation (DA-S) technique into a traditional ResNet model to enhance SER performance. The DA-S technique enhances data augmentation by adjusting the saturation of an image. We used 11,192 utterances provided by AI-HUB, which were converted into images to extract features such as the pitch and intensity of speech. The DA-S technique was then applied to this dataset, using weights of 0 and 2, to augment it to 33,576 utterances. This augmented dataset was utilized to classify four emotion categories: happiness, sadness, anger, and neutrality. The results of this study showed that the proposed model using the DA-S technique overcame the overfitting issues. Furthermore, its performance for SER increased by 34.19% compared to that of existing ResNet models not using the DA-S technique. This demonstrates that the DA-S technique effectively enhances model performance with limited data and may be applicable to specific areas such as stress monitoring and mental health support.
(This article belongs to the Special Issue Advanced Technologies and Applications of Emotion Recognition)
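A plausible reading of the DA-S step, sketched below as an assumption rather than the authors' code, is a saturation adjustment applied to the rendered spectrogram images with the reported weights (w = 0 and w = 2), which triples the number of training images per utterance.

# Minimal sketch (an assumption about how DA-S could be implemented, not the
# authors' code): augment a log mel-spectrogram image by adjusting its colour
# saturation with PIL, using the weights reported above (w = 0 and w = 2).
from PIL import Image, ImageEnhance

def saturation_variants(img, weights=(0.0, 2.0)):
    """Return the original image plus one variant per saturation weight."""
    return [img] + [ImageEnhance.Color(img).enhance(w) for w in weights]

# Placeholder image standing in for a rendered log mel-spectrogram.
spectrogram_img = Image.new("RGB", (256, 128), color=(30, 120, 200))
variants = saturation_variants(spectrogram_img)
print(len(variants), "images per utterance")  # original plus two DA-S variants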
Figure 1. Architecture of the model.
Figure 2. Frequency characteristics of a basilar membrane and a mel filter bank: (a) the structure of a basilar membrane; (b) the structure of a mel filter bank.
Figure 3. Comparative analysis of a spectrogram and a log mel-spectrogram: (a) a spectrogram; (b) a log mel-spectrogram (AIHub's Data ID: 5f5b466c2e23c7161acccecb).
Figure 4. Comparison between an original image and DA-S images: (a) the original image, (b) an image with 0% saturation (w = 0), (c) an image with 99.33% saturation (w = 2), and (d) an image with 99.64% saturation (w = 3).
Figure 5. Residual block.
Figure 6. Comparison of performance among ResNet models through k-fold cross-validation (k = 5) according to layer depth: (a) mean accuracy of the ResNet18 model; (b) mean accuracy of the ResNet34 model; (c) mean accuracy of the ResNet50 model.
Figure 7. Comparative analysis of the overfitting that occurred when using the original DB-T and when using the DB augmented with the DA-S technique: (a) mean accuracies calculated based on the original DB-T; (b) mean accuracies calculated based on DA-S (0%, w = 0) and DA-S (99.49%, w = 2).
21 pages, 313 KiB  
Article
Out of the Mouths of Babes: Black Children’s Experiences of Emotion-Focused Racial–Ethnic Socialization, Coping, and Antiracist Resistance
by Emilie Phillips Smith, Simone E. Bibbs, Deborah J. Johnson, Lekie Dwanyen, Kendal Holtrop and LaVelle Gipson-Tansil
Behav. Sci. 2025, 15(2), 222; https://doi.org/10.3390/bs15020222 - 16 Feb 2025
Viewed by 656
Abstract
Black children in the U.S. learn from scaffolded parental teachings to help manage racial discrimination. Middle childhood is an understudied developmental period for this research. This paper builds upon research on culturally informed practices Black caregivers use to rear their young with a healthy identity and the socio-emotional skills to navigate racism. Guided by a phenomenological qualitative approach, we conducted focus groups with 39 Black children (mean age = 7.67; 54% girls, 46% boys). Children reported that their parents imparted a sense of positive identity in terms of their cultural heritage, skin, and hair, areas in which they experienced frequent bullying. A unique aspect of our study is that Black children also reported learning emotion-centered coping strategies that focus on their inner strengths and private speech. They adopted a range of adaptive coping mechanisms such as kindness, ignoring perpetrators, centering their positive identity, identity framing, and fighting back. Through children's voices, we build upon previous research integrating racial–ethnic socialization (RES) with socio-emotional competencies in response to discrimination. We underscore the importance of exploring racial–ethnic identity development and socialization in childhood, a developmental period in which these processes are understudied.
28 pages, 9455 KiB  
Article
Advancing Emotionally Aware Child–Robot Interaction with Biophysical Data and Insight-Driven Affective Computing
by Diego Resende Faria, Amie Louise Godkin and Pedro Paulo da Silva Ayrosa
Sensors 2025, 25(4), 1161; https://doi.org/10.3390/s25041161 - 14 Feb 2025
Viewed by 387
Abstract
This paper investigates the integration of affective computing techniques using biophysical data to advance emotionally aware machines and enhance child–robot interaction (CRI). By leveraging interdisciplinary insights from neuroscience, psychology, and artificial intelligence, the study focuses on creating adaptive, emotion-aware systems capable of dynamically recognizing and responding to human emotional states. Through a real-world CRI pilot study involving the NAO robot, this research demonstrates how facial expression analysis and speech emotion recognition can be employed to detect and address negative emotions in real time, fostering positive emotional engagement. The emotion recognition system combines handcrafted and deep learning features for facial expressions, achieving an 85% classification accuracy during real-time CRI, while speech emotions are analyzed using acoustic features processed through machine learning models with an 83% accuracy rate. Offline evaluation of the combined emotion dataset using a Dynamic Bayesian Mixture Model (DBMM) achieved a 92% accuracy for facial expressions, and the multilingual speech dataset yielded 98% accuracy for speech emotions using the DBMM ensemble. Observations from psychological and technological aspects, coupled with statistical analysis, reveal the robot’s ability to transition negative emotions into neutral or positive states in most cases, contributing to emotional regulation in children. This work underscores the potential of emotion-aware robots to support therapeutic and educational interventions, particularly for pediatric populations, while setting a foundation for developing personalized and empathetic human–machine interactions. These findings demonstrate the transformative role of affective computing in bridging the gap between technological functionality and emotional intelligence across diverse domains. Full article
(This article belongs to the Special Issue Multisensory AI for Human-Robot Interaction)
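The offline ensemble evaluation combines per-modality classifiers. The sketch below shows a static weighted fusion of class posteriors, a simplified stand-in for the paper's Dynamic Bayesian Mixture Model, which additionally updates the weights over time; the class labels and weights here are illustrative.

# Minimal sketch (not the paper's DBMM): a static weighted fusion of per-classifier
# posteriors over emotion classes, which a dynamic Bayesian mixture generalises by
# updating the classifier weights over time.
import numpy as np

def fuse_posteriors(posteriors, weights):
    """posteriors: one (n_classes,) probability vector per base classifier."""
    mixed = sum(w * p for w, p in zip(weights, posteriors))
    return mixed / mixed.sum()  # renormalise to a probability vector

face_probs = np.array([0.70, 0.20, 0.10])    # e.g. happy / neutral / sad from the face model
speech_probs = np.array([0.40, 0.45, 0.15])  # same classes from the speech model
weights = np.array([0.6, 0.4])               # illustrative confidence weights
print(fuse_posteriors([face_probs, speech_probs], weights))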
Figure 1. Overview of a child–robot interaction session with specific frames from each defined stage.
Figure 2. Examples of facial landmark detection and expression recognition performed on images captured by the NAO robot's camera. The images are intentionally filtered to blur the children's faces, ensuring their privacy.
Figure 3. Facial expression distribution per stage, illustrating the distribution of emotions (happy, neutral, afraid, sad, and surprised) observed across the five stages of the child–robot interaction.
Figure 4. Speech emotion distribution per stage. Our approach could detect only neutral and positive emotions; other emotions like fear, sadness, and anger were not detected.
Figure 5. Text sentiment distribution per stage. Based on the sentiment analysis of text converted from speech, this figure shows the proportions of positive, neutral, and negative sentiments across the stages.
Figure 6. Summary of positive, neutral, and negative emotional states per stage, aggregating the results from facial expressions, speech emotions, and text sentiment into positive (happy and surprised), neutral, and negative (sad, afraid, angry, and disgusted) states.
Figure 7. Emotional response distribution (children vs. mothers).
Figure 8. Child emotional trends during interaction.
Figure 9. Parent–child emotional concordance.
Figure 10. Sample frames showcasing interactions between boys and girls with the NAO robot. The examples highlight their proxemics relative to the robot and instances where mothers participated during specific parts of the session. The images are captured from both the environment camera and the robot's onboard camera.
21 pages, 999 KiB  
Article
Harnessing Empathy: The Power of Emotional Resonance in Live Streaming Sales and the Moderating Magic of Product Type
by Shizhen Bai, Fang Jiang, Qiutong Li, Dingyao Yu and Yongbo Tan
J. Theor. Appl. Electron. Commer. Res. 2025, 20(1), 30; https://doi.org/10.3390/jtaer20010030 - 13 Feb 2025
Viewed by 572
Abstract
The emotional expressions in live streaming e-commerce have a strong contagious effect, enabling viewers to easily resonate with the specific emotions conveyed by streamers and to consciously build an empathy transmission chain. This study constructs a regression model based on emotional contagion theory and explores the impact of empathy between streamers and viewers on sales performance. Using data from 30 live streams, totaling 22,707 min, from one of China's most popular live streaming rooms, "East Buy", between February and April 2024, we demonstrate the significant positive impact of empathy between streamers and viewers on sales. Additionally, product type positively moderates this relationship. Unexpectedly, live streaming time does not significantly affect the relationship between empathy and sales. This study employs text sentiment analysis methods to extract emotional features from the streamers' speech and real-time comments from viewers. Our research extends the application of emotional contagion theory to the context of live-streaming e-commerce, enriches the literature on emotional interaction in service marketing, and provides practical insights for live-streaming platforms and streamers. Streamers can optimize marketing strategies and achieve sales goals by creating a more engaging and empathetic live-streaming experience.
(This article belongs to the Topic Interactive Marketing in the Digital Era)
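The moderation analysis described above can be sketched as an ordinary least-squares regression with an empathy-by-product-type interaction term; the column names and toy values below are assumptions, not the study's data.

# Minimal sketch of a moderated regression: sales regressed on an empathy score
# with a product-type interaction. Variable names and values are illustrative.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "sales":        [120, 340, 210, 500, 90, 410],
    "empathy":      [0.2, 0.8, 0.5, 0.9, 0.1, 0.7],  # streamer-viewer empathy score
    "product_type": [0, 1, 0, 1, 0, 1],              # assumed binary product category
})
model = smf.ols("sales ~ empathy * product_type", data=df).fit()
print(model.summary().tables[1])  # the interaction term captures the moderation effect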
Figure 1. Research model.
Figure 2. Empathy measurement.
19 pages, 1687 KiB  
Article
Impact of Gentle Touch Stimulation Combined with Advanced Sensory Stimulation in Patients in a Minimally Conscious State: A Quasi-Randomized Clinical Trial
by Mirjam Bonanno, Antonio Gangemi, Rosa Angela Fabio, Marco Tramontano, Maria Grazia Maggio, Federica Impellizzeri, Alfredo Manuli, Daniele Tripoli, Angelo Quartarone, Rosaria De Luca and Rocco Salvatore Calabrò
Life 2025, 15(2), 280; https://doi.org/10.3390/life15020280 - 11 Feb 2025
Viewed by 379
Abstract
Touch, particularly affective touch mediated by C-tactile fibers, plays a key role in emotional regulation and therapeutic interventions. However, tactile stimulation is underutilized in sensory stimulation (SS) protocols for brain injury patients, despite its potential to enhance consciousness and promote recovery through neural and autonomic regulation. Tools like the Neurowave enable advanced multisensory stimulation, including audio-visual and emotional inputs, but lack tactile components. Integrating gentle touch stimulation with such systems could further enhance neuroplasticity, improve heart rate regulation, and support recovery in patients with disorders of consciousness. In this study, twenty patients in a minimally conscious state (MCS) were divided into two groups: an experimental group (EG, n = 10) and a control group (CG, n = 10). Both groups underwent standard neurorehabilitation, including conventional physiotherapy and speech therapy. The key difference was the type of sensory stimulation. The EG received advanced sensory stimulation with the Neurowave system (which provides audio-visual and emotional sensory stimulation) in addition to gentle touch stimulation. The CG received conventional sensory stimulation without the Neurowave and neutral gentle touch stimulation. Each patient was evaluated by a multidisciplinary rehabilitation team using clinical scales, namely the coma recovery scale-revised (CRS-R) and the level of cognitive functioning (LCF) scale, before (T0) and after (T1) treatment. Additionally, heart rate (HR) and neurophysiological outcomes (P300) were recorded for both groups (EG and CG). The MANOVA model revealed a significant interaction effect between group and phase on P300 latency (F(1, 18) = 10.23, p < 0.001, η2 = 0.09), indicating that the intervention involving gentle touch stimulation significantly influenced P300 latency in the EG. The findings of this study contribute to our understanding of the therapeutic potential of emotional multisensory stimulation, which also includes gentle touch stimulation, in MCS rehabilitation. By demonstrating significant effects on both neurophysiological and functional measures, our results support the integration of tactile interventions into comprehensive neurorehabilitation programs.
(This article belongs to the Special Issue Innovative Perspectives in Physical Therapy and Health)
Figure 1. Schematic representation of audiovisual Neurowave stimulation plus gentle touch stimulation.
Figure 2. Changes in P300 latency between the pre-test (T0) and post-test (T1) phases in the EG and CG. The dotted trend line shows the significant difference in P300 latency between the groups at post-test (** p < 0.01), indicating that the EG exhibited a distinct pattern in P300 latency compared to the CG only at post-test.
Figure 3. Changes in HR between the pre-test (T0) and post-test (T1) phases in the EG and CG. The dotted trend line shows the significant difference in heart rate (HR) between the groups at post-test (** p < 0.01), indicating that the EG exhibited a distinct heart rate pattern compared to the CG only at post-test.
Figure 4. Changes in LCF between the pre-test (T0) and post-test (T1) phases in the EG and CG. The dotted trend line shows the significant difference in level of cognitive functioning (LCF) scores between the groups at post-test (** p < 0.01), indicating that the EG exhibited a distinct improvement in cognitive functioning compared to the CG only at post-test.
Figure 5. Changes in CRS-R between the pre-test (T0) and post-test (T1) phases in the EG and CG. The dotted trend line shows the significant difference in coma recovery scale-revised (CRS-R) scores between the groups at post-test (** p < 0.01), indicating that the EG exhibited a distinct recovery pattern compared to the CG only at post-test.
34 pages, 1734 KiB  
Review
Artificial Intelligence in Psychiatry: A Review of Biological and Behavioral Data Analyses
by İsmail Baydili, Burak Tasci and Gülay Tasci
Diagnostics 2025, 15(4), 434; https://doi.org/10.3390/diagnostics15040434 - 11 Feb 2025
Viewed by 1116
Abstract
Artificial intelligence (AI) has emerged as a transformative force in psychiatry, improving diagnostic precision, treatment personalization, and early intervention through advanced data analysis techniques. This review explores recent advancements in AI applications within psychiatry, focusing on EEG and ECG data analysis, speech analysis, natural language processing (NLP), blood biomarker integration, and social media data utilization. EEG-based models have significantly enhanced the detection of disorders such as depression and schizophrenia through spectral and connectivity analyses. ECG-based approaches have provided insights into emotional regulation and stress-related conditions using heart rate variability. Speech analysis frameworks, leveraging large language models (LLMs), have improved the detection of cognitive impairments and psychiatric symptoms through nuanced linguistic feature extraction. Meanwhile, blood biomarker analyses have deepened our understanding of the molecular underpinnings of mental health disorders, and social media analytics have demonstrated the potential for real-time mental health surveillance. Despite these advancements, challenges such as data heterogeneity, interpretability, and ethical considerations remain barriers to widespread clinical adoption. Future research must prioritize the development of explainable AI models, regulatory compliance, and the integration of diverse datasets to maximize the impact of AI in psychiatric care. Full article
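To make the review's discussion of EEG spectral analysis concrete, the following minimal sketch shows how canonical band powers are commonly extracted with Welch's method. It is not taken from the review; the sampling rate is an assumption, and a synthetic signal stands in for a real EEG channel.

```python
# Illustrative only: canonical EEG band powers via Welch's method,
# a common first step for the spectral features the review describes.
import numpy as np
from scipy.signal import welch
from scipy.integrate import simpson

fs = 256                               # assumed sampling rate (Hz)
rng = np.random.default_rng(0)
eeg = rng.standard_normal(fs * 60)     # 60 s of noise as a placeholder for one EEG channel

freqs, psd = welch(eeg, fs=fs, nperseg=fs * 4)

bands = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}
band_power = {}
for name, (lo, hi) in bands.items():
    mask = (freqs >= lo) & (freqs < hi)
    band_power[name] = simpson(psd[mask], x=freqs[mask])  # integrate PSD over the band

print({k: round(float(v), 4) for k, v in band_power.items()})
```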
Show Figures

Figure 1. ScienceDirect Data: Psychiatry and AI Topics.
Figure 2. PubMed Data: Psychiatry and AI Topics.
Figure 3. WOS Data: Psychiatry and AI Topics.
Figure 4. MDPI Data: Psychiatry and AI Topics.
22 pages, 5713 KiB  
Article
Impaired Prosodic Processing but Not Hearing Function Is Associated with an Age-Related Reduction in AI Speech Recognition
by Björn Herrmann and Mo Eric Cui
Audiol. Res. 2025, 15(1), 14; https://doi.org/10.3390/audiolres15010014 - 8 Feb 2025
Viewed by 342
Abstract
Background/Objectives: Voice artificial intelligence (AI) technology is becoming increasingly common. Recent work indicates that middle-aged to older adults are less able to identify modern AI speech compared to younger adults, but the underlying causes are unclear. Methods: The current study with younger and middle-aged to older adults investigated factors that could explain this age-related reduction in AI speech identification. Experiment 1 investigated whether high-frequency information in speech (to which middle-aged to older adults often have less access due to sensitivity loss at high frequencies) contributes to age-group differences. Experiment 2 investigated whether an age-related reduction in the ability to process prosodic information in speech predicts the reduction in AI speech identification. Results: Experiment 1 shows that middle-aged to older adults are less able to identify AI speech for both full-bandwidth speech and speech from which information above 4 kHz is removed, making a contribution of high-frequency hearing loss unlikely. Experiment 2 shows that the ability to identify AI speech is greater in individuals who also show a greater ability to identify emotions from prosodic speech information, after accounting for hearing function and self-rated experience with voice-AI systems. Conclusions: The current results suggest that the ability to identify AI speech is related to the accurate processing of prosodic information. Full article
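Experiment 1's 4 kHz manipulation lends itself to a brief illustration. The paper does not specify its filtering implementation; the sketch below uses a generic Butterworth low-pass filter, and the filter order, file names, and mono-input assumption are mine, not details from the study.

```python
# Hedged sketch of removing speech information above 4 kHz, in the spirit of
# Experiment 1's filtered condition. Filter design, order, and file names are
# assumptions, not details from the paper.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

fs, speech = wavfile.read("sentence.wav")   # hypothetical mono recording; assumes fs > 8 kHz
speech = speech.astype(np.float64)

sos = butter(8, 4000, btype="lowpass", fs=fs, output="sos")  # 4 kHz cutoff
filtered = sosfiltfilt(sos, speech)         # zero-phase filtering (no added group delay)

# Write back as 16-bit PCM; re-scaling details are omitted for brevity.
wavfile.write("sentence_lp4k.wav", fs, filtered.astype(np.int16))
```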
Show Figures

Figure 1. Sample stimulus representations. (A) Sample waveforms for the same sentence spoken by a human female speaker and synthesized using Google Wavenet, separately for the unfiltered original speech and the 4 kHz low-pass filtered speech. (B) Cochleograms [76] of the sentences in panel (A).
Figure 2. Hearing assessments, voice-AI experience, and speech intelligibility in Experiment 1. (A) Subjective (self-rated) hearing abilities (left) and problems (right). (B) Digits-in-noise thresholds. (C) Self-rated experience with voice-AI systems. (D) Proportion of correct word reports in the speech-intelligibility task. The label ‘mid-to-older’ refers to the group of middle-aged to older adults. * p ≤ 0.05. n.s. = not significant.
Figure 3. Performance in human vs. AI speech categorization. (A) Proportion of ‘human voice’ responses for human and Wavenet voices and for the unfiltered and filtered speech materials. Average responses across the 10 human voices and 10 Wavenet voices. Boxplots and data from each individual (dots) are displayed. (B) Proportion of ‘human voice’ responses, separately for the 10 human voices (5 male and 5 female) and the 10 Wavenet voices (5 male and 5 female). Error bars reflect the standard error of the mean. For human voices, abbreviations refer to the labels of the original stimulus recordings [73,74]. For Wavenet voices, the first two letters, ‘GW’, refer to Google Wavenet; the third letter, ‘F’ vs. ‘M’, refers to female and male, respectively; and the fourth letter refers to the specific Wavenet voice (https://cloud.google.com/text-to-speech/docs/voices, accessed on 1 January 2024). (C) Perceptual sensitivity across all voices. Boxplots and data from each individual (dots) are displayed. The label ‘mid-to-older’ refers to the group of middle-aged to older adults. * p ≤ 0.05. n.s. = not significant.
Figure 4. Results for the follow-up experiment and comparison to results from Experiment 1. (A) Proportion of ‘human voice’ responses for human and Wavenet speech and perceptual sensitivity (d-prime) for data from Experiment 1. After each sentence, participants provided verbatim word reports (intelligibility), followed by the speech-type categorization (human vs. computer-generated). (B) Same as for panel (A). In the follow-up experiment, after each sentence, participants categorized the speech first (human vs. computer-generated) and then provided verbatim word reports (intelligibility). The results are very similar. The label ‘mid-to-older’ refers to the group of middle-aged to older adults. * p < 0.05. n.s. = not significant.
Figure 5. Hearing assessments and voice-AI experience. (A) Subjective (self-rated) hearing abilities (left) and problems (right). (B) Digits-in-noise threshold. (C) Self-rated experience with voice-AI systems. In panels (A–C), boxplots and data from each individual (dots) are displayed. The label ‘mid-to-older’ refers to the group of middle-aged to older adults. * p ≤ 0.05. n.s. = not significant.
Figure 6. Performance in speech intelligibility and human vs. AI speech categorization. (A) Proportion of correctly reported words in the speech-intelligibility task. (B) (Left) Proportion of ‘human voice’ responses for human and Wavenet voices. Average responses across the 6 human voices and 6 Wavenet voices. (Right) Proportion of ‘human voice’ responses, separately for the 6 human voices (3 male and 3 female) and the 6 Wavenet voices (3 male and 3 female). Error bars reflect the standard error of the mean. For human voices, abbreviations refer to the labels of the original stimulus recordings [73,74]. For Wavenet voices, the first two letters, ‘GW’, refer to Google Wavenet; the third letter, ‘F’ vs. ‘M’, refers to female and male, respectively; and the fourth letter refers to the specific Wavenet voice (https://cloud.google.com/text-to-speech/docs/voices, accessed on 1 January 2024). (C) Perceptual sensitivity across voices. Boxplots and data from each individual (dots) are displayed. The label ‘mid-to-older’ refers to the group of middle-aged to older adults. * p ≤ 0.05. n.s. = not significant.
Figure 7. Results from the emotion-categorization task and correlation with speech-categorization performance. (A) Intelligibility for target words in the emotion-categorization task. (B) Performance in the emotion-categorization task. (C) Correlation between emotion-categorization performance and speech-categorization performance (human vs. computer voice), including all participants (left), younger adults (middle), or older adults (right). Data reflect the residuals after regressing out age group (left), digits-in-noise threshold, and voice-AI experience. The solid line reflects the best linear fit and the dashed lines the confidence interval. The label ‘mid-to-older’ refers to the group of middle-aged to older adults. * p ≤ 0.05.