Multimodal machine learning for language and speech markers identification in mental health
BMC Medical Informatics and Decision Making volume 24, Article number: 354 (2024)
Abstract
Background
There are numerous papers focusing on diagnosing mental health disorders using unimodal and multimodal approaches. However, our literature review shows that the majority of these studies either use unimodal approaches to diagnose a variety of mental disorders or employ multimodal approaches to diagnose a single mental disorder. In this research we combine these approaches: we first identify and compile an extensive list of mental health disorder markers for a wide range of mental illnesses, drawn from both unimodal and multimodal methods, and subsequently use this list to determine whether the multimodal approach can outperform the unimodal approaches.
Methods
For this study we used the well-known and robust multimodal DAIC-WOZ dataset derived from clinical interviews. Here we focus on the text and audio modalities. First, we constructed two unimodal models to analyze text and audio data, respectively, using feature extraction based on the extensive list of mental disorder markers that we identified and compiled from related and earlier studies. For our unimodal text model, we also propose an initial pragmatic binary label creation process. Then, we employed an early fusion strategy to combine our text and audio features before model processing. Our fused feature set was then given as input to various baseline machine and deep learning algorithms, including Support Vector Machines, Logistic Regression, Random Forests, and fully connected neural network classifiers (Dense Layers). Ultimately, the performance of our models was evaluated using accuracy, AUC-ROC score, and two F1 metrics: one for the prediction of positive cases and one for the prediction of negative cases.
Results
Overall, the unimodal text models achieved an accuracy ranging from 78% to 87% and an AUC-ROC score between 85% and 93%, while the unimodal audio models attained an accuracy of 64% to 72% and AUC-ROC scores of 53% to 75%. The experimental results indicated that our multimodal models achieved comparable accuracy (ranging from 80% to 87%) and AUC-ROC scores (between 84% and 93%) to those of the unimodal text models. However, the majority of the multimodal models managed to outperform the unimodal models in F1 scores, particularly in the F1 score of the positive class (F1 of 1s), which reflects how well the models perform in identifying the presence of a marker.
Conclusions
In conclusion, by refining the binary label creation process and by improving the feature engineering process of the unimodal acoustic model, we argue that the multimodal model can outperform both unimodal approaches. This study underscores the importance of multimodal integration in the field of mental health diagnostics and sets the stage for future research to explore more sophisticated fusion techniques and deeper learning models.
Background
This work expands upon Spruit et al. [1] ‘Exploring language markers of mental health in psychiatric stories’ as a main source of inspiration for our unimodal text model, both in regards to the methodology and in identifying language markers for various mental health disorders. Furthermore, Cho et al. [2] ‘Review of Machine Learning Algorithms for Diagnosing Mental Illness’ provides a thorough review of various machine learning algorithms in the field of diagnosing mental health disorders. According to their review, SVM and RF clearly outperformed simpler models like Naïve Bayes and KNN for this diagnostic task. The authors claim that the SVM model has been employed before in all domains of mental health and that it normally achieves more than 75% accuracy. Other relevant works supporting our selection of baseline classifiers were [3,4,5,6,7]. Specifically, the study performed by Assan et al. focused on the detection of depression by exploring a broad range of machine learning classifiers, among which SVM, Random Forest and Logistic Regression were also examined [6].
Concerning feature extraction, papers [1, 3, 6, 8,9,10] demonstrated the strengths of Linguistic Inquiry and Word Count (LIWC) for the text model. LIWC is a toolkit of Natural Language Processing (NLP) techniques that calculates the number of words of certain categories used in a text based on a dictionary [1, 33]. Similarly, other related works (e.g. [11,12,13]) extracted GloVe embeddings and Mel-Frequency Cepstral Coefficients (MFCCs) for the text and audio models, respectively. The GloVe model is a well-known vector space representation built from a global word-word co-occurrence matrix [34], while MFCCs are well known for their ability to detect emotions from acoustic signals [35].
Another interesting study was performed by Yazdavar et al., who present the benefits of early fusion, where features from different modalities are concatenated into a single feature vector as input for a model. Specifically, they explain how early fusion is less computationally expensive than late fusion and how their model reduces the learning effort and has shown promising results [3]. A recent comparative analysis of early versus late fusion for multimodal prediction models also demonstrates that early fusion is the best option when model knowledge is available [36].
In Table 1, we present our findings with regard to mental health markers that can be identified from textual data. The identified markers are applicable to a wide range of mental disorders, as has been shown in the listed studies. This comprehensive list of language markers was gathered through extensive research and a literature review.
Similarly, Table 2 presents the most prominent speech markers for various mental disorders that have been studied in the respective listed papers (Sources), which we identified through extensive research and a literature review.
The identified linguistic and acoustic markers were systematically derived from previous empirical studies in this field. Each of these studies examined how specific features in these two modalities correlate with underlying mental health illnesses. Based on all of these related works, which are cited in Tables 1 and 2, the markers were selected for their ability to capture fundamental symptoms and behavioral characteristics associated with each mental health disorder. For instance, schizophrenia-like disorders are linked to disorganized speech patterns and a lack of coherence, which align with the disorder’s characteristic cognitive disturbances. Similarly, markers such as self-focused language (increased use of first-person singular pronouns) in depression reflect an increased self-focus, which has been shown to be a common depressive symptom. In the case of the speech modality, acoustic features such as reduced pitch variability in major depressive disorder or heightened jitter and shimmer in anxiety disorders have previously been identified as indicative of the psychomotor and physiological changes that often accompany these conditions. By identifying these markers, researchers aim to map observable speech and language features onto the cognitive, emotional, and neurological dysfunctions that are characteristic of each mental disorder. This connection not only enhances our understanding of the clinical presentation of these conditions but also provides a foundation for the development of objective, speech-based diagnostic tools. Each marker is considered relevant because it taps into key symptomatic domains of the disorders, as demonstrated through robust statistical analyses and cross-study validations in our literature. To our knowledge, this is the first comprehensive overview of linguistic and speech markers for mental health disorders.
Methods
Aim
The main objective of this paper is to study multimodal machine learning with regard to identifying mental health disorder markers, and consequently to compare it with the involved unimodal approaches, in order to find out whether the multimodal approach can outperform the unimodal ones. The marker identification is performed for a wide range of mental illnesses and, to properly assess the performance of all approaches, we employ four machine learning models as baselines: Support Vector Machine (SVM), Random Forest (RF), Logistic Regression (LR), and a Fully Connected Neural Network (FCNN, referred to as ‘Dense Layers’), and we evaluate their performance with four specific evaluation metrics. Moreover, to make the most out of this comparison, we perform an extensive feature extraction, including more than 150 textual and acoustic features (including the aggregated statistics). All source code of our experiments is publicly available on GitHub at https://github.com/George-Drg/Multimodal-vs.-Unimodal-approaches-for-identifying-mental-disorder-markers/tree/main.
Data overview
Concerning our dataset, we acquired the DAIC-WOZ dataset, which is a set of clinical interviews designed specifically to help diagnose depression, PTSD and anxiety [32]. The dataset includes interviews taking place between a human-controlled, virtual interviewer called ‘Ellie’ and a number of real participants. Three modalities are present in this database, namely transcript, audio and video. Although the list of mental disorders included is limited compared to ours, it is still the most suitable dataset that is publicly available.
The dataset includes 189 recorded interviews, one corresponding to each participant, with an average length of 16 minutes. Specifically, the minimum and maximum voice recording lengths are 7 minutes and 33 minutes, respectively. The dataset includes both the audio files and their corresponding transcripts. Conveniently, the audio files were already standardized at a sampling rate of 16 kHz. Moreover, the authors used the Patient Health Questionnaire 8 (PHQ-8) to assign binary labels indicating the presence or absence of depression. Another positive attribute of the dataset is that it comes with some pre-extracted formant features.
A downside of DAIC-WOZ is its relatively small size. Moreover, we unfortunately had to exclude one of the 189 samples, as it contained so much noise that neither noise reduction methods nor external programs like Audacity could fix it.
Regarding the DAIC-WOZ dataset, one can acquire it by applying for it through the official website, which falls under the ownership of the University of Southern California (https://dcapswoz.ict.usc.edu/).
Methodology concept
Having identified, through literature research, a variety of mental illness markers and having acquired the dataset, the next step was to create three different models: one unimodal text model, one unimodal acoustic model and one multimodal model that combines both. This setup enables the comparison between the unimodal and multimodal approaches. For the sake of a fair comparison, we trained the same machine learning models (i.e. SVM, Random Forest, Logistic Regression and a FCNN (Dense Layers)) within each approach and evaluated their performance using four particular evaluation metrics, namely accuracy, AUC-ROC score and two variants of the F1 score (F1 of positive cases (F1 of 1s) and F1 of negative cases (F1 of 0s)).
Unimodal text model methodology
In Fig. 1, the left flowchart illustrates the methodology that we followed for the text model from this point on.
Data preprocessing
The first step was to choose what features would be extracted, in order to preprocess the data accordingly. On that note, the selected features for the text model included Linguistic Inquiry and Word Count (LIWC) categories, GloVe embeddings, K-means clustering and Part of Speech Tags (POS-Tags) counts.
LIWC is a well-known analysis tool from the field of cognitive psychology for analyzing texts linguistically.
Given the structure of the transcripts available in the DAIC-WOZ dataset, where the speaker is identified in the ‘speaker’ column (either “Participant” or “Ellie”, the virtual agent), we modified the script so that it selectively kept only the responses given by the participant. Then, we concatenated all the responses of a single participant into a single string; this way we ended up with 189 strings with an average length of 7306 words. Moving on, to address clustering, we applied the elbow method to determine the optimal number of clusters for our data [30]. Given a range of cluster numbers, this method plots the sum of squared distances of samples to their closest cluster center. Then, by examining the plot, we identified the point at which the decrease changes sharply, similar to the bend of an elbow. This point most often represents the optimal number of clusters. The resulting plots indicated that the most suitable number of clusters is either 4 or 5, where the inertia showed the smallest decrease. Furthermore, we confirmed the optimal number of clusters found using the elbow method with the silhouette method [31]. Finally, the text preprocessing done for the GloVe vectors involved the removal of English stopwords, tokenization and lower-casing.
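A minimal sketch of this cluster-count selection is given below, assuming the normalized LIWC scores are available as a feature matrix `X` (the variable and function names are ours, for illustration only):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(X, k_range=range(2, 11)):
    """Plot inertia (elbow method) and report silhouette scores for candidate cluster counts."""
    inertias, silhouettes = [], []
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        inertias.append(km.inertia_)                       # sum of squared distances to centers
        silhouettes.append(silhouette_score(X, km.labels_))
    plt.plot(list(k_range), inertias, marker="o")
    plt.xlabel("Number of clusters k")
    plt.ylabel("Sum of squared distances (inertia)")
    plt.title("Elbow method")
    plt.show()
    return dict(zip(k_range, silhouettes))                  # used to cross-check the elbow choice
```

The elbow is read off the plot, and the returned silhouette scores serve as the confirmation step described above.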
Feature extraction
Once the transcripts were processed, we performed feature extraction. Since the LIWC analysis provides scores for a wide range of categories and not all of them are relevant to our objective, we decided to map the categories with the identified markers and proceed only with the relevant ones. Table 3 illustrates this mapping process and the selected categories.
‘<Not mappable>’ refers to language markers that could not be mapped to a specific LIWC category; we include them in the table only for demonstration purposes. Moreover, since this table provides only a sample of the mental illnesses included in our research, some other relevant categories are not present. This applies to the anx, we, you, they, bio and relativ categories.
The LIWC features, in particular, were present in every study related to the combination of the text modality and mental health disorders. The overall purpose of these features is to analyze word use on semantic, emotional and syntactic levels. LIWC provides categorical scores based on predefined dictionaries and offers linguistic and psychological context for words.
In the case of clustering, each of the created clusters represents a group of texts that have similar LIWC metrics. Using the average values of these metrics as our criterion, we deduced the dominant characteristics of each cluster’s text segments. The extracted information can be used both for comparative analysis (for instance, noticing that a cluster has significantly higher scores in posemo (positive emotions) can mean that it has a more positive tone compared to others) and for contextual understanding (i.e. drawing hypotheses or identifying patterns based on what the clusters represent). Figure 2 illustrates the comparison of average scores across different LIWC categories for the four identified clusters (0, 1, 2 and 3). For a given cluster, each bar shows the normalized average score within a particular LIWC metric. For instance, by observing the scores on the ‘sad’ category, it is noticeable that clusters 0 and 2 display negative average values, indicating a lower presence of language associated with sadness-related themes. Cluster 1, on the other hand, shows an exceptionally high average score, which suggests that the participants or text samples within this cluster are more involved with sadness-related themes and language. The ‘negemo’ category displays a similar trend, where cluster 1 shows one of the highest positive averages. This alignment indicates that the contents of cluster 1 are notably more negative compared to the other three clusters. Overall, the figure highlights how the four distinct clusters differ in their emotional and linguistic patterns; cluster 1, specifically, consistently scores higher in categories such as ‘sad’ and ‘negemo’, which are associated with negative affect.
For the GloVe embedding features, we first downloaded a pre-trained GloVe model, particularly the 100-dimensional (100d) GloVe model, which not only offers a good balance between providing sufficient semantic detail for nuanced text analysis and being computationally efficient (unlike the more intensive higher-dimensional models), but also provides enough depth in the word embeddings without overfitting. The model we selected is Wikipedia 2014 + Gigaword 5, which covers a broad range of topics with additional diversity, essential for capturing embeddings of terms relevant to mental health contexts. The GloVe embeddings are then loaded into Python by creating a dictionary where the keys are words and the values are the corresponding vector representations. GloVe embeddings are especially useful when working with machine learning models, since they can not only improve a model’s performance but also serve as a form of dimensionality reduction. Hence, in the context of identifying language markers for mental disorders, GloVe embeddings can be used to perform a more nuanced semantic analysis of the words used. Their value is also demonstrated by their presence in a large number of related works.
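A minimal sketch of this loading step follows, assuming the 100d file from the Wikipedia 2014 + Gigaword 5 release is available locally as `glove.6B.100d.txt`; the per-participant averaging in `document_vector` is one common pooling choice and is our illustrative assumption rather than a step spelled out above:

```python
import numpy as np

def load_glove(path="glove.6B.100d.txt"):
    """Load pre-trained GloVe vectors into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings

def document_vector(tokens, embeddings, dim=100):
    """Average the 100-d GloVe vectors of a participant's tokens into one document vector."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)
```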
In order to complement our feature set further, we also extracted information on relevant POS-tags. More specifically, we applied tokenization and part-of-speech (POS) tagging to the cleaned text files using the nltk package, for which we downloaded the necessary tokenization and POS-tagging components. Table 4 shows the selected POS-Tag counts that we used in this study.
Furthermore, apart from the relevant tag counts (Table 4), we also extracted counts for first-person singular, third-person singular and third-person plural pronouns, since these categories are particularly relevant to this research (refer to Table 5).
The inclusion of POS-Tag counts as features was a novel addition from our side. Unlike the previous linguistic features, the particular features do not have an established clinical basis within the existing literature on mental health disorders. However, based on our review, we observed that certain mental disorders are frequently associated with linguistic markers such as self-focused language or the use of third-person plural/singular pronouns. Therefore, we hypothesized that including a feature that can keep track of and count the pronoun instances could enhance our ability to capture these patterns and provide valuable insights to our research.
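A minimal sketch of this counting step is given below; the pronoun word lists and the function name `pos_and_pronoun_counts` are our illustrative assumptions, and the nltk resources (‘punkt’, ‘averaged_perceptron_tagger’) are assumed to be downloaded already:

```python
import nltk
from collections import Counter

# Resources assumed already downloaded:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

FIRST_PERSON_SG = {"i", "me", "my", "mine", "myself"}
THIRD_PERSON_SG = {"he", "she", "him", "her", "his", "hers", "himself", "herself"}
THIRD_PERSON_PL = {"they", "them", "their", "theirs", "themselves"}

def pos_and_pronoun_counts(text):
    """Count selected POS tags and pronoun groups in one participant's concatenated responses."""
    tokens = nltk.word_tokenize(text.lower())
    tags = Counter(tag for _, tag in nltk.pos_tag(tokens))
    return {
        "VBG_count": tags.get("VBG", 0),                       # verb, gerund / present participle
        "JJR_count": tags.get("JJR", 0),                       # comparative adjective
        "first_person_sg": sum(t in FIRST_PERSON_SG for t in tokens),
        "third_person_sg": sum(t in THIRD_PERSON_SG for t in tokens),
        "third_person_pl": sum(t in THIRD_PERSON_PL for t in tokens),
    }
```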
Normalization
With feature extraction complete, the next step was to normalize our features in order to ensure a consistent scale and consequently improve the comparability of the data. This makes the data more homogeneous and can reveal patterns that weren’t evident before. For the normalization task, we applied Z-score normalization, also known as standard scaling. This method standardizes the data so that each feature has a mean of 0 and a standard deviation of 1.
Standard Scaler Formula:
\(z = \frac{x - \mu}{\sigma}\)
where \(x\) is the original feature value, \(\mu\) is the feature’s mean, and \(\sigma\) its standard deviation.
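In practice this corresponds to scikit-learn’s StandardScaler; a minimal sketch, assuming the extracted features sit in a matrix `X_features` (an illustrative name):

```python
from sklearn.preprocessing import StandardScaler

# X_features: rows = participants, columns = extracted textual features (illustrative name)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_features)   # each column now has mean 0 and std 1
```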
Binary label creation
As part of our research into identifying potential mental disorder markers, we had to create a binary label to serve as the target against which our models’ performance is measured. Hence, we designed an intuitive and pragmatic labelling process, which emphasizes practicality and common sense to establish a scientific baseline. In order to cover any weak points and to fortify its validity, we performed a long series of experiments before creating the final set of binary labels. For this process we used our main dataset (DAIC-WOZ) and we focused on the majority of the mental disorders studied in this paper (i.e. Depression, Bipolar Disorder, Schizophrenia, PTSD and even ASD).
The first step was to revisit Table 3 and select particular LIWC categories by narrowing down to the most relevant items. The selected categories include anx, bio, cogproc, death, i, negemo, posemo, relig, sad, social and they. The plan was to create a set of thresholds that would allow us to classify a sample as containing or lacking a marker, based on whether a threshold value is exceeded. To discern the optimal values for the LIWC categories’ thresholds we extracted statistical summaries and created distribution plots for each chosen LIWC category. By observing and comparing those normalized categorical scores, along with studying each category’s skewness and kurtosis, we created the first experimental sets of thresholds. For instance, if category ‘sad’ had a mean value of 1.8, we created its threshold accordingly. Knowing the skewness (asymmetry of the distribution) and kurtosis (tailedness, i.e. proneness to outliers) also played a big part in setting thresholds. It helped ensure that the marker identification criteria are based not only on central tendency but also on the distributions’ tails. Particularly for categories with high values on these two measures, setting thresholds to capture extreme values ensures that the labeling process is more likely to identify instances that stand out significantly from the norm.
Utilizing all the information obtained up to this point, we created 10 initial threshold sets. Some of the sets were sensitivity focused, meaning that they had lower thresholds that aimed to capture even less obvious cases and other sets were more specificity focused, meaning that they were stricter and aimed to capture the more extreme cases. Table 6 illustrates the first experimental sets and the ‘Marker Presence’ column refers to the percentage of samples that were flagged as including a language marker.
Before refining our thresholds, we examined the previously assigned PHQ-8 binary labels. Although those labels focus on depression only, we used them as a baseline. The PHQ-8 labels had flagged approximately 30% of the samples as diagnosed with depression, and we considered this percentage a good balance between positive and negative cases. As such, we aimed to create a set of thresholds that not only balanced sensitivity and specificity but also achieved a marker presence label assignment close to 30%. In that regard, seeing how Set 8 (from Table 6) achieved the closest marker presence percentage to the PHQ-8 labels, we used it as our new baseline and created 5 more threshold sets. Table 7 presents the characteristics of the refined sets. With a marker presence of 31.7%, we selected Set 15 for our labeling. The refined composite set’s alignment with the known prevalence suggests that it effectively captures a realistic proportion of instances with potential markers, indicating that its thresholds are well-calibrated. In Table 8, we present the values of the final threshold set.
Category ‘i’ has the lowest threshold because its scores are distributed at lower values. Categories ‘social’ and ‘cogproc’ have a dual threshold. Lower values on ‘social’ aimed to capture potential social withdrawal and upper values aimed to capture potential excessive social referencing. Similarly, lower thresholds on ‘cogproc’ aimed to capture disorganized thinking (for instance, in schizophrenia) and upper thresholds aimed to capture heightened cognitive processing (for instance, overthinking in the case of OCD). In further research we will further investigate and validate this binary label creation process.
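To make the labeling mechanics concrete, the sketch below applies single and dual thresholds to a table of normalized LIWC scores. The threshold values and category subset shown here are placeholders for illustration only; the actual values are those of Table 8, and the dataframe and function names are ours:

```python
import pandas as pd

# Placeholder thresholds for illustration only -- the real values are those in Table 8.
UPPER_THRESHOLDS = {"anx": 0.5, "negemo": 0.6, "sad": 0.4, "death": 0.3, "i": 0.1}
DUAL_THRESHOLDS = {"social": (-0.8, 0.9), "cogproc": (-0.7, 0.8)}  # (lower, upper) bounds

def marker_present(row):
    """Flag a sample (1) when any normalized LIWC score crosses its threshold, else 0."""
    if any(row[cat] > thr for cat, thr in UPPER_THRESHOLDS.items()):
        return 1
    for cat, (low, high) in DUAL_THRESHOLDS.items():
        # The lower bound captures e.g. social withdrawal / disorganized thinking,
        # the upper bound captures excessive referencing / heightened cognitive processing.
        if row[cat] < low or row[cat] > high:
            return 1
    return 0

# liwc_df: one row per participant with normalized LIWC category scores.
# Distribution statistics used to tune the thresholds:
#   liwc_df.agg(["mean", "std", "skew", "kurtosis"])
# labels = liwc_df.apply(marker_present, axis=1)
```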
Feature selection
Since our complete textual feature set consists of more than 150 features, which is a very large number considering the dataset’s size, we decided to use a wrapper method to select the optimal features. We proceeded with the Recursive Feature Elimination (RFE) approach, which, as the name implies, works by recursively removing the least important features based on the weights of the model and then re-building the model until the specified number of features is reached.
In order to identify the optimal number of features and the most suitable predictive model for the selection of these features, we experimented by tuning various parameters. Table 9 demonstrates the parameters of these experiments.
Attempting to acquire a set of features that would eventually bring better results and that could potentially uncover underlying patterns and relationships between features, we iterated over this process while tuning each of the parameters with respect to the recommended feature subset and the modeling results achieved in each iteration. The first parameter was the choice between RFE and RFECV. The former performs the feature selection given a fixed number of features to select (set by us), while the latter performs the selection using cross-validation and finally recommends the best number of features (along with the corresponding top feature names) as judged by the assigned scores. The selection process of these methods is performed on the training data, in order to avoid leaking any information from the test set. This ensures an unbiased evaluation of the model’s performance. In both approaches we experimented with both Logistic Regression and Random Forest. RFE (as well as RFECV) is a model-specific feature selection method. This means that the final set of features proposed by the process is influenced by the characteristics and requirements of the respective model and emphasizes maximizing its performance. Then, for each predictive model we iterated five times, setting the number of features to 10, 15, 20, 25 and 30 (in the case of RFE).
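A minimal sketch of the two variants with scikit-learn follows; `X_train`/`y_train` are assumed to be the training-fold features (as a DataFrame) and the binary labels, and the specific estimator settings are illustrative:

```python
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Fixed-size variant: keep the 20 features ranked highest by a Logistic Regression estimator.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=20)
rfe.fit(X_train, y_train)                       # selection runs on the training data only
selected_features = X_train.columns[rfe.support_]

# Cross-validated variant: the procedure itself recommends the number of features to keep.
rfecv = RFECV(estimator=RandomForestClassifier(random_state=42),
              cv=StratifiedKFold(n_splits=5), scoring="roc_auc")
rfecv.fit(X_train, y_train)
print(rfecv.n_features_, list(X_train.columns[rfecv.support_]))
```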
Then, the second parameter we experimented with was the inclusion or exclusion of re-scaling during the feature selection process. As seen in Table 9, all parameter combinations were tested with and without additional scaling. Moreover, it can be observed that Random Forest isn’t influenced at all by the presence or absence of re-scaling, while Logistic Regression provides completely different results in each case.
Every feature set was then evaluated through all of our machine learning models and we moved on with the set proposed by the best performing selector. In the case of the unimodal text model, this was the selector without cross-validation, with re-scaling, and with Logistic Regression as the classifier. Table 10 presents the top 20 features (unordered) selected by the feature selector with these parameters.
Modeling
As the left flowchart of Fig. 1 illustrates, the final step of the unimodal text model’s methodology was testing the models’ performance. To do that, we used the aforementioned feature set as input to our four ML models and evaluated their performance using the four predefined evaluation metrics.
Unimodal acoustic model methodology
The methodology for the unimodal acoustic model closely follows the unimodal text model’s methodology, which was just discussed.
Data preprocessing
Once again, the initial step of the methodology involves data preprocessing, and particularly noise reduction and audio file segmentation. First, we performed noise reduction, specifically spectral subtraction, on all our audio files. The particular noise reduction method works well for constant noise like hums or hisses, while still preserving the quality of speech. Then, to verify the method’s successful application, we used the Audacity software and listened through 15 to 20 samples. Through this testing, we noticed that audio file #300 included a particular noise that couldn’t be cleared away, and hence it was removed from our dataset along with the corresponding transcript.
Once all of our audio files were properly noise-reduced, we segmented them into smaller chunks, which are more easily manageable and can potentially offer more nuanced information during the feature extraction process. Smaller segments allow for more precise calculations of features such as pitch, jitter, shimmer, and MFCCs, which can lead to a better understanding of the audio characteristics relevant to mental health markers. The segmentation was done based on volume, with the addition of some buffering. This segmentation method was selected because, through Audacity, we noticed that the participant’s speech was considerably louder than that of the interviewer. Additionally, the extra buffering ensured that the segmented chunks included complete sentences and that there was no loss of context.
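A minimal sketch of such volume-based segmentation with buffering, using librosa’s energy-based splitting, is shown below; the `top_db` threshold, the 0.5-second buffer, and the function name are illustrative assumptions rather than the exact settings used in this study:

```python
import librosa
import numpy as np
import soundfile as sf

def segment_by_volume(in_path, out_dir, top_db=25, buffer_s=0.5):
    """Split an interview into louder-than-threshold chunks, padded with a small buffer."""
    y, sr = librosa.load(in_path, sr=16000)
    # Keep intervals whose level is within `top_db` dB of the file's peak (the louder speaker).
    intervals = librosa.effects.split(y, top_db=top_db)
    buffer = int(buffer_s * sr)
    for i, (start, end) in enumerate(intervals):
        start = max(0, start - buffer)            # extend backwards ...
        end = min(len(y), end + buffer)           # ... and forwards to keep complete sentences
        sf.write(f"{out_dir}/segment_{i:03d}.wav", y[start:end], sr)
```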
Feature extraction
Regarding the feature extraction, we used the librosa and praat-parselmouth libraries in order to extract pitch, jitter, shimmer, Harmonics to Noise Ratio (HNR), 13 MFCC coefficients and energy. Specifically, for jitter, we extracted jitter local, jitter ppq5 and jitter absolute and for shimmer we extracted shimmer local and shimmer apq5. All of these features were extracted on the segment level, as feature extraction was applied on every single segment of every interview.
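A minimal sketch of this segment-level extraction with librosa and praat-parselmouth is given below; the Praat call arguments are the libraries’ typical defaults and the function name is ours, so the exact settings used in this study may differ:

```python
import numpy as np
import librosa
import parselmouth
from parselmouth.praat import call

def extract_segment_features(path):
    """Extract pitch, jitter, shimmer, HNR, 13 MFCCs and energy for one audio segment."""
    y, sr = librosa.load(path, sr=16000)
    snd = parselmouth.Sound(path)

    # Pitch (fundamental frequency); unvoiced frames (0 Hz) are dropped.
    f0 = snd.to_pitch().selected_array["frequency"]
    f0 = f0[f0 > 0]

    # Jitter and shimmer via a Praat point process (typical Praat default arguments).
    pp = call(snd, "To PointProcess (periodic, cc)", 75, 500)
    jitter_local = call(pp, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
    jitter_ppq5 = call(pp, "Get jitter (ppq5)", 0, 0, 0.0001, 0.02, 1.3)
    jitter_abs = call(pp, "Get jitter (local, absolute)", 0, 0, 0.0001, 0.02, 1.3)
    shimmer_local = call([snd, pp], "Get shimmer (local)", 0, 0, 0.0001, 0.02, 1.3, 1.6)
    shimmer_apq5 = call([snd, pp], "Get shimmer (apq5)", 0, 0, 0.0001, 0.02, 1.3, 1.6)

    # Harmonics-to-Noise Ratio.
    harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
    hnr = call(harmonicity, "Get mean", 0, 0)

    # 13 MFCC coefficients (averaged over frames) and mean RMS energy.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    energy = float(librosa.feature.rms(y=y).mean())

    features = {
        "pitch_mean": float(np.mean(f0)) if f0.size else 0.0,
        "pitch_median": float(np.median(f0)) if f0.size else 0.0,
        "pitch_std": float(np.std(f0)) if f0.size else 0.0,
        "jitter_local": jitter_local, "jitter_ppq5": jitter_ppq5, "jitter_absolute": jitter_abs,
        "shimmer_local": shimmer_local, "shimmer_apq5": shimmer_apq5,
        "hnr": hnr, "energy": energy,
    }
    features.update({f"mfcc_{i + 1}": float(v) for i, v in enumerate(mfcc)})
    return features
```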
On the other hand, the authors of the dataset had also provided the first five formant features, which were extracted on the interview level. Hence, at the early stages of this study we worked with two different sets of features; one at the interview level and one at the segment level. Regardless, in both sets we also extracted the mean, median and standard deviation statistics for every feature.
The feature selection for the audio model was based on the features proposed throughout the literature and the related work. Moreover, as demonstrated by the marker tables in the ‘Background’ section, most of the extracted acoustic features have been identified as related to one or more mental disorders. Among all those features, pitch was one of the most popular, with a presence percentage of about 90%. It has been described as a good indicator in mental health disorder identification, specifically for identifying emotional states or stress. The fundamental frequency (F0), the lowest frequency of the speech signal, is perceived as pitch, and we extracted its mean and median. Jitter and shimmer were also very popular features across the literature review. Overall, jitter measures the stability of the voice’s frequency and is often chosen as a measure for the detection of voice disorders or vocal pathologies; in short, jitter reflects variations in pitch. Shimmer, in turn, measures the stability of the voice’s amplitude. Like jitter, shimmer is critical in diagnosing and researching voice quality. A lower Harmonics-to-Noise Ratio (HNR) indicates a breathy, hoarse voice, which reflects reduced vocal clarity and is also associated with various mental illnesses. Finally, MFCC coefficients capture spectral features of the voice and have been used to identify disorders such as schizophrenia and PTSD, where vocal tract dynamics and speech patterns are affected.
Normalization
Moving on, using the same formula as for the text model, we normalized all of our features (including the formant set) and their statistics. This ensures consistency across the features’ values and makes them easier to analyze and compare.
Feature aggregation
In order to be able to compare all of our features and perform feature selection properly, we harmonized all features to the same unit of analysis, i.e. the interview level. To achieve that, we aggregated all of our segmented-level features back to the interview level by extracting the mean, median and standard deviation of all the segments’ values per interview.
Example:
To clarify the process, let’s take pitch as an example. For pitch we had already extracted the mean, median and standard deviation at the segment level. Following the above aggregation method, we obtained the mean of the pitch_mean values across all segments (i.e. the average of the averages), the median of those pitch_mean values (the central tendency of pitch across segments) and their standard deviation (pitch variability) within an interview. Similarly, we obtained the mean, median and std of the pitch_median and pitch_std values of all segments within each interview.
Feature aggregation eventually led to a complete feature set that merged all the extracted acoustic features and their statistics, with each entry expressing a single overall value per interview.
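A minimal pandas sketch of this segment-to-interview aggregation follows; `segment_df` and the column names are illustrative assumptions:

```python
import pandas as pd

# segment_df: one row per audio segment, with a 'participant_id' column and
# segment-level features such as 'pitch_mean', 'pitch_median', 'pitch_std', ...
interview_df = (
    segment_df
    .groupby("participant_id")
    .agg(["mean", "median", "std"])     # aggregate every segment-level column per interview
)
# Flatten the resulting MultiIndex columns, e.g. ('pitch_mean', 'std') -> 'pitch_mean_std'
interview_df.columns = ["_".join(col) for col in interview_df.columns]
```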
Binary label assignment
Although the creation of the binary labels was completely based on LIWC features (purely textual), we still assigned the same labels to our audio data. This is because the transcripts on which we implemented the process are directly linked with the corresponding audio files.
Feature selection
The final acoustic feature set exceeded 150 feature columns, and as such we proceeded with the same feature selection methodology as for the unimodal text model. The exact same experiments were performed, and the best-performing selector for this model was RFE with Random Forest.
Table 11 illustrates the top 20 features (unordered) proposed by the RFE with Random Forest selector.
Modeling
As indicated by the flowchart (see Fig. 1), the features of Table 11 were then used to train our models.
Multimodal model methodology
For the implementation of the multimodal model we proceeded with early fusion (also known as feature-level fusion). This fusion technique involves the concatenation of features from different modalities into a single feature vector before feeding them as input into the machine/deep learning models. Early fusion is a straightforward and less computationally intensive approach that doesn’t require sophisticated synchronization between the two modalities.
To move on with this modality concatenation, we first created two sets of features from each unimodal model. From the textual model we created a set with 20 features and one with 15 features, and from the acoustic model we extracted a set of 15 features and another with 10 features. The difference in feature set sizes is due to the preliminary results, which indicated that the acoustic model performed noticeably worse than the text model.
Using these four feature sets, we concatenated them into three multimodal feature sets, each following a different approach. The three multimodal sets and their details can be found in Table 12.
The features from each unimodal model were selected based on the top 20 features proposed by the best performing feature selector of the corresponding modality; i.e. RFE with Logistic Regression and rescaling for text and RFE with Random Forest for acoustic.
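A minimal sketch of this concatenation step is given below; the dataframe and variable names (`text_df`, `audio_df`, `top20_text`, `top10_audio`) are illustrative assumptions:

```python
import pandas as pd

# text_df / audio_df: interview-level feature tables indexed by participant id, and
# top20_text / top10_audio: the column names chosen by each modality's best selector.
fused_20t_10a = pd.concat([text_df[top20_text], audio_df[top10_audio]], axis=1)

# The fused vectors are then fed to the same classifiers as the unimodal sets, e.g.:
# clf.fit(fused_20t_10a.loc[train_ids], labels.loc[train_ids])
```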
Similar to the unimodal methodologies, the three newly created feature sets were used to train the same four machine/deep learning models and were then evaluated using the same metrics.
For all three modality approaches we used cross-validation instead of the typical train-test split. Not only is cross-validation (K-fold) more reliable, but it also helps to efficiently assess the models’ effectiveness and prevent them from overfitting. To illustrate, Fig. 3 visualizes both the unimodal (Fig. 3a) and multimodal (Fig. 3b) configurations for the neural network-based models.
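A minimal sketch of this evaluation loop with scikit-learn is shown below; the choice of 5 folds and the SVM settings are illustrative assumptions, and the reported values are the fold means described in the Results section:

```python
import numpy as np
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.metrics import make_scorer, f1_score
from sklearn.svm import SVC

scoring = {
    "accuracy": "accuracy",
    "auc_roc": "roc_auc",
    "f1_1s": make_scorer(f1_score, pos_label=1),   # presence of a marker
    "f1_0s": make_scorer(f1_score, pos_label=0),   # absence of a marker
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(SVC(kernel="linear"), X, y, cv=cv, scoring=scoring)

# Report the mean of each metric across folds, as in Tables 13-15.
print({m: np.mean(v) for m, v in scores.items() if m.startswith("test_")})
```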
Results
In this section we present the performance of the Unimodal Text Model, the Unimodal Acoustic Model, and the Multimodal Model, respectively.
Unimodal text model
Features discussion
When the feature elimination and selection process was complete, we observed that certain features stood out across all methods and settings. First of all, we noticed a consistency in the appearance of the ‘anx’, ‘sad’, ‘they’ and ‘death’ LIWC categories. This indicates that these psychological and thematic aspects of the text are highly relevant to the identification of mental health disorder markers. Their consistent presence underscores the significance of emotional and thematic content in the analysis. Apart from LIWC categories, there were also some GloVe dimensions that were repeatedly selected. This consistency of certain GloVe dimensions suggests that they capture key semantic features relevant to the identification of language markers associated with mental health disorders. Another feature that was prevalent across the various feature selectors was PCA2. In the context of Principal Component Analysis (PCA), the second principal component (PCA2) accounts for the next highest variance after the first principal component (PCA1). The fact that PCA2 appears more often than PCA1 implies that PCA2 captures significant aspects of the data that are not captured by PCA1. Finally, concerning our last type of textual features, POS-Tag counts, we noticed that ‘VBG_count’ (verb, gerund or present participle) and ‘JJR_count’ (comparative adjective) both belonged to the top 15 features selected across all feature selectors. The frequency of the first points to the syntactic structure of sentences as an informative feature, while the frequency of the latter suggests that certain grammatical constructs may play a role in distinguishing texts related to mental health. Overall, features that appear consistently tend to be less sensitive to variations in the modeling process or data sampling, making them reliable choices for critical analyses.
Our analysis indicated that the most selected feature overall (across all of our feature selectors) was the LIWC category ‘death’. This consistency obviously indicates its importance when it comes to the identification of mental health disorder markers.
Results
In Table 13 we present the results achieved by our models, using the feature set recommended by the best performing selector approach, i.e. RFE with Logistic Regression and additional scaling. As discussed earlier, k-fold cross-validation was applied to our models, and since every fold of the cross-validation produced a different accuracy and AUC-ROC score, we modified our script so that it returned the mean CV values across all folds. As such, in Table 13, ‘Accuracy’ represents the mean CV accuracy and ‘AUC-ROC’ represents the mean CV AUC-ROC score across all folds. Moreover, ‘F1 - 0s’ and ‘F1 - 1s’ represent the F1 scores achieved when predicting the absence and the presence of mental disorder markers, respectively. For the three machine learning models we used a random state seed, so regardless of how many times we reran the models the results did not deviate much from the ones presented in the table. In the case of the deep learning model, on the other hand, every run produced different results (with observable variance), and as such we report the best scores out of four runs for each set of features.
Based on the presented results, it was indicated that each model works best with a different number of features. SVM with a linear kernel and Logistic Regression both perform best with 20 features, and it is also clear that they achieve similar scores on all metrics. When tested with 25 features their performance drops, and with 30 features it drops even further. This means that overfitting became more pronounced and that the addition of more features was redundant. In the case of Random Forest and Dense Layers, the models performed their best across all metrics (with a small exception in the accuracy of the Dense Layers) when given 25 features. Both Random Forest and Dense Layers benefit from having more features. As an ensemble learning method, Random Forest can use the larger number of features to create more informative splits across its decision trees, which also explains its considerably better AUC-ROC score. Similarly, neural networks can use larger sets of features to learn more complex patterns because of their capacity to handle and learn from additional features. However, when tested with a set of 30 features (using the same feature selector), every model under-performed on every single metric. This implies that after a certain point the models cannot generalize as well.
Overall, while testing for overfitting by comparing the train and test set metrics, we noticed some mild overfitting (on average about 5% gap between the train and test set performances). However, due to all the mitigation measures taken and the use of cross validation, we deem this modest amount of overfitting acceptable in this case.
Unimodal acoustic model
Features discussion
In the case of the unimodal acoustic model’s feature selection, we noticed that across all different methods and settings, certain features (particularly among the MFCCs and formant features) consistently appeared as significant. This suggests that there may be an important relationship between these features and the target variable across both linear (LogReg) and non-linear (RF) model perspectives. Another notable observation is that pitch-related features were more prominently selected by the RF model, specifically with the RFE method. This indicates that the relationship between pitch features and the target variable can probably be captured more effectively by non-linear models (at least in some contexts). Further weight was given to pitch-related features, as they were discussed extensively in related work. This, in combination with the lack of pitch-related features selected by the Logistic Regression model, led us to conclude that the linear nature of this model may not always capture the complex ways in which these features contribute to the classification task. This is probably why Random Forest, being a non-linear model, is better at capturing such complexities and interactions; for instance if pitch interacts with other features in a way that doesn’t lend itself to linear separation.
Results
The results shown in Table 14 represent the performance of our models when tested on the features of the RFE with Random Forest selector. The columns of the table represent the same values as the columns of the corresponding table (Table 13) of the unimodal text model.
By observing the results, it is noticeable that the models performed best with the same feature set sizes as in the text model. SVM had the best results with the 20-feature set, but it did surprisingly poorly at predicting marker presence (F1 score of 1s). With 10 and 15 features, SVM actually scored 0 on this metric. This can be attributed to various reasons, such as overfitting or feature selection impact. It is possible that SVM is overfitting to the majority class (ignoring the minority class entirely), or that the feature selection is not sufficiently informative for this model to distinguish between classes. Logistic Regression, although comparable with SVM on the other metrics, achieved approximately double SVM’s ‘F1 score - 1s’. Dense Layers did better than SVM and LogReg, especially with 10 features, and actually achieved the best score on ‘F1 - 1s’, along with RF. Both LogReg and Dense Layers performed best with the 10-feature set. Finally, the RF model achieved the best scores with 15 features, with the single exception of ‘F1 - 1s’, which appeared to be higher with fewer features (10). It is also interesting that RF performed the best across all models and metrics. We hypothesize that this is because the feature sets used were picked by the selector that used RF as its classifier. It is noteworthy that these RF results (on 15 features) were the highest achieved by the unimodal audio model across all feature selectors, for every single evaluation metric.
For reference, unlike the unimodal text model, the audio model under-performed with any additional number of features (we also tested 25 and 30 features). This indicates that our models cannot learn from the acoustic features as effectively as they did from the textual ones.
Multimodal model
Table 15 presents the results achieved on the concatenated features sets. In the ‘Features’ column, ‘t’ stands for textual features and ‘a’ for acoustic.
The presented results indicate that the ‘20t, 10a’ feature set works well with our models. In all cases, with a couple of exceptions, this set achieves the highest results. The exceptions are the AUC-ROC score achieved by the ‘20t, 15a’ feature set on the SVM model, which is the highest one observed across all cases, and the case of the Dense Layers. Dense Layers appear to work best (overall) with the ‘20t, 15a’ set, although the score gap with the other two sets is not that noteworthy. The ‘20t, 15a’ feature set is the largest of the three, consisting of 35 features instead of 30. This could mean that Dense Layers can be effectively trained on larger feature sets and could improve their results further.
Table 16 presents a comparison between the three modality approaches, for each model, using the best performing features in each case.
Just like with the unimodal approaches, the RF model shows high levels of overfitting here as well. We found that the RF model achieves 100% accuracy and AUC-ROC scores on the train sets, while the results presented in Table 15 are those of the test set. This gap is a clear indicator of overfitting. Unlike RF, SVM shows only minimal overfitting on the ‘20t, 15a’ and ‘20t, 10a’ feature sets and mild overfitting on the ‘15t, 15a’ set. Logistic Regression overfits slightly as well, but not to the point that it is negatively affected.
Discussion
Results
Observing the performances of the two unimodal approaches, it is evident that the unimodal text model by far outperforms the unimodal audio model. More importantly, this gap is even more pronounced for some of the classifiers. In the case of SVM and Logistic Regression, the text model achieves an \(\approx 18\%\) higher accuracy and an \(\approx 32\%\) higher AUC-ROC score than the audio model, while the F1 score of 1s is higher by more than 60%. Similarly, for Dense Layers the audio model attains an accuracy \(\approx 15\%\) lower and an AUC-ROC \(\approx 20\%\) lower than the text model. On the other hand, in the case of Random Forest, which shows the best performance for the audio model, the gap in accuracy is only \(\approx 12\%\) and the gap in AUC-ROC score is barely 10%. Concerning the F1 score of 1s, however, we can still observe a performance difference of more than 20%.
However, this huge performance gap between the two unimodal models can most probably be attributed to the way that the binary labels were created. Since that process was based on the LIWC categories, which are a text feature, it makes sense for the text features to be more accurate during predictions and for the audio features to encounter some difficulties.
It’s not rare for text features to outperform audio features in tasks related to the particular topic. When it comes to identifying mental disorder markers, the text modality has proven to be extremely capable of leading to better predictions, even more so when there is a relevant textual content that offers clear linguistic markers. Although the text model significantly outperforms the audio model, we should still not diminish the value provided by the audio features. Analyzing these features can help discover distinct and complementary insights and this is where multimodal models, that utilize both of these modalities, can shine. By combining the strengths offered by each modality and implementing fusion techniques that bring forth those strengths, it is possible to capture more comprehensive information and details of mental health states.
As indicated by the results in Table 15, the best scores are attained by the 20t, 10a feature set, with a few exceptions appearing with the 20t, 15a set. However, there is not a single instance of the 15t, 15a feature set achieving the best results over both of the other two sets. It is evident that assigning additional weight to the textual features yields a boost in the models’ performance. By placing more emphasis on the text features, our multimodal model learns to generalize better. This was already indicated earlier by the large gap between the text model’s and the audio model’s results, and the latest comparison further confirmed this observation. Regardless, experimenting with a modality-balanced feature set proved helpful in revealing the flaws and the correct steps in the whole modeling approach.
Overfitting prevention
To prevent overfitting, our initial approach was to use cross-validation instead of the traditional train-test split approach, which proved to be effective. Not only did we observe a noticeable gap in the overfitting levels between the two approaches, but cross-validation also led to minimal, or at worst mild, levels of overfitting. We also noticed that overfitting was more prevalent with a higher number of features, which can be associated with the small size of the dataset compared to the number of features selected. Furthermore, during the multimodal model’s experiments, we also applied a regularization technique to our Logistic Regression model and added Dropout layers to our Dense Layers model. Although this had only a slight impact on the prevention of overfitting, it had a larger impact on the models’ performance. The results of the Logistic Regression model after applying regularization dropped greatly across all metrics and numbers of features. Additionally, the F1 scores of the positive cases decreased by 60%, due to the low recall rate: for each feature set the regularized model has a precision of 100%, while the recall ranges between 7% and 10%. This is why, for Logistic Regression, we proceeded with the pre-regularization results; the overfitting level may be slightly higher, but the regularized model’s performance was not even comparable.
Limitations
The limitations of this study can easily be realized through the future work recommendations. Everything mentioned there involves methods and techniques whose implementation could improve this research significantly. Yet, one important limitation that is not mentioned there is the process of creating the binary labels. As elaborated above, this process was intuitive and pragmatic but requires further refinement and improvement. Finding a way to make the process more modality-neutral would make the results less skewed towards one modality and less biased. Moreover, with some refined feature engineering for the acoustic model, its performance could increase and consequently improve the performance of the multimodal models as well.
Another limitation of this study was the dataset selection. Although DAIC-WOZ was a great asset for our project, it did limit our research because of its size and its specificity towards a couple of particular mental disorders. Moreover, because of the dataset’s interval overlaps in the transcript, we had to forgo text-speech alignment, a process that plays a fundamental role in multimodality, especially between these two particular modalities. Finally, we believe that implementing models with increased complexity could help to capture more nuanced details that might have been missed by simpler models.
In this study we focused on two of the three modalities in the DAIC-WOZ dataset. As such, this study is a multimodal extension (linguistic and audio markers) in line with [1], which focused on the identification of linguistic markers only. The second modality (audio) was selected based on compatibility: transcripts and audio data are usually interconnected in most datasets, just like many of their features. Nevertheless, the inclusion of the video modality as a third provider of mental illness markers, which is expected to capture important non-verbal indicators such as facial expressions and gestures, has great potential to further enhance the accuracy of mental disorder identification. This will be part of future research.
Challenges
One of the challenges that we overcame was avoiding the loss of modality-specific insights. When merging features at an early stage, it sometimes becomes challenging to discern which modality is contributing to predictions, and this can lead to reduced model interpretability. However, by creating three different sets of combined features (one of which represented the balance between the two modalities), we showed that for the majority of the models the best performing feature set was the one with 66% of the weight on the textual features. For some of the models, the runner-up was the feature set that was only slightly skewed towards the textual modality. Yet, in none of the models did the balanced feature set perform better than the other two. This clearly indicates that the textual features have more insight to offer in the identification of mental disorder markers, at least at the individual level.
The risk of feature dominance posed another challenge during the early fusion. Although giving additional weight to the better performing text features (compared to the acoustic features) increased the performance of our models, this could potentially lead to an imbalance in feature contribution. It is possible for the text modality to dominate the feature importance, overshadowing the audio modality, which would have led to partly biased models. To prevent that, we performed modality-specific preprocessing: we followed the exact same scale-balancing (normalization) techniques for every single feature during the development of each unimodal model. By doing that we guaranteed that all features (of both modalities) were on the same scale, so that their impact should be balanced and any possible feature dominance would be prevented.
Applications
The findings of this study can potentially play a very important role in the development of practical mental health screening or monitoring tools. The study provides not only an extensive list of linguistic and acoustic mental disorder markers, but also experiments and comparisons between unimodal and multimodal approaches. Through the identification of linguistic and acoustic mental disorder markers, this research opens a path towards early, non-invasive screening systems. It should be possible to develop applications, meant for both clinicians and individuals, that utilize these markers to monitor mental health states in real time. These applications could receive input data of different modalities, such as textual content, conversations or voice recordings, and in turn assess mental health conditions. One can consider such tools as an additional diagnostic layer in clinical settings or, alternatively, as part of remote mental health care, where patients can be monitored over time without requiring frequent in-person clinical visits. Moreover, since markers like reduced pitch variability in depression or increased jitter and shimmer in anxiety have clear physiological correlates, they can provide objective, quantifiable data to complement traditional psychiatric assessments. This can lead to improved early detection, personalized treatment plans, and more accurate tracking of patient progress during therapy.
Future work
The most significant recommendation we have to offer, regarding the future work, is the implementation of text-speech alignment, which constitutes a significant aspect of multimodality that facilitates a more comprehensive integration of the two modalities. Implementing this process would also open more opportunities for experimenting with other fusion techniques. Although early fusion provided us with a way around text-speech alignment, it may oversimplify the interaction between the two modalities and it is less flexible in handling their distinct characteristics.
Text-speech alignment would not only enable the extraction of additional important features (speech ratio and pause duration), but also the option of experimenting with deeper deep learning models, like GRUs and LSTMs. These models excel in processing sequences where the timing and order of inputs are crucial to understanding the data’s context and dynamics. The alignment is particularly important in tasks where the synchronization of spoken words and vocal characteristics adds value to the interpretation. GRUs, for instance, are tailored to handle sequences by capturing dependencies at different time steps. In order for the GRU to effectively analyze the responses of the textual information to the speech variations (e.g. speech rate, pauses, intonation, etc), it is crucial for the two modalities to be properly aligned.
Other than the implementation of text-speech alignment, we would also recommend experimenting with the \(F_{0.25}\) metric, which is a variant of \(F_\beta\). \(F_\beta\) is a generalization of the F1 score that allows the relative weighting of precision and recall to be altered. For completeness, we provide the equation for \(F_\beta\) below:
\[F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}\]
In the \(F_{0.25}\) operationalization, \(\beta\) is set to 0.25, indicating that precision is considered more important than recall. This alteration of the original F1 metric is recommended in cases where the consequences of false positives are critical. For instance, in our project, we might prefer missing a few actual markers (accepting a lower recall) over falsely marking too many samples as marker-including (thereby maintaining a higher precision).
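In practice, this metric is readily available; a minimal example using scikit-learn (with purely illustrative labels, where 1 denotes that a marker is present) could look like this:

```python
from sklearn.metrics import fbeta_score

# Illustrative binary labels: 1 = marker present, 0 = marker absent.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]

# beta=0.25 weights precision more heavily than recall, so the resulting
# score is pulled towards the precision value.
score = fbeta_score(y_true, y_pred, beta=0.25)
print(f"F_0.25 = {score:.3f}")
```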
Furthermore, as mentioned earlier, the implementation of text-speech alignment could also open the path to more sophisticated and complex fusion techniques, such as hybrid fusion and end-to-end fusion. The former combines aspects of both early and late fusion: some features may be combined at an early stage, while others are merged after additional individual processing. In this way, modality-specific processing is preserved, at least to some extent, and the model can better learn the distinct properties of each modality. End-to-end fusion, on the other hand, integrates the modalities at a deep level, typically using deep neural networks that learn to extract and combine relevant features autonomously. Following this approach, we could, for example, feed all 300 features into one deep neural network and have it learn the optimal combination of features. An additional advantage is that this method can dynamically adjust modality and feature importance during training, which is particularly beneficial when the interaction between the modalities is complex and highly non-linear.
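As a minimal sketch of such a fusion network (the 250/50 feature split, layer sizes, and activations are assumptions for illustration, not the configuration used in this study), modality-specific sub-networks followed by a learned fusion layer could be written as:

```python
import tensorflow as tf

# Hypothetical per-modality inputs (250 text + 50 audio features = 300 in total).
text_in = tf.keras.Input(shape=(250,), name="text_features")
audio_in = tf.keras.Input(shape=(50,), name="audio_features")

# Modality-specific sub-networks learn their own intermediate representations.
t = tf.keras.layers.Dense(64, activation="relu")(text_in)
a = tf.keras.layers.Dense(16, activation="relu")(audio_in)

# Fusion layer: the network learns how to weight and combine both modalities.
fused = tf.keras.layers.Concatenate()([t, a])
fused = tf.keras.layers.Dense(32, activation="relu")(fused)
output = tf.keras.layers.Dense(1, activation="sigmoid")(fused)

model = tf.keras.Model([text_in, audio_in], output)
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```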
Exploring these two fusion approaches could allow for a more sophisticated handling of the modalities, and potentially improve the accuracy and robustness of identifying mental health disorder markers.
Conclusions
In conclusion, our study presented, to our knowledge, the first comprehensive overview of linguistic and speech markers for mental health disorders. We found that the text model far outperforms the acoustic model when identifying mental health markers, although this may partly be due to bias in the binary label creation process. Furthermore, we observed that the multimodal model performs on par with the text model and in some cases even outperforms it (i.e., in the F1 scores of all models except Random Forest). Even though this does not hold for every metric, and taking into account the performance of the acoustic model, this modest increase in the F1 scores indicates that the multimodal model is indeed informed by both modalities. This is especially true for the F1 scores of the positive cases, which are more important in this study than the F1 scores of the negative cases. We therefore conclude that a multimodal model for mental health marker identification can indeed outperform unimodal approaches, while providing several research opportunities for further realizing its potential in mental health.
Data availability
The DAIC-WOZ dataset [32] used in this study is maintained by The University of Southern California Institute for Creative Technologies. The dataset can be used under license, and is available upon request from the official website at https://dcapswoz.ict.usc.edu/.
References
Spruit M, Verkleij S, de Schepper K, Scheepers F. Exploring language markers of mental health in psychiatric stories. Appl Sci (Switzerland). 2022;12(4). Article 2179. https://doi.org/10.3390/app12042179.
Cho G, Yim J, Choi Y, Ko J, Lee S. Review of Machine Learning Algorithms for Diagnosing Mental Illness. Psychiatry Investig. 2019;16(4):262–9. https://doi.org/10.30773/pi.2018.12.21.2.
Yazdavar AH, Mahdavinejad MS, Bajaj G, Romine W, Sheth A, Monadjemi AH, Thirunarayan K, Meddar JM, Myers A, Pathak J, Hitzler P. Multimodal mental health analysis in social media. PLoS ONE. 2020;15(4):e0226248. https://doi.org/10.1371/journal.pone.0226248.
Chung J, Teo J. Mental health prediction using machine learning: Taxonomy, applications, and challenges. Appl Comput Intell Soft Comput. 2022;2022. Article 9970363. https://doi.org/10.1155/2022/9970363.
Espinola C. Detection of major depressive disorder, bipolar disorder, schizophrenia, and generalized anxiety disorder using vocal acoustic analysis and machine learning. 2022. https://doi.org/10.21203/rs.3.rs-648044/v1.
Assan J, Flannery M, Gao Y, Resom A, Wu Y. Machine learning for mental health detection. 2019. https://digital.wpi.edu/pdfviewer/b8515p953. Accessed 21 Nov 2024.
Yoo H, Oh H. Depression detection model using multimodal deep learning. 2023. https://www.preprints.org/manuscript/202305.0663/v1. Accessed 21 Nov 2024.
Calvo R, Milne D, Hussain M, Christensen H. Natural language processing in mental health applications using non-clinical texts. Nat Lang Eng. 2017;23(5):649–85. https://doi.org/10.1017/S1351324916000383.
Zhang T, Schoene AM, Ji S, Ananiadou S. Natural language processing applied to mental illness detection: A narrative review. npj Digit Med. 2022;5. Article 46. https://doi.org/10.1038/s41746-022-00589-7.
Aleem S, Huda NU, Amin R, Khalid S, Alshamrani SS, Alshehri A. Machine learning algorithms for depression: Diagnosis, insights, and research directions. Electronics. 2022;11(7). Article 1111. https://doi.org/10.3390/electronics11071111.
Duong CT, Lebret R, Aberer K. Multimodal Classification for Analysing Social Media. 2017. https://arxiv.org/abs/1708.02099. Accessed 21 Nov 2024.
Dey J, Desai D. NLP based approach for classification of mental health issues using LSTM and GloVe embeddings. Int J Adv Res Sci Commun Technol. 2022;2022:347–54. https://doi.org/10.48175/ijarsct-2296.
Zadeh A, Liang P, Vanbriesen J, Poria S, Tong E, Cambria E, Chen M, Morency L. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion. 2018. https://aclanthology.org/P18-1208/. Accessed 21 Nov 2024.
Shen Y, Yang H, Lin L. Automatic depression detection: An emotional audio-textual corpus and a GRU/BiLSTM-based model. 2022. arXiv. https://arxiv.org/abs/2202.08210. Accessed 21 Nov 2024.
Amanat A, Rizwan M, Javed A, Abdelhaq M, Alsaqour R, Pandya S, Uddin M. Deep learning for depression detection from textual data. Electronics (Switzerland). 2022;11(5). https://doi.org/10.3390/electronics11050676.
De Boer J, Voppel A, Brederoo SG, Schnack H, Truong KP, Wijnen F, Sommer IEC. Acoustic speech markers for schizophrenia-spectrum disorders: A diagnostic and symptom-recognition tool. Psychol Med. 2023;53(4):1302–12. https://doi.org/10.1017/S0033291721002804.
Yin PL, Zhang L, Wu XY, Hou WS, Chen L, Tian XL, Wen HZ. Analyzing acoustic and prosodic fluctuations in free speech to predict psychosis onset in high-risk youths. In: Proceedings of the 42nd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Enabling Innovative Technologies for Global Healthcare, 20-24 July 2020. Montreal: IEEE; 2020. p. 5575–9. https://ieeexplore.ieee.org/iel7/9167168/9175149/09176841.pdf. Accessed 21 Nov 2024.
Burback L, Brémault-Phillips S, Nijdam M, McFarlane A, Vermetten E. Treatment of posttraumatic stress disorder: A state-of-the-art review. Curr Neuropharmacol. 2023;22(4):557–635. https://doi.org/10.2174/1570159X21666230428091433.
Broek E, Sluis F, Dijkstra T. Telling the story and re-living the past: How speech analysis can reveal emotions in post-traumatic stress disorder (PTSD) patients. In: Sensing Emotions: The Impact of Context on Experience Measurements. Philips Research Book Series. Dordrecht: Springer; 2010. pp. 153–80. https://doi.org/10.1007/978-90-481-3258-4_10.
Demouy J, Plaza M, Xavier J, Ringeval F, Chetouani M, Périsse D, Chauvin D, Viaux S, Golse B, Cohen D, Robel L. Differential language markers of pathology in autism, pervasive developmental disorder not otherwise specified and specific language impairment. Res Autism Spectr Disord. 2011;5(4):1402–12. https://doi.org/10.1016/j.rasd.2011.01.026.
Iverach L, Rapee R. Social anxiety disorder and stuttering: Current status and future directions. J Fluen Disord. 2014;40:69–82. https://doi.org/10.1016/j.jfludis.2013.08.003.
Coppersmith G, Dredze M, Harman C, Hollingshead K. From ADHD to SAD: Analyzing the language of mental health on Twitter through self-reported diagnoses. In: Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality. 2015. pp. 1–10.
Yang Y, Fairbairn C, Cohn J. Detecting depression severity from vocal prosody. IEEE Trans Affect Comput. 2013;4(2):142–50. https://doi.org/10.1109/T-AFFC.2012.38.
Low D, Bentley K, Ghosh S. Automated assessment of psychiatric disorders using speech: A systematic review. Laryngoscope Invest Otolaryngol. 2020;5(1):96–116. https://doi.org/10.1002/lio2.354.
Bianciardi B, Gajwani R, Gross J, Gumley AI, Lawrie SM, Moelling M, Schwannauer M, Schultze-Lutter F, Fracasso A, Uhlhaas PJ. Investigating temporal and prosodic markers in clinical high-risk for psychosis participants using automated acoustic analysis. Early Interv Psychiatry. 2023;17(3):327–30. https://doi.org/10.1111/eip.13357.
Garoufis C, Zlatintsi A, Filntisis P, Efthymiou N, Kalisperakis E, Garyfalli V, Karantinos T, Mantonakis L, Smyrnis N, Maragos P. An unsupervised learning approach for detecting relapses from spontaneous speech in patients with psychosis. In: 2021 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI). Athens; 2021. https://eprevention.gr/an-unsupervised-learning-approach-for-detecting-relapses-from-spontaneous-speech-in-patients-with-psychosis/. Accessed 21 Nov 2024.
Marmar C, Brown AD, Qian M, Laska E, Siegel C, Li M, Abu-Amara D, Tsiartas A, Richey C, Smith J, Knoth B, Vergyri D. Speech-based markers for posttraumatic stress disorder in US veterans. Depression Anxiety. 2019;36(7):607–16. https://doi.org/10.1002/da.22890.
Fusaroli R, Lambrechts A, Bang D, Bowler D, Gaigg S. Is voice a marker for autism spectrum disorder? A systematic review and meta-analysis. 2016. https://doi.org/10.1101/046565.
Von Polier G, Ahlers E, Amunts J, Langner J, Patil K, Eickhoff S, Helmhold F, Langner D. Predicting adult attention deficit hyperactivity disorder (ADHD) using vocal acoustic features. 2021. https://doi.org/10.1101/2021.03.18.21253108.
Elbow Method for Optimal Value of K in KMeans. GeeksforGeeks. https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/. Accessed 21 Nov 2024.
Silhouette Analysis. Scikit-Learn. https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html. Accessed 21 Nov 2024.
Gratch J, Artstein R, Lucas G, Stratou G, Scherer S, Nazarian A, Wood R, Boberg J, DeVault D, Marsella S, Traum D, Rizzo S, Morency L. The Distress Analysis Interview Corpus of human and computer interviews. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), May 2014, Reykjavik, Iceland. European Language Resources Association (ELRA); 2014. http://www.lrec-conf.org/proceedings/lrec2014/pdf/508_Paper.pdf. Accessed 21 Nov 2024.
Pennebaker J, Booth M, Francis R. Linguistic inquiry and word count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates; 2001.
Pennington J, Socher R, Manning C. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics; 2014. pp. 1532–43. https://doi.org/10.3115/v1/D14-1162.
Sato N, Obuchi Y. Emotion recognition using mel-frequency cepstral coefficients. Inf Media Technol. 2007;2(3):835–48. https://doi.org/10.5715/jnlp.14.4_83.
Pereira L, Salazar A, Vergara L. A Comparative Analysis of Early and Late Fusion for the Multimodal Two-Class Problem. IEEE Access. 2023;11:84283–300. https://doi.org/10.1109/ACCESS.2023.3296098.
Acknowledgements
Not applicable.
Code availability
The code for all experiments is available on GitHub at https://github.com/George-Drg/Multimodal-vs.-Unimodal-approaches-for-identifying-mental-disorder-markers.git.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
M.S. was responsible for conceptualisation. G.D. created the software, performed the experiments, validated the results and created the binary label process. The methodology was developed by G.D. and M.S.. M.S. was responsible for project administration. M.S. and E.B. were responsible for supervision. G.D. and M.S. acquired the dataset resources. M.S. provided software resources. The original draft of this manuscript was written by G.D. and M.S.. Reviewing and editing was done by M.S. and E.B.. All authors contributed to the production and proofing of the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
With reference number 2024-018, the Ethics Review Board of the Faculty of Science at Leiden University states that since we “[...] reuse the existing dataset DAIC-WOZ, for which all participants have completed a consent form, which included optional consent that allowed their data to be shared for research purposes, under the ownership of the University of Southern California, according to our checklist this data use does not need to be presented to our committee. We therefore waive the need for ethical approval for this study”.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Drougkas, G., Bakker, E. & Spruit, M. Multimodal machine learning for language and speech markers identification in mental health. BMC Med Inform Decis Mak 24, 354 (2024). https://doi.org/10.1186/s12911-024-02772-0