1 Introduction

Dementia is a set of symptoms consisting of cognitive impairment and brain function disorders [1]. Alzheimer’s disease (AD), a subtype of dementia, is the most common form of dementia among adults over the age of 65 [2]. Alzheimer’s is a neurodegenerative disorder that causes neuronal death and brain tissue loss. The early stage of AD is mild cognitive impairment (MCI), which progresses gradually and can generally be divided into two stages: early mild cognitive impairment (EMCI) and late mild cognitive impairment (LMCI) [3]. Although not all MCI patients convert to AD [4], MCI diagnosis is useful in predicting AD, with about 15% of MCI patients converting to AD every year [5] (refer to Appendix A for more information).

Diagnosing Alzheimer’s demands substantial knowledge and experience to differentiate AD patients from healthy and MCI individuals by analyzing visible variations in brain regions. Among the available diagnostic methods, magnetic resonance imaging (MRI) is the most common modality for identifying biomarkers associated with AD and cognitive impairment and for assessing the changes the disease causes in the brain. MRI is a non-invasive imaging method that uses the magnetic properties of the body to produce high-resolution 2D and 3D images of any part of the body. In general, humans cannot easily and accurately detect abnormal patterns or recognize the specific characteristics related to AD [6]. Computer-aided diagnosis (CAD) systems can therefore provide better diagnostic suggestions by analyzing these patterns and changes in the brain efficiently. In recent decades, machine learning (ML) algorithms have played a crucial role in CAD systems for AD prediction and diagnosis [7]. Various artificial intelligence (AI) approaches based on ML and deep learning (DL) have been investigated for brain disorders using MRI. In the recognition of AD biomarkers, research has demonstrated that ML methods have performed better than DL algorithms, even with smaller databases [8]. In various medical applications, ML-based methods are often preferred over DL techniques because they can better explain the factors influencing the disease and the changes it causes. Despite the advantage of DL algorithms in handling complex data such as MRI scans, ML remains a practical and straightforward approach to implement, providing detailed insights into the causes of Alzheimer’s and its subtypes [8].

The study in [9] introduced an approach to identify AD using ML techniques. The authors used histograms to transform brain images into feature vectors containing the relevant brain features, which then served as inputs to the classification step. They proposed a random forest (RF) classifier for discriminating AD subjects from control subjects, which achieved an accuracy of 85.77%.

In [10], a deep learning multi-layer perceptron classification method was proposed for distinguishing AD patients, MCI patients, and healthy persons based on the texture of the hippocampus (HC). This research obtained 72.5%, 85%, and 75% for AD vs. MCI, AD vs. normal controls, and MCI vs. normal controls, respectively.

The study in [11] proposed a transfer learning framework based on the convolutional neural network (CNN) architecture for classifying Alzheimer’s images into four classes: healthy person, EMCI, LMCI, and AD. The authors used layer-wise transfer learning together with tissue segmentation of brain images and achieved 98.73% for AD vs. healthy person, 83.72% for EMCI vs. LMCI, and more than 80% for the other groups.

Early diagnosis of Alzheimer’s and of the different stages of cognitive impairment is important for preventing their progression. Previous papers have paid less attention to diagnosing the various stages of AD and MCI. This study aims to diagnose the different stages of MCI with significant accuracy and to distinguish them from Alzheimer’s and healthy people using MRI image processing methods. We used a unique combination of brain area sizes and hippocampus grayscale statistical and texture features derived from 2D-MRI images to diagnose AD, CN, EMCI, and LMCI. Specific brain features, such as the hippocampus and lateral ventricle sizes, were extracted on the basis of established clinical knowledge to provide detailed insights into the structural and textural changes associated with AD. This approach enhances early diagnostic accuracy and aids in the development of diagnostic biomarkers for different stages of MCI and AD. By focusing on transparency and specific feature analysis, it can be a valuable complement to existing DL research.

The proposed method provides a more comprehensive understanding of how the disease progresses than studies focusing on a single type of feature. In addition, our analysis of each extracted feature’s strength and direction of influence on the classification of each group, supported by correlation and p-value calculations, provides new insight into the relationship between brain structure changes and AD. Furthermore, by demonstrating the effectiveness of 2D-MRI imaging data, our study extends the application of advanced ML-based diagnostic techniques to clinical environments where 3D-MRI imaging may not be feasible, and it facilitates diagnosis and intervention in a shorter time with significant accuracy. ML algorithms are recommended over DL algorithms in the proposed method because they are more interpretable and allow us to understand which brain features in MRI images have the most impact on the classification of AD and how these features change as the disease progresses. Our dataset’s relatively small size also makes ML more suitable, because its risk of overfitting is lower than that of DL models. In addition, ML models are less computationally intensive, which makes them practical for various clinical applications.

2 Methods

The goal of this study is to present a method for distinguishing six pairs of groups: CN vs. AD, EMCI vs. LMCI, CN vs. EMCI, CN vs. LMCI, AD vs. EMCI, and AD vs. LMCI, using two-dimensional T1-weighted coronal MRI of the brain, combined MRI features, and ML algorithms. The proposed method supplies the ML model with efficient and optimized features extracted from brain MRI images to enhance diagnostic accuracy, with potential for clinical use. Furthermore, we assessed correlation and dependence metrics to examine the strength and direction of association between each extracted feature and the classification of each group. These techniques are well established and provide valid results. One of our main goals was to confirm the clinical relevance of the extracted features based on the analysis of MRI images and available neurobiological knowledge, and to show how the extracted features change as AD and MCI progress.

In this study, the labels 0, 1, 2, and 3 were used to represent the CN, AD, EMCI, and LMCI groups, respectively. The feature and target matrices used throughout the machine learning procedure were filled with the information of each group in the same order. The dataset was randomly split into 70% for training and 30% for testing. This process was repeated 20 times to verify the stability of the model’s performance and of the obtained results. Each trial used a different random seed to generate unique training and testing splits. The classifier’s performance was reported as both the average result and the best result across these 20 trials.
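The repeated 70/30 evaluation described above can be sketched as follows. This is a minimal illustration rather than the study's actual code: the function names, the `base_seed` parameter, the use of NumPy, and the plain (non-stratified) shuffling are all assumptions, and the accuracies passed to `summarize` are hypothetical placeholder values.

```python
import numpy as np

def repeated_holdout_splits(n_samples, n_trials=20, test_fraction=0.3, base_seed=0):
    """Return a list of (train_idx, test_idx) pairs, one per trial."""
    n_test = int(round(n_samples * test_fraction))
    splits = []
    for trial in range(n_trials):
        # Assumption: a different seed per trial gives a different shuffle.
        rng = np.random.default_rng(base_seed + trial)
        order = rng.permutation(n_samples)
        splits.append((order[n_test:], order[:n_test]))
    return splits

def summarize(scores):
    """Report the average and the best score over all trials."""
    scores = np.asarray(scores, dtype=float)
    return {"mean": float(scores.mean()), "best": float(scores.max())}

# One binary task with 100 subjects per group gives 200 samples in total.
splits = repeated_holdout_splits(n_samples=200)
summary = summarize([0.80, 0.90, 0.85])  # hypothetical per-trial accuracies
```

In each trial, a classifier would be fitted on the rows selected by `train_idx` and scored on the rows selected by `test_idx`; the per-trial scores are then passed to `summarize`.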

This section describes the proposed ML-based MRI image processing procedure. To implement the proposed model for classifying patients and healthy persons, ImageJ and MATLAB (R2023a) were used for the preprocessing, segmentation, and feature extraction steps, and Spyder (Python 3.10) was used for the classification process. Fig. 1 shows the proposed ML algorithms for each step of the MRI image analysis.

Fig. 1 The proposed image processing procedure based on the machine learning algorithms

2.1 Database

In this paper, the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database has been utilized. The data resources from the North American ADNI include MRI and PET medical images, clinical information, cognitive test scores, and biomarkers for people with Alzheimer’s, people at different stages of cognitive impairment, and healthy persons (https://adni.loni.usc.edu). This database collects validated data and provides a platform for researchers working on Alzheimer’s disease. The dataset used in this study includes two-dimensional T1-weighted MRI images in the coronal section from individuals in 4 categories: cognitively normal (CN), EMCI, LMCI, and AD, obtained from ADNI1, ADNI2, ADNI3, and ADNI-GO data. Each category comprises 100 subjects, balanced by gender and aged between 18 and 96. The MRI protocol and the clinical information of the subjects included in this study are shown in Fig. 2 and Table 1, respectively.

Fig. 2 The database imaging protocol

Table 1 The clinical information of subjects

2.2 Preprocessing and segmentation

After collecting the data, a series of preprocessing steps was applied to transform the raw MRI images into meaningful and usable information for subsequent analysis. We applied a 3x3 median filter to remove noise and a contrast enhancement technique to adjust the gray levels and improve image visibility, so that objects in the images can be distinguished more easily. In this step, it is important to improve image quality while preserving edge and texture information. Then, skull stripping was applied to remove the skull and non-brain tissues from the images. Skull stripping is typically based on thresholding techniques. Its quality can be affected by various factors, including the imaging protocol and the MRI scanner [12]. Furthermore, variability in anatomy, age, and the extent of brain atrophy, especially in different stages of MCI and AD, also has an impact on skull stripping [13]. Therefore, this step was performed carefully, and the result was evaluated for each patient’s MRI image.
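The 3x3 median filtering step can be illustrated with a short NumPy sketch. It is not the MATLAB routine used in the study: the function name and the edge-replicating border handling are assumptions (other implementations pad the border differently), and the toy image is synthetic.

```python
import numpy as np

def median_filter_3x3(image):
    """Replace each pixel by the median of its 3x3 neighborhood.

    Assumption: borders are handled by replicating the edge pixels.
    """
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    # Stack the nine shifted views of the padded image, then take the median.
    windows = np.stack(
        [padded[r:r + h, c:c + w] for r in range(3) for c in range(3)]
    )
    return np.median(windows, axis=0).astype(image.dtype)

# Synthetic example: a flat region with one isolated bright ("salt") pixel.
noisy = np.full((5, 5), 10, dtype=np.uint8)
noisy[2, 2] = 255
clean = median_filter_3x3(noisy)
```

Because the median ignores isolated outliers, the single bright pixel is removed while the flat region is left unchanged, which is why this filter is commonly chosen when edges must be preserved.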

After preprocessing, the regions of interest, namely the hippocampus, lateral ventricles (LV), gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF), were segmented. The two regions highlighted in the proposed method, the hippocampus and the lateral ventricles, are presented in Fig. 3.

Fig. 3 The two-dimensional T1-weighted coronal MRI images for each category: CN, EMCI, LMCI, and AD (from left to right). The hippocampus regions and their magnifications are shown in the rectangles with the yellow dashed line, and the lateral ventricle regions are shown with the red dashed line

The hippocampus is a curved region of the brain. Throughout the progression of MCI and AD, this region atrophies and its shape changes, and this change can differ from patient to patient. The segmentation of this region is therefore challenging, especially in 2D MRI images. In this study, we used the ImageJ tool to draw polygons around the hippocampus, which can provide high segmentation accuracy for this region in all groups. Then, we used multi-thresholding based on the Otsu technique to segment the lateral ventricle, gray matter, white matter, and cerebrospinal fluid. Otsu’s technique is a thresholding approach that can effectively separate the pixels of an image into distinct groups based on the image histogram [14, 15]. This technique is particularly effective in complex medical imaging applications [16, 17]. In this study, Otsu’s thresholding was implemented using MATLAB functions, which calculate the global threshold level by maximizing the between-class variance \(\sigma _B^2(T)\), defined as:

$$\begin{aligned} \sigma _B^2(T) = \omega _1(T) \omega _2(T) \left[ \mu _1(T) - \mu _2(T)\right] ^2 \end{aligned}$$
(1)

where T is the selected threshold that divides the two classes, \(\mu _1(T)\) and \(\mu _2(T)\) are the mean intensities of each class, and \(\omega _1(T)\) and \(\omega _2(T)\) indicate the probabilities of each class being separated by threshold T. This function determines the threshold T that maximizes \(\sigma _B^2(T)\) and yields optimal segmentation of the image containing the relevant regions.
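As an illustration of Eq. (1), the following NumPy sketch searches all candidate thresholds for the one that maximizes the between-class variance in the two-class case. It is an assumption-laden sketch, not the MATLAB implementation used in the study: the function name is hypothetical, the image is assumed to hold non-negative integer intensities below `n_levels` (for example 8-bit data), and the threshold is returned as an integer gray level rather than a normalized value.

```python
import numpy as np

def otsu_threshold(image, n_levels=256):
    """Return the gray level T that maximizes sigma_B^2(T) from Eq. (1).

    Class 1 holds pixels with intensity <= T, class 2 holds the rest.
    """
    counts = np.bincount(image.ravel(), minlength=n_levels)
    levels = np.arange(n_levels)
    total = counts.sum()
    n1 = np.cumsum(counts)           # number of pixels in class 1 for each T
    n2 = total - n1                  # number of pixels in class 2 for each T
    s1 = np.cumsum(counts * levels)  # intensity sum of class 1 for each T
    s2 = s1[-1] - s1                 # intensity sum of class 2 for each T
    valid = (n1 > 0) & (n2 > 0)      # both classes must be non-empty
    omega1 = n1[valid] / total
    omega2 = n2[valid] / total
    mu1 = s1[valid] / n1[valid]
    mu2 = s2[valid] / n2[valid]
    sigma_b2 = np.zeros(n_levels)
    sigma_b2[valid] = omega1 * omega2 * (mu1 - mu2) ** 2
    return int(np.argmax(sigma_b2))

# Synthetic two-intensity image: a dark half and a bright half.
image = np.zeros((4, 6), dtype=np.uint8)
image[:, :3] = 50
image[:, 3:] = 200
t = otsu_threshold(image)
mask = image > t  # pixels assigned to the brighter class
```

Multi-level thresholding generalizes the same criterion to several thresholds at once; this sketch covers only the single-threshold case that Eq. (1) describes.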

2.3 Feature extraction

This study employed a comprehensive set of features related to brain structural and tissue changes, together with clinical evaluations, to maximize diagnostic accuracy in distinguishing different stages of MCI and AD. The features were extracted using MATLAB functions. After segmentation of the regions of interest, morphological features, namely the area sizes of the hippocampus, lateral ventricles, gray matter, white matter, and cerebrospinal fluid, were extracted to assess brain structural changes. First-order statistical features (mean, standard deviation, skewness, and kurtosis) were computed from the intensity values of the hippocampus to evaluate tissue changes in this region. Additionally, texture features were derived from the gray-level co-occurrence matrix (GLCM) to quantify the contrast, correlation, homogeneity, energy, and entropy of the hippocampus region as measures of tissue heterogeneity. Invariant moments (IM, with 7 components) were also computed to describe shape changes and patterns in this region. Furthermore, a Haar wavelet transform (WT, with 65536 components) was applied to obtain multi-resolution features from the entire brain. Clinical information, namely the age, weight, and mini-mental state examination (MMSE) score of each patient, was also used.
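The first-order and GLCM features listed above can be sketched in NumPy as follows. This is a hedged illustration, not the MATLAB code used in the study. Several choices are assumptions: population (biased) formulas and non-excess kurtosis for the first-order statistics, a single horizontal one-pixel offset and a non-symmetric matrix for the GLCM, the entropy definition \(-\sum p \log_2 p\) over the GLCM entries, and an input that is already quantized to `n_levels` integer gray levels. All function names are hypothetical.

```python
import numpy as np

def first_order_stats(values):
    """Mean, standard deviation, skewness, and kurtosis of intensity values."""
    x = np.asarray(values, dtype=float).ravel()
    mean, std = x.mean(), x.std()  # population standard deviation
    z = (x - mean) / std           # assumes the values are not all equal
    return {"mean": mean, "std": std,
            "skewness": np.mean(z ** 3), "kurtosis": np.mean(z ** 4)}

def glcm_horizontal(image, n_levels):
    """Normalized co-occurrence matrix for each pixel and its right neighbor."""
    ref = image[:, :-1].ravel()
    nbr = image[:, 1:].ravel()
    counts = np.zeros((n_levels, n_levels))
    np.add.at(counts, (ref, nbr), 1)  # count every (reference, neighbor) pair
    return counts / counts.sum()

def glcm_features(p):
    """Contrast, correlation, energy, homogeneity, and entropy of a GLCM."""
    i, j = np.indices(p.shape)
    mu_i, mu_j = (i * p).sum(), (j * p).sum()
    sd_i = np.sqrt(((i - mu_i) ** 2 * p).sum())
    sd_j = np.sqrt(((j - mu_j) ** 2 * p).sum())
    nonzero = p[p > 0]
    return {
        "contrast": ((i - j) ** 2 * p).sum(),
        "correlation": ((i - mu_i) * (j - mu_j) * p).sum() / (sd_i * sd_j),
        "energy": (p ** 2).sum(),
        "homogeneity": (p / (1 + np.abs(i - j))).sum(),
        "entropy": -(nonzero * np.log2(nonzero)).sum(),
    }

# Synthetic 4-level patch standing in for a segmented region.
patch = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 2, 2, 2],
                  [2, 2, 3, 3]])
stats = first_order_stats([1, 2, 3, 4, 5])
p = glcm_horizontal(patch, n_levels=4)
texture = glcm_features(p)
```

Together with the five area sizes and the three clinical variables, the four first-order values and five texture values form a compact, interpretable feature vector per subject.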

Subsequently, feature normalization was applied to improve the speed and efficiency of the machine learning procedure [18]. Feature normalization rescales each input feature by subtracting its minimum and dividing by its range, so that all values lie between 0 and 1 [19].
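Min-max normalization as described above maps each feature value \(x\) to \((x - x_{min}) / (x_{max} - x_{min})\). A minimal NumPy sketch follows; the function names are hypothetical, and estimating the minimum and range on the training split only (then reusing them for the test split) is an assumption that the text does not specify. The numbers in the example are made up for illustration.

```python
import numpy as np

def fit_min_max(train):
    """Return the per-feature minimum and range of the training matrix."""
    train = np.asarray(train, dtype=float)
    lo = train.min(axis=0)
    span = train.max(axis=0) - lo
    span[span == 0] = 1.0  # constant feature: avoid division by zero
    return lo, span

def apply_min_max(x, lo, span):
    """Rescale features with a previously fitted minimum and range."""
    return (np.asarray(x, dtype=float) - lo) / span

# Hypothetical rows with two features each.
train = [[60.0, 20.0], [80.0, 30.0], [70.0, 25.0]]
test = [[90.0, 25.0]]
lo, span = fit_min_max(train)
scaled_train = apply_min_max(train, lo, span)
scaled_test = apply_min_max(test, lo, span)
```

With this design, training values always fall in [0, 1], while unseen values can fall slightly outside that interval when they exceed the training range.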

High-dimensional feature matrices and overfitting pose challenges for researchers and engineers in machine learning and data mining. Feature selection is an effective way to address these issues: it eliminates irrelevant and redundant data, thereby reducing computation time and improving the learning procedure [20]. In this study, the feature matrix was large, mainly because of the WT features. We therefore examined two feature selection approaches: principal component analysis (PCA) as the first approach, and building feature matrices from different combinations of the extracted features as the second.

PCA is a statistical dimensionality reduction algorithm that transforms a large feature matrix into a smaller one by retaining the components that capture the most variance [21]. Our feature matrix was reduced to 100 features using PCA, and we evaluated our classifiers using the first 20 features of this matrix. Additionally, as mentioned before, we manually combined the extracted features to find, among the 65560 features, the feature matrix that best classifies all groups.
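The PCA step (reduce to 100 components, then keep the first 20) can be sketched with a singular value decomposition in NumPy. This is an illustrative sketch under stated assumptions, not the study's implementation: the function names are hypothetical, and the random matrix merely stands in for a real feature matrix with far more columns.

```python
import numpy as np

def pca_fit(x, n_components):
    """Return the feature means, principal directions, and their variances."""
    mean = x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, s, vt = np.linalg.svd(x - mean, full_matrices=False)
    explained = s ** 2 / (x.shape[0] - 1)
    return mean, vt[:n_components], explained[:n_components]

def pca_transform(x, mean, components):
    """Project centered data onto the principal directions."""
    return (x - mean) @ components.T

# Synthetic stand-in: 120 samples with 500 features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(120, 500))
mean, components, explained = pca_fit(features, n_components=100)
scores = pca_transform(features, mean, components)
first_20 = scores[:, :20]  # the leading components carry the most variance
```

Because the components are sorted by explained variance, slicing the first 20 columns keeps the 20 directions along which the data vary most.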

2.4 Classification

Classification is the process of predicting or identifying the class of new data based on a training dataset of extracted, informative features [22]. In this study, we investigated several ML methods for classifying AD and different stages of MCI. These methods were used to distinguish 6 binary groups: CN vs. AD, EMCI vs. LMCI, CN vs. EMCI, CN vs. LMCI, AD vs. EMCI, and AD vs. LMCI. K-nearest neighbors (KNN) computes the distance between a new sample and all training samples and assigns the new sample to the class chosen by a majority vote of its nearest neighbors [8]. Support vector machine (SVM) categorizes new data by finding the optimal hyperplane, that is, the one with the maximum margin between classes [8, 23]. Decision tree (DT) is a model that recursively divides the data based on the most significant feature [24]. Random forest (RF) combines several decision tree models to provide more accurate predictions [8, 25]. Lastly, a multi-layer perceptron (MLP) is a feed-forward neural network consisting of neuron layers and activation functions that can capture complex patterns in data [26, 27].

In the proposed method, KNN uses \(k=7\) neighbors (we assessed this classifier with values of \(k\) from 3 to 10, and \(k=7\) gave the best result) and the Euclidean metric for distance computation. SVM uses the linear kernel and \(C = 1\) for regularization. DT uses the Gini criterion for measuring the quality of a split, the default best strategy for choosing the split at each node, and an unlimited depth, so that nodes are expanded until all leaves are pure or until all leaves have fewer than the minimum number of samples required to split an internal node. RF consists of 100 trees with the same parameters as those defined for the DT algorithm. MLP includes 100 hidden layers, the ReLU activation function, and the Adam optimizer. These classifiers, with the parameters described, are used to recognize the 6 target binary groups.
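To make the KNN decision rule concrete, the sketch below implements majority voting over the 7 nearest training samples under the Euclidean metric in plain NumPy. It is a simplified illustration rather than the library routine used in the study: the function name is hypothetical, class labels are assumed to be non-negative integers, and the two-cluster data are synthetic.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k=7):
    """Classify each row of x_new by majority vote of its k nearest neighbors."""
    # Pairwise Euclidean distances, shape (n_new, n_train).
    diff = x_new[:, None, :] - x_train[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    nearest = np.argsort(dist, axis=1)[:, :k]  # indices of the k closest samples
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Synthetic training data: class 0 near the origin, class 1 near (5, 5).
class0 = np.array([[0.1 * i, 0.0] for i in range(10)])
class1 = np.array([[5.0 + 0.1 * i, 5.0] for i in range(10)])
x_train = np.vstack([class0, class1])
y_train = np.array([0] * 10 + [1] * 10)
x_new = np.array([[0.2, 0.1], [5.3, 4.9]])
predicted = knn_predict(x_train, y_train, x_new)
```

An odd value such as \(k=7\) rules out tied votes in a two-class problem, which is one practical reason to prefer odd neighbor counts for binary classification.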

2.5 Statistical analyses

To investigate the performance of the classifiers and assess the impact of each extracted feature on the diagnosis of the six target groups, the accuracy, sensitivity, and specificity were calculated as follows:

$$\begin{aligned} Accuracy = \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(2)
$$\begin{aligned} Sensitivity = \frac{TP}{TP+FN} \end{aligned}$$
(3)
$$\begin{aligned} Specificity = \frac{TN}{TN+FP} \end{aligned}$$
(4)

where patients carry the positive label and healthy subjects carry the negative label. TP denotes true positives (persons correctly identified as patients), FN denotes false negatives (persons incorrectly identified as healthy), FP denotes false positives (persons incorrectly identified as patients), and TN denotes true negatives (persons correctly identified as healthy). Accuracy measures the ability of a model to differentiate patient and healthy cases correctly. Sensitivity measures the ability of a model to identify patient cases correctly. Specificity measures the ability of a model to identify healthy cases correctly [28].
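Equations (2) to (4) translate directly into code. The short sketch below counts TP, TN, FP, and FN from true and predicted labels and evaluates the three metrics. The function name, the `positive` parameter, and the example labels are hypothetical; for pairs without a healthy group (for example EMCI vs. LMCI), which class is treated as positive is an assumption the caller has to make.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity, and specificity as in Eqs. (2)-(4)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # assumes at least one positive case
        "specificity": tn / (tn + fp),  # assumes at least one negative case
    }

# Hypothetical labels: 1 = patient, 0 = healthy.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

In this example there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so all three metrics equal 0.75.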

To examine the strength and direction of association between each extracted feature and classification of each group, the Spearman and Pearson coefficients (correlation coefficients and P-value), and mutual information were calculated. This exploration aimed to identify which features contain crucial information for diagnosing different stages of MCI and AD (refer to Appendix B for more information).
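The two correlation coefficients can be sketched in NumPy as follows, with Spearman's coefficient computed as the Pearson correlation of average ranks so that tied values (unavoidable with binary class labels) are handled. This is only an illustration under stated assumptions: the function names are hypothetical, the feature values are made up rather than taken from the study, and the P-values and mutual information that the study also reports are not computed here.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Made-up feature values for six subjects and their binary class labels.
feature = [30, 29, 28, 22, 20, 18]
label = [0, 0, 0, 1, 1, 1]
r = pearson_r(feature, label)
rho = spearman_rho(feature, label)
```

The sign of each coefficient gives the direction of the association (here the feature decreases as the label goes from 0 to 1), and its magnitude gives the strength.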

3 Results

As mentioned before, we extracted 65560 features in this study. As a first approach, we used PCA to reduce the feature dimension to 100 features and evaluated the first 20 of these 100 features. As a second approach, we manually combined the extracted features to find, among the 65560 features, the feature matrix that best classifies all groups. We found that the feature matrix combining all extracted features except the WT and IM features performed well (65560 (all) - 65536 (WT) - 7 (IM) = 17). The WT and IM features did not give acceptable results, so they were removed from the feature matrix. The remaining 17 features are: HC area size, HC grayscale statistics (mean, standard deviation, skewness, kurtosis, contrast, correlation, energy, homogeneity, entropy), lateral ventricle (LV) area size, gray matter area size, white matter area size, cerebrospinal fluid area size, patient age, weight, and cognitive (MMSE) score. Then, the 20 features obtained from PCA were compared with the 17 features obtained from the manual combination. Of the two approaches, the 17 features performed better. Consequently, this study focused on these 17 features for further analysis.

To identify which of these 17 features contain the most crucial information for diagnosing different stages of MCI and AD, and how these features change with AD progression, the strength and direction of association between each extracted feature and the classification of each group were examined by calculating the Spearman and Pearson coefficients (correlation coefficients and P-values) and mutual information. The results of these metrics are shown in Figs. S1-S5 in the supplementary file.

The five proposed classifiers were compared with each other based on the 17 selected features. The results obtained for each binary classification (CN vs. AD, EMCI vs. LMCI, CN vs. EMCI, CN vs. LMCI, AD vs. EMCI, and AD vs. LMCI) are shown in Table 2. This table presents the average of each performance metric over 20 trials and the best result obtained during these trials.

According to Table 2, the best average classification accuracies obtained by the proposed method over 20 trials are 95% (SVM), 71.50% (RF), 82.58% (RF), 84.91% (SVM), 85.83% (RF), and 85.08% (RF) for CN vs AD, EMCI vs LMCI, CN vs EMCI, CN vs LMCI, AD vs EMCI, and AD vs LMCI, respectively. In addition, the best accuracies obtained in any of the 20 trials are 100% (SVM, RF, and MLP), 83.33% (RF), 91.66% (RF), 95% (SVM and MLP), 96.66% (RF), and 93.33% (DT) for the same groups, respectively.

Table 2 The classifier’s performance comparison for each group; CN vs AD, EMCI vs LMCI, CN vs EMCI, CN vs LMCI, AD vs EMCI, and AD vs LMCI

4 Discussion

In this study, we proposed a medical image processing technique based on ML algorithms that identifies which features and classifiers perform well in distinguishing six pairs of groups: CN vs AD, EMCI vs LMCI, CN vs EMCI, CN vs LMCI, AD vs EMCI, and AD vs LMCI. This discrimination is achieved through the analysis of two-dimensional T1-weighted coronal MRI. In addition, we conducted correlation, dependence, and mutual information analyses to examine the strength and direction of association between each extracted feature and the classification of each group. According to the statistical figures, the Spearman and Pearson correlations showed similar performance on average. Our analysis of the extracted features revealed several key insights into their effectiveness. The results (coefficients and P-values) indicate that the MMSE score feature had the strongest correlation with the classifications of CN vs AD, CN vs. LMCI, AD vs. EMCI, and AD vs. LMCI. Its high correlation across these classifications emphasizes its role as an important indicator of cognitive impairment. Additionally, based on these coefficients and P-values, the hippocampus grayscale entropy feature for the EMCI vs. LMCI classification and the hippocampus grayscale correlation feature for the CN vs. EMCI classification showed the strongest dependence. These features reflect changes in hippocampal tissue properties relevant to disease progression. Furthermore, the MI figure shows that the MMSE score feature carries high mutual information for the classifications of CN vs AD, AD vs. EMCI, and AD vs. LMCI. The WM area size feature is highly informative for the EMCI vs. LMCI classification, which highlights structural changes in white matter associated with cognitive decline, and the hippocampus grayscale entropy feature is highly informative for the CN vs. EMCI and CN vs LMCI classifications.
In general, the hippocampus and lateral ventricle sizes also demonstrated significant potential for the classification of all groups. This result aligns with established medical reports and supports these measures as reliable biomarkers for AD and different stages of MCI. These findings highlight the importance of considering both correlation and mutual information analyses to identify features containing crucial information for diagnosing AD and MCI, thereby aiding in the development of effective diagnostic biomarkers.

In evaluating the performance of the classifiers, our results show that the various models have different strengths. The SVM classifier achieved the highest average accuracy for the classification of CN vs. AD (95%) and performed remarkably well for the other group classifications. This shows that SVM is highly effective in separating groups whose features differ significantly. The RF classifier also performed strongly, achieving the best average accuracy in four of the six classifications (71.50% to 85.83%) and a best single-trial accuracy of 96.66% for AD vs. EMCI. The adaptability and robustness of RF make this classifier appropriate for different feature sets and complex classification tasks. Moreover, the MLP and DT classifiers showed competitive performance, notably achieving 100% accuracy in several groups. Although the KNN classifier did not match the performance of SVM or RF, it showed potential for situations where understanding local data patterns is essential for accurate classification. These results emphasize the effectiveness of SVM and RF in diagnosing AD and its subtypes. The diverse performance of the classifiers indicates that different kinds of models have different strengths in data classification and that, to achieve the best results, the model should be selected according to the specific classification task and the properties of the features.

In our proposed method, we employed several strategies to avoid or reduce possible overfitting. First, to reduce the data dimension, we carefully selected the most relevant features so that the models would not become overly complex and fit noise in the data. Second, we used a repeated random-split validation scheme, randomly splitting the dataset into training and testing data and repeating this process 20 times with different random seeds. This approach allows us to validate the performance of the model on different subsets of the data and avoids overfitting to any particular subset. Additionally, we used different classification methods, including ensemble methods such as RF, which combines several DT models. Notably, ML techniques were preferred over DL models in the proposed method because they perform better on smaller datasets and, having fewer parameters, reduce the risk of overfitting. Collectively, these approaches help to enhance the reliability of our framework in classifying different stages of MCI and AD.

5 Conclusion

This study aimed to provide a method that uses a comprehensive set of features related to brain structural and tissue changes, together with clinical evaluations, to maximize diagnostic accuracy in distinguishing different stages of AD and MCI. Several strategies were employed in the proposed method to avoid or reduce possible overfitting. This study also investigated the effect of each extracted feature on the classification of the target groups and evaluated the relationship between each feature and the target groups, in order to better understand how brain regions and tissue properties change with AD and MCI. The findings highlight the significance of using 2D-MRI image processing techniques and statistical analyses to identify key features that contain crucial information for diagnosing AD and different stages of MCI. This approach can enhance diagnostic accuracy and be used in clinical applications to assist specialists. Early diagnosis of AD and the differentiation of its symptoms from normal age-related cognitive decline and the various stages of MCI have long been challenging problems. Consequently, studies like this one can contribute significantly to the development of important AD-related biomarkers and to the prediction of AD.

For future work, we plan to use a larger database and automated feature selection techniques in the ML models. This approach can combine the strengths of neurobiological knowledge and extracted features to improve model robustness and reveal new insights into AD progression. We also aim to increase the classification accuracy for each group by integrating different 2D views, including sagittal and axial views. Additionally, we are interested in combining two modalities, MRI and PET, to obtain more information about the brain changes related to AD and to enhance the accuracy of prediction and diagnosis.