1 Introduction

Diabetes is a silent chronic disease that occurs when the body does not produce sufficient insulin or when body cells develop resistance to insulin. Insulin is a hormone produced by the pancreas, a gland located below the stomach. Insulin is essential for controlling blood glucose levels because it helps cells take up glucose from the bloodstream so that it can either be used immediately for energy or stored for future use. When blood glucose levels remain elevated over time without being controlled (diabetes), the body experiences severe health issues such as lower limb loss, blurred vision, heart disease, and stroke [1].

There are three main forms of diabetes mellitus. Type 1 diabetes (T1D) is a state where the pancreas either does not generate insulin or cannot generate sufficient insulin. T1D is more common in children and young people. T1D patients who are at very high risk need intensive medical attention [2]. Type 2 diabetes (T2D) is a state in which the produced insulin is not able to keep the blood glucose level stable throughout the body, and it most commonly occurs in individuals over the age of 40. T2D is the most common type of diabetes, accounting for 90–95% of all diagnosed cases worldwide [3]. Another form, known as “gestational diabetes”, occurs when the body’s tissues do not react to insulin, even though the pancreas generates normal insulin levels. If gestational diabetes is not treated, it can increase the chance of developing T2D in the future [4]. According to the World Health Organization (WHO), approximately 1.9 million deaths were attributed to diabetes in 2019, and diabetes is considered one of the leading causes of death globally [1]. The International Diabetes Federation (IDF) reported that the number of diabetes-affected individuals would increase to 783 million by 2045 [5]. Therefore, classifying and estimating the likelihood of diabetes can significantly reduce healthcare expenses [6]. In such situations, an integration of the Internet of Medical Things (IoMT) and machine learning (ML) algorithms could help medical professionals detect and diagnose diabetes earlier by offering predictive tools that enable more rapid and effective decision making.

The goal of this study is to create an IoMT application that can classify diabetes through an e-diagnosis model. Artificial intelligence (AI) models are extensively used for building classification models using medical data. ML is a branch of AI that focuses on designing algorithms that are capable of handling challenging tasks such as classification, prediction, or evaluating massive amounts of data. Recent studies [7,8,9] have suggested many ML algorithms for classifying diabetes data, such as decision tree (DT), logistic regression (LR), and XGBoost classifiers (XGB) [8, 9]. However, DTs attain lower accuracy because of data imbalance and the lack of a feature selection algorithm. The LR algorithm is sensitive to imbalanced data, where one class has many more samples than the other, resulting in poor accuracy. If the data contain outliers and missing values, the LR method might also produce inaccurate classifications. The accuracy of the XGBoost classifier is degraded by the poor quality of data collected from heterogeneous sources.

In addition to the above issues, the datasets are highly imbalanced, with a high degree of bias toward a particular class. Thus, data balancing and handling missing data are crucial steps. Furthermore, data preprocessing must handle missing data in a way that does not introduce bias into the results. This work addresses this crucial aspect by handling the missing values and outliers via mean imputation and the interquartile range (IQR) technique, respectively. Mean imputation is a simple technique with no complex estimation, and it allows quick preprocessing of data. The IQR outlier detection technique is used in this work so that values lying far outside the typical range of each feature are eliminated from the dataset. In addition, random oversampling is added to address the class imbalance and to increase the accuracy of the model. These techniques can enhance the overall performance of the system and help achieve better accuracy.

Although ML algorithms are pervasive in healthcare sectors, their real clinical application rate is quite low because of the lack of explanation of significant features and the way in which the choice of features impacts the performance of the model via feature extraction.

The contributions of this research are as follows:

  • Missing values and outliers are handled via the mean imputation and IQR techniques, respectively.

  • Significant features are selected by introducing the PCA and Boruta algorithms.

  • Four different ML models, the light gradient boosting method, the gradient boosting algorithm, the RF, and the DT classifier, are proposed to classify diabetes.

  • A novel ML framework for the classification of diabetes mellitus was designed.

The remainder of the article is structured as follows. Section 2 presents related work based on predictive ML techniques for diabetes classification and the use of boosting techniques to improve model performance. Section 3 presents the details of the benchmarking PIMA and BRFSS datasets and presents the data distribution analysis. Section 4 describes the proposed approach, where IQR and random oversampling techniques are presented. Section 5 presents the performance analysis of the above datasets on select parameters, with a comparative analysis with previous approaches, and finally, Sect. 6 presents the conclusion and future scope of the work.

2 Related Work

The Internet of Medical Things (IoMT) refers to the use of the Internet of Things (IoT) in the medical industry. IoMT links medical devices and their associated applications with health care information technology (IT). This advancement has transformed the medical field with its innovative remote medical care model by means of social advantages and accurate diagnosis [6]. The continuous computing offered by the IoT makes it simpler to manage healthcare resources such as clinical data, prescriptions, medical equipment, and treatments. The growth of IoMT has significantly transformed the management of disease, improved disease diagnostic and treatment approaches, and reduced healthcare expenses and mistakes. The utility of IoMT is increasing as a consequence of the synergistic emergence of AI. However, one of the major challenges that many researchers have faced with this progress is the sheer volume of data generated. ML technology, which excels at the analysis, interpretation, and extraction of useful information from vast amounts of data, can help make sense of the collected data.

In particular, the integration of AI and IoMT can offer two advantages for the detection and management of chronic diseases. The first advantage is that an AI-powered e-diagnostic model efficiently evaluates and classifies patient data collected in the IoMT to generate the initial diagnosis. It can also support a physician’s ultimate diagnosis and treatment plan formulation. Another advantage is that this e-diagnosis technology enables remote patient monitoring and supervision for those with chronic diseases. Figure 1 shows an e-diagnosis technology for diabetes classification in IoMT patients. This system uses ML technology to classify diabetes using patient data, provides physicians with an initial diagnosis, and offers patients feedback on their doctor’s recommendations for blood glucose monitoring and diet. Furthermore, the IoMT facilitates the interoperability of medical equipment, applications, and systems, as depicted on the left-hand side of Fig. 1. As a result, regardless of whether they are rural hospitals or large healthcare institutions, doctors from various health care organizations can share and evaluate patients’ data remotely via the internet. This allows for a significant reduction in the number of medical files and eliminates the need for the patient to visit the same hospital or even for subsequent appointments in person.

Fig. 1 E-diagnosis technology for diabetes classification in IoMT patients

Several studies have used ML or AI to predict diabetes. In this section, we present different ML models used for diabetes prediction. The T2D predictive model [7] was implemented on a Korean population dataset. Significant attributes were selected via the data-driven feature selection algorithm. These features were used in XGBoost, RF, and LR to design the prediction model. After the model was examined, a maximum accuracy of 73% was attained via the RF algorithm. Kandhasamy et al. [8] compared the efficiency of the K-nearest neighbors, random forest (RF) classifier, support vector machines (SVMs), and J48 DT classifier. They utilized California University’s data for the research. Among all the other techniques, the J48 DT algorithm performs well, with an accuracy of 73.82%. Mohamed Ahmed [9] introduced a diabetes prediction model based on naïve Bayes (NB), a logistic algorithm, and the J48 algorithm. This study used real-time data from U.S. hospitals to conduct the investigation. The model was trained using three sizes of training data: 50%, 65%, and 80%. For 80% of the training data, the maximum accuracy of 74.4% was attained via the LR algorithm.

In terms of data preprocessing, Azrar et al. [10] utilized several data mining techniques, such as replacing missing values by means and converting data into categorical forms for preprocessing. After three different prediction algorithms were analyzed, the DT algorithm produced a maximum accuracy of 75.65%. In diabetes prediction, ontology-based algorithms were considered by the authors in Ref. [10], and they achieved a prediction accuracy of 77.5%, surpassing that of the SVM, NB, LR, and DT algorithms [11]. Feature extraction algorithms such as k-means clustering, principal component analysis (PCA), and ranking of feature importance were performed on the PIMA Indian Diabetes Dataset (PIDD) [12]. The prediction accuracies of the RF, NB, and J48 DTs were examined on the basis of 3 significant features and 5 significant features. When the RF model with three features was used, the greatest accuracy of 79.57% was attained. Diabetes studies have also used support vector machines [13], CN2 rule induction [14], and the XGBoost classifier [15], which yield accuracies of 74%, 8.7%, and 81%, respectively.

The prediction technique was enhanced by the authors of Ref. [16], who proposed the XDAGBoost and AdaBoost SVM techniques in PIDD. Research shows that, compared with XGBoost, DT, LR, SVM, and RF, the AdaBoost algorithm performs well, attaining an accuracy of 83%. Sarangani et al. [17] implemented the PCA-based feature extraction technique with the RF classifier in PIDD and reached an accuracy of 83%, whereas they implemented the SVM classifier and achieved 81.24% accuracy. Various performance metrics, such as sensitivity and specificity, were also investigated in this study. In contrast to ML, deep learning (DL)-based algorithms are selected for designing automatic prediction models [18]. Soft voting XRL hybrid models were introduced in the NHANES dataset along with XGBoost, RF, and LightGBM classifiers for designing the ML-based diabetes diagnosis model [19]. The XRL model produces 89.46% and outperforms all the other three algorithms.

Shankar et al. [20] investigated the gray wolf optimization (GWO) metaheuristic for diabetes detection on the PIMA dataset. Research has shown that the GWO outperforms the ant colony method and achieves 81% accuracy. Lukmanto [21] utilized the SVM approach for diabetes classification. The F-exponential method was proposed to select significant features from all the features. The training data were subsequently input into the SVM algorithm for diabetes classification, which achieved 89.2% accuracy on the PIMA dataset. Beschi et al. [22] introduced a novel system for classifying diabetes patients on the basis of fuzzy C-means clustering (FCM) and particle swarm optimization (PSO). An evaluation of the performance analysis revealed that the model achieved 82.6% accuracy. Saloni [23] implemented a hybrid voting classifier using three ML algorithms, RF, LR, and NB, and achieved 79.04% accuracy. The research mentioned above used a variety of ML models with varying degrees of accuracy to predict diabetes illness. Soft voting XRL hybrid models were introduced in the NHANES dataset along with the XGBoost, RF, and LightGBM classifiers [24]. Optimization algorithms with AI, such as the Greylag Goose optimization (GGO) algorithm [41], pressure optimizer [40], and particle swarm optimizer [39], have also been implemented.

Machine learning has recently undergone many studies to identify the existence of diabetes effectively at early phases. According to recent research, researchers have introduced various ML models to achieve plausible results. Even though they achieve better accuracy, they fail to address bias, balance the dataset, eliminate outliers, and select significant diabetes parameters. Therefore, it was clear that the result was not validated because the presence of bias, imbalanced dataset, and absence of a feature selection algorithm led to erroneous accuracy. In the literature, handling the missing values, focusing on outlier detection, dataset balancing, and concentrating on feature selection methods were observed to be the crucial steps for diabetes classification research. When there are more missing values in the dataset, machine learning models frequently have trouble producing excellent results during the training stage. The identification and handling of the missing values for each input feature is a significant step in the preprocessing phase. The findings demonstrate that preprocessed data offer a higher level of classification accuracy [25]. The primary aim of this work is to introduce a suitable method for handling missing data, outliers, and imbalanced datasets. It also focuses on selecting the best feature extraction technique for identifying the most significant features and choosing the best ML technique out of four alternate models for the classification of diabetes. In this research, the ML classifiers light gradient boosting model, boosting classifier, RF, and DT are used to study the PIDD and BRFSS datasets. The assessment of the employed classifiers is also well structured in this work.

3 Dataset

3.1 Data Acquiring

In this work, we make use of two different datasets: the PIDD (PIMA Indian Diabetes Dataset) [26] and the BRFSS (Behavioral Risk Factor Surveillance System) datasets [27].

3.1.1 PIDD (PIMA Indian Diabetes Dataset)

The PIMA Indians Diabetes Dataset (PIDD) [28] is a well-known dataset used in machine learning and data science for predicting diabetes outcomes, specifically type 2 diabetes. It was collected by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) and comprises 768 records of female patients from the PIMA Indian population, a group with a high incidence of diabetes. The dataset includes 8 diagnostic features: the number of pregnancies, plasma glucose concentration (measured 2 h after an oral glucose tolerance test), diastolic blood pressure (mm Hg), skinfold thickness (a measure of body fat), insulin levels, body mass index (BMI), diabetes pedigree function (which indicates the likelihood of diabetes on the basis of family history), and patient age. The target variable is a binary label indicating whether the patient has been diagnosed with diabetes (1) or not (0). The dataset is particularly useful for classification models and has been widely applied to predict diabetes outcomes via various machine learning techniques. The attributes collected represent common health measurements, and the binary outcome label allows for supervised learning tasks aimed at predicting the risk of diabetes on the basis of the provided features [24]. Table 1 shows the features of the PIMA dataset.

Table 1 Features of the PIMA dataset

The histograms of the PIMA and BRFSS datasets are shown in Figs. 2 and 3.

Fig. 2 Histplot of the PIDD dataset

Fig. 3 Histplot of the BRFSS dataset

The histogram diagram of the PIMA dataset helps to graphically visualize the distribution of its features. This graph clearly shows that, with the exception of the feature “Outcome”, all the other features vary in range, whereas the outcome is either “1” or “0”. In the PIMA dataset, the first 8 features are given as input attributes to the proposed model, and the last feature, “outcome,” is taken as the target class.

3.1.2 BRFSS (Behavioral Risk Factor Surveillance System)

The Behavioral Risk Factor Surveillance System (BRFSS) dataset [29] is one of the largest continuously conducted health surveys in the world and was developed by the Centers for Disease Control and Prevention (CDC). It collects data from U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. The survey is conducted annually and is used to track health trends over time, with a particular focus on chronic diseases such as diabetes. The dataset includes a wide range of attributes, such as demographic variables (age, sex, race, education level, and income), lifestyle behaviors (such as smoking, alcohol consumption, physical activity, and diet), and health status variables (including body mass index, cholesterol levels, and blood pressure). Table 2 shows the features of the BRFSS.

Table 2 Features of BRFSS dataset

For diabetes outcomes, the BRFSS survey specifically asks participants if they have ever been diagnosed with diabetes by a healthcare professional. The responses are used to label individuals as diabetic (1) or nondiabetic (0). In addition, gestational diabetes (diabetes during pregnancy) is often treated separately in the analysis. The dataset provides crucial insights into how behavioral and demographic factors contribute to the prevalence of diabetes, making it valuable for public health research, policy planning, and preventive healthcare interventions. The richness of the BRFSS dataset, combined with its large sample size, makes it a key resource for studying health behaviors and outcomes associated with diabetes across different populations.

4 Proposed Method

In this section, the proposed method for ML-based diabetes classification is explained. Figure 4 presents the details, where the model flow is organized into six stages. The complete model implementation is performed in Google Colab via Python. The Pandas, Sklearn, NumPy, and Matplotlib packages were used to evaluate the PIMA and BRFSS datasets.

Fig. 4 The proposed approach for predicting diabetes using ML

4.1 Preprocessing

Before applying the ML algorithms, the datasets need to be preprocessed. The performance of an ML model is affected by the presence of missing data in the features. To find the missing values, a missing value check is applied to the dataset. One of the more popular methods for addressing missing values is imputation [30]. Here, each missing value is replaced with the mean of the nonmissing values in the same feature. Mean imputation is preferred in this research since it is simple to apply and does not reduce the sample size of the data. The IQR outlier identification algorithm is used to find outliers. The IQR method, a standard tool for outlier detection, is preferred over other techniques because it can provide a more reliable representation of the spread even if the data are not normally distributed.
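As an illustration of the mean imputation described above, the following minimal sketch replaces missing entries in a single feature with the mean of its observed values. The function name `impute_mean`, the use of `None` as the missing-value marker, and the sample glucose readings are assumptions made for this example; they are not taken from the study's implementation.

```python
from statistics import fmean


def impute_mean(column):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values to average")
    mean_value = fmean(observed)
    # The number of records is unchanged; only the gaps are filled.
    return [mean_value if v is None else v for v in column]


# Hypothetical glucose readings with two missing entries.
glucose = [148.0, None, 183.0, 89.0, None, 116.0]
glucose_imputed = impute_mean(glucose)
```

With Pandas, the same idea is commonly written as `df.fillna(df.mean())`, which fills each column with its own mean.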

The steps below describe the IQR evaluation process.

  1. Start the process

  2. Sort the dataset in ascending order

  3. Determine Q1 (median of the lower half of the feature values)

  4. Determine Q3 (median of the upper half of the feature values)

  5. Calculate the IQR (difference between Q3 and Q1)

  6. Identify outliers:

    • Any value below Q1 − 1.5 × IQR

    • Any value above Q3 + 1.5 × IQR

  7. Remove outliers from the dataset

  8. Stop the process
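The IQR steps listed above can be sketched for a single feature as follows. This is a minimal illustration, not the study's code: the function names, the multiplier argument `k`, and the sample BMI values are assumptions. Q1 and Q3 are computed as the medians of the lower and upper halves, following the steps above; library routines that interpolate quartiles may give slightly different fences.

```python
from statistics import median


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    ordered = sorted(values)                 # sort in ascending order
    half = len(ordered) // 2
    if half == 0:
        raise ValueError("need at least two values to compute quartiles")
    q1 = median(ordered[:half])              # median of the lower half
    q3 = median(ordered[-half:])             # median of the upper half
    iqr = q3 - q1                            # interquartile range
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that lie inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Hypothetical BMI values; 67.1 lies far above the upper fence.
bmi = [22.1, 24.3, 25.0, 26.4, 27.8, 29.5, 31.2, 33.6, 67.1]
cleaned = remove_outliers(bmi)
```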

After applying this IQR algorithm to the PIMA and BRFSS datasets, outliers are eliminated, and this can be seen by plotting the boxplots of the PIMA and BRFSS datasets. The PIDD and BRFSS datasets are not evenly distributed, where one class has a disproportionately lower number of data than the other. In these situations, the system might not perform well on the minority class since there are not enough minority class data to properly train it. To handle this case, a random oversampling technique was proposed for this model. The random oversampling method is chosen over other sampling techniques because of its efficiency and simplicity [24]. Research also shows that the random oversampling technique works better for imbalanced datasets than other sampling methods do [24]. It equalizes the dataset and enhances classifier performance by randomly or arbitrarily oversampling the minority group. This random oversampling approach increases the number of data in the minority group and provides more information to the ML classifier to learn more. As a result, the model can improve its performance by learning both classes (minority and majority classes) and produces better accuracy. In addition, it has been demonstrated to enhance classifier performance on imbalanced datasets, especially when the minority class is very small.

In our study, we employed the interquartile range (IQR) method to filter out outliers before model training. This approach reduces noise in the retained features and helps maintain the robustness of the model. To handle class imbalance, we applied random oversampling, which duplicates instances from the minority class and thereby prevents the model from being biased toward the majority class.
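A minimal sketch of random oversampling is given below, assuming the data are held as a list of rows with a parallel list of labels. The function name `random_oversample`, the `seed` argument, and the toy records are hypothetical. Minority-class rows are drawn with replacement until every class matches the size of the largest class.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())            # size of the majority class
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = rng.choices(pool, k=target - count)   # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical records: four nondiabetic (0) and two diabetic (1) samples.
rows = [[85], [90], [99], [110], [155], [170]]
labels = [0, 0, 0, 0, 1, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
```

Because the added rows are exact copies, a common precaution is to oversample only the training portion of the data so that duplicated records do not also appear in the test set.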

Splitting the dataset is the most popular technique used for validating ML models. In this case, the data in the dataset are split into training and testing sets. A popularly used training-test ratio is 80:20, in which 80% of the data are used as the training set and 20% of the data are used as the testing set. Many other ratios, such as 60:40, 50:30, and 70:30, can also be used for processing [31]. Most recent research papers on diabetes classification have used 80:20 and 70:30 ratios to validate their models. In this work, the training test split function is used to randomly select 20% or 30% of the datasets for testing and 80% or 70% of the datasets for training. The system is trained using the training data. The testing data are used to test the performance of the model and determine its accuracy.
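The random 80:20 split can be sketched with the standard library alone, as shown below. The helper name `train_test_split_indices` and the seed are assumptions; 768 is the raw PIDD record count quoted in Sect. 3 and is used here only as an example size.

```python
import random


def train_test_split_indices(n_samples, test_ratio=0.2, seed=42):
    """Shuffle the record indices and return (train_indices, test_indices)."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]


# Example: an 80:20 split of 768 records.
train_idx, test_idx = train_test_split_indices(768, test_ratio=0.2)
```

In Sklearn, `sklearn.model_selection.train_test_split(X, y, test_size=0.2, random_state=...)` performs the same kind of random split directly on the feature and label arrays.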

4.2 Feature Extraction

Feature extraction is a crucial element in diabetes research because it helps to increase the accuracy of the ML model and minimizes the complexity of the design. It involves choosing the most pertinent attributes from the dataset that have a positive link with the occurrence of diabetes. This helps reduce the dataset’s dimensionality by lowering the data noise and redundancy. Two feature extraction algorithms were proposed in this system: principal component analysis (PCA) and the Boruta algorithm.

4.2.1 Principal Component Analysis

PCA is a widely used dimensionality reduction technique in data analysis and ML. It is used in this work because it creates uncorrelated attributes that effectively define the class [32, 33]. It works by identifying the principal components of the dataset. These components are linear combinations of the original attributes, and they are ranked according to the amount of variance they capture: the first principal component captures the most variance, followed by the second, and so on. To apply PCA to a dataset, the data should first be standardized so that all the features have a uniform scale.

In mean centering, \({A}_{\text{cen}}\) is obtained by subtracting the mean of each attribute from the data matrix, as in Eq. (1):

$${A}_{\text{cen}}=A-\text{mean} (A)$$
(1)

where A indicates the matrix of the dataset, whose columns are the attributes, and n indicates the number of data records. The covariance matrix of the standardized data is then estimated, which shows how different features vary with one another. The covariance matrix CO is estimated via Eq. (2):

$$\text{CO}= \frac{1}{n-1}{A}_{\text{cen}}^{T}{A}_{\text{cen}}$$
(2)

The primary components of this covariance matrix are represented by eigenvectors, and the variance explained by each component is represented by the associated eigenvalues in Eq. (3):

$$\text{CO}{e}_{i}={Y}_{i}{e}_{i}$$
(3)

where \({e}_{i}\) represents the eigenvectors and \({Y}_{i}\) indicates the eigenvalues. By keeping only the top k components, we can reduce the dimensionality of the dataset from the original number of attributes to k, where k is often significantly smaller.

The step-by-step process of the PCA approach is presented below.

Input: Diabetes dataset.

Output: High BP, BMI, heart disease, sex, age, pregnancies, and fruits.

  1. Training data are used to compute the PCA correlation matrix, ∑.

  2. Eigenvalues \({Y}_{i}\) are estimated by solving det(∑ − \({Y}_{i}\)I) = 0.

  3. Eigenvectors are estimated by solving \(\sum {e}_{i}={Y}_{i}{e}_{i}\)

  4. Select the eigenvectors that belong to the first k most significant eigenvalues.

  5. Determine the impact of the PCA result.
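The PCA procedure of Eqs. (1)–(3) can be sketched with NumPy as below. This is an illustrative sketch rather than the study's implementation: the function name `pca`, the synthetic data, and the choice k = 2 are assumptions. The code only mean-centers the data as in Eq. (1); dividing each centered column by its standard deviation beforehand would correspond to working with the correlation matrix mentioned in step 1.

```python
import numpy as np


def pca(A, k):
    """Project A (n records x d attributes) onto its top-k principal components."""
    A_cen = A - A.mean(axis=0)                    # Eq. (1): mean centering
    CO = (A_cen.T @ A_cen) / (A.shape[0] - 1)     # Eq. (2): covariance matrix
    eigvals, eigvecs = np.linalg.eigh(CO)         # Eq. (3): CO e_i = Y_i e_i
    order = np.argsort(eigvals)[::-1][:k]         # indices of the k largest eigenvalues
    components = eigvecs[:, order]                # selected eigenvectors (d x k)
    explained = eigvals[order] / eigvals.sum()    # share of variance per component
    return A_cen @ components, components, explained


# Synthetic data (assumption): two strongly correlated attributes plus one noise attribute.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
A = np.hstack([base,
               2 * base + 0.01 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])
scores, components, explained = pca(A, k=2)
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; it returns eigenvalues in ascending order, hence the explicit reordering.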

4.2.2 Boruta Algorithm

Boruta is a feature selection technique used for finding pertinent features from a dataset. The Boruta algorithm is preferred in this research because, unlike other feature selection methods, it selects the set of attributes that are well suited for the ML model. It is a popular feature selection algorithm because it can be applied with any ML classifier, handles noisy and complex data, and is easy to implement [34]. The Boruta algorithm was created by Kursa and Rudnicki and is built on the RF algorithm. By separately permuting the values of each attribute, Boruta creates “shadow” features whose importance is compared with that of the original attributes in the dataset. The importance of the shadow features, which represents the noise in the data, is used as a benchmark for comparison.

The steps of the Boruta model are as follows:

Input: Diabetes dataset.

Output: High BP, BP, high Chol, BMI, heart disease, physical health, sex, age, pregnancies, and glucose2.

  1. Generate a new matrix of features. Every feature of the matrix, A, is permuted to create the shadow attribute matrix A_S. Bind the shadow matrix A_S to the original matrix, A, to generate a new matrix \({A}_{n}\), as given in Eq. (4):

$${A}_{n}=[A, {A}_{S}]$$
(4)

  2. The model is trained on the new matrix, \({A}_{n}\).

  3. Calculate the Z_Score of every attribute in the original matrix, A, and in the shadow matrix, A_S, and record the Z_Score of the highest shadow attribute, Smax.

  4. Find the significant and insignificant features by checking the condition. A feature with Z_Score > Smax is taken as a significant feature, and a feature with Z_Score < Smax is taken as an insignificant feature.

  5. All the shadow attributes are eliminated.

  6. All the above steps are repeated until all the significant features are selected.
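The shadow-feature comparison at the core of the steps above can be sketched as follows. To keep the example self-contained, the RF-based Z_Score is replaced by a simple stand-in importance, the absolute Pearson correlation with the label; this substitution is an assumption made purely for illustration and is not the importance measure Boruta itself uses. The function name `shadow_feature_hits` and the synthetic data are likewise hypothetical.

```python
import numpy as np


def shadow_feature_hits(A, y, n_iter=20, seed=0):
    """Count how often each real feature scores above the best shadow feature."""
    rng = np.random.default_rng(seed)
    n_features = A.shape[1]
    hits = np.zeros(n_features, dtype=int)

    def importance(M):
        # Stand-in importance (assumption): |Pearson correlation| with the label.
        Mc = M - M.mean(axis=0)
        yc = y - y.mean()
        denom = np.sqrt((Mc ** 2).sum(axis=0) * (yc ** 2).sum())
        return np.abs(Mc.T @ yc) / denom

    for _ in range(n_iter):
        # Shadow matrix A_S: each column of A permuted independently.
        A_S = np.column_stack([rng.permutation(A[:, j]) for j in range(n_features)])
        A_n = np.hstack([A, A_S])                  # Eq. (4): A_n = [A, A_S]
        scores = importance(A_n)
        s_max = scores[n_features:].max()          # best shadow score (Smax)
        hits += scores[:n_features] > s_max        # real features that beat Smax
    return hits


# Synthetic data (assumption): one informative feature and two noise features.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=300)
informative = y + 0.3 * rng.normal(size=300)
noise = rng.normal(size=(300, 2))
A = np.column_stack([informative, noise])
hits = shadow_feature_hits(A, y)
```

A feature that beats the best shadow in most iterations would be treated as significant; the full Boruta procedure decides this with a statistical test on the hit counts rather than a fixed cutoff.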

4.3 Machine Learning Models

4.3.1 Light Gradient Boost

Light gradient boosting is a widely used ML technique that is frequently utilized for classification applications. It operates on the basis of the combination of a tree-based learning method and a gradient boosting model. The LightGBM classifier identifies a function f(x) that maps the attributes x to the target y. This model is chosen for this framework because of its efficiency and power. The quantity of data that need to be processed during each iteration of the gradient boosting algorithm is decreased via a method called gradient-based one-sided sampling (GOSS). The variance gain used by GOSS is expressed in Eqs. (5)–(9):

$${G}_{j}\left(d\right)=\frac{1}{n}\left\{\frac{{\left(\sum_{{x}_{i}\in {S1}_{l}}{g}_{i}+\frac{1-{c}_{1}}{{c}_{2}}\sum_{{x}_{i}\in {S2}_{l}}{g}_{i}\right)}^{2}}{{n}_{l}^{j}(d)}+\frac{{\left(\sum_{{x}_{i}\in {S1}_{r}}{g}_{i}+\frac{1-{c}_{1}}{{c}_{2}}\sum_{{x}_{i}\in {S2}_{r}}{g}_{i}\right)}^{2}}{{n}_{r}^{j}(d)}\right\}$$
(5)

where \({G}_{j}\left(d\right)\) is the estimated variance gain of splitting attribute j at point d, computed over S1 ∪ S2; \({g}_{i}\) is the gradient of instance \({x}_{i}\); S1 is the set of retained large-gradient instances; S2 is the randomly sampled set of small-gradient instances; and \({c}_{1}\) and \({c}_{2}\) are the corresponding sampling ratios. The left and right subsets are defined as follows:

$${S1}_{l}= \left\{{x}_{i}\in S1: {x}_{ij }\le d\right\}$$
(6)
$${S1}_{r}= \left\{{x}_{i}\in S1: {x}_{ij }> d\right\}$$
(7)
$${S2}_{l}= \left\{{x}_{i}\in S2: {x}_{ij }\le d\right\}$$
(8)
$${S2}_{r}= \left\{{x}_{i}\in S2: {x}_{ij }> d\right\}$$
(9)
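The sampling step behind GOSS can be sketched as follows: the instances with the largest absolute gradients are all kept (S1), a random subset of the remaining instances is drawn (S2), and the sampled small-gradient instances are re-weighted by the factor (1 − c1)/c2 that appears in Eq. (5). The function name `goss_sample`, the default ratios, and the synthetic gradients are assumptions for illustration; this is not LightGBM's internal code.

```python
import numpy as np


def goss_sample(gradients, c1=0.2, c2=0.1, seed=0):
    """Return (indices, weights) of the instances kept by one GOSS step."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))       # largest |gradient| first
    n_top = int(c1 * n)
    n_rand = int(c2 * n)
    s1 = order[:n_top]                           # S1: all large-gradient instances
    s2 = rng.choice(order[n_top:], size=n_rand, replace=False)  # S2: random subset of the rest
    indices = np.concatenate([s1, s2])
    # Small-gradient instances are amplified so the gain estimate stays roughly unbiased.
    weights = np.concatenate([np.ones(n_top), np.full(n_rand, (1 - c1) / c2)])
    return indices, weights


# Synthetic gradients (assumption) for 100 instances.
g = np.linspace(-1.0, 1.0, 100)
idx, w = goss_sample(g)
```

With these ratios only 30 of the 100 instances are passed on to the split search, which is where the reduction in per-iteration work comes from.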

This method can enable LightGBM to match or outperform well-known gradient boosting algorithms such as XGBoost and CatBoost in terms of accuracy and training time. The objective function used in the LightGBM algorithm is given in Eq. (10):

$$\text{Obj}\left(\theta \right)=\sum_{i=1}^{n}l({y}_{i},{y}_{i}^{\wedge(t-1)})+\sum_{j=1}^{T}{f}_{j}\left({x}_{i},{\theta }_{j}\right)+\sum_{j=1}^{T}\Omega ({f}_{j})$$
(10)

where θ is the set of parameters that the model learns through training, T is the total number of trees, l is the loss function, \({y}_{i}\) is the true label/outcome, and Ω is the regularization term. This algorithm uses a gradient boosting method to build an ensemble of DTs that approximates f(x). The ensemble of DTs is trained using the input features to perform classification. The prediction of DT (PT) and the predicted values of DT (Pv) are mathematically given in Eqs. (11)–(12) as follows:

$${P}_{T}=\sum_{j=1}^{T}\Omega ({f}_{j})$$
(11)
$${P}_{v}=\sum_{i=1}^{n}l({y}_{i},{y}_{i}^{\wedge(t-1)})+\sum_{j=1}^{T}{f}_{j}\left({x}_{i},{\theta }_{j}\right)-{y}_{i}$$
(12)

The loss function of the light gradient boost machine can be estimated via Eq. (13):

$$f(l)=\sum_{i=1}^{n}l({y}_{i},{y}_{i}^{\wedge(t-1)})+\sum_{j=1}^{T}{f}_{j}\left({x}_{i},{\theta }_{j}\right)-{y}_{i}$$
(13)

where i indicates the given data record.

4.3.2 Gradient Boost Classifier

The gradient boosting classifier works by constructing a sequence of DTs, where each DT tries to correct the errors made by the previous DTs in the sequence. The output of this algorithm is the sum of the predictions of all the DTs. This approach is powerful and can achieve high accuracy on large datasets. In this algorithm, proper tuning of various hyperparameters, such as the learning rate, the depth of the DTs, and the number of DTs in the sequence, is necessary to achieve high accuracy. The training data D are represented in Eq. (14) as follows:

$$D={\left\{{x}_{i},{y}_{i}\right\}}_{1}^{N}$$
(14)

The aim of the gradient boost classifier is to reduce the value of the loss function, F(l). This classifier algorithm constructs an approximation function on the basis of Eqs. (15)–(16):

$${F}_{q}\left(x\right)={F}_{q-1}\left(x\right)+{Q}_{q}{H}_{q}\left(x\right)$$
(15)

where Qq denotes the weight of the qth approximation function, Hq(x):

$$\left({Q}_{q},{H}_{q}\left(x\right)\right)=\underset{Q,H}{\text{argmin}}\sum_{i=1}^{N}f\left({y}_{i},{F}_{q-1}\left({x}_{i}\right)+QH({x}_{i})\right)$$
(16)

Rather than solving this optimization problem directly, Hq can be trained on the dataset given in Eq. (17):

$$D={\left\{{x}_{i},{p}_{oi}\right\}}_{i=1}^{N}$$
(17)

where poi is a pseudoresidual that can be estimated via Eq. (18):

$${p}_{oi}=\left|\frac{\partial f\left({y}_{i},F\left(x\right)\right)}{\partial F\left(x\right)}\right|$$
(18)

The prediction of the gradient boost classifier is given by Eq. (19):

$$\hat{y}\left(x\right)=\sum_{t=1}^{T}{\gamma }_{t}{P}_{T}\left(x;{\ominus}_{t}\right)$$
(19)

where T represents the total number of trees in the system, γt represents the shrinkage factor/learning rate, and PT represents the prediction of the generated tree on the basis of its parameters, ⊝t.
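The stagewise procedure of Eqs. (15)–(19) can be illustrated with a minimal sketch. It rests on several assumptions that are not part of the original work: a single input feature, a squared-error loss (so that the signed pseudoresidual is simply y − F), one-split trees (stumps) as the base learners, a constant learning rate, and an initial constant prediction equal to the mean label. All function names are hypothetical.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]  # (threshold, value if x <= threshold, value otherwise)


def fit_stump(xs: List[float], residuals: List[float]) -> Stump:
    """Least-squares fit of a one-split regression tree to the residuals."""
    best_sse, best_stump = float("inf"), (xs[0], 0.0, 0.0)
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right) if right else left_mean
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if sse < best_sse:
            best_sse, best_stump = sse, (threshold, left_mean, right_mean)
    return best_stump


def stump_predict(stump: Stump, x: float) -> float:
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosting(xs: List[float], ys: List[float],
                 n_trees: int = 20, learning_rate: float = 0.5):
    """Return (initial constant, stumps), fitted one stage at a time."""
    base = sum(ys) / len(ys)
    current = [base] * len(xs)
    stumps: List[Stump] = []
    for _ in range(n_trees):
        # Pseudoresiduals: negative gradient of 0.5 * (y - F)^2 with respect to F.
        residuals = [y - f for y, f in zip(ys, current)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        current = [f + learning_rate * stump_predict(stump, x)
                   for f, x in zip(current, xs)]
    return base, stumps


def boosting_predict(base: float, stumps: List[Stump], x: float,
                     learning_rate: float = 0.5) -> float:
    """Shrunken sum of the tree outputs, in the spirit of Eq. (19)."""
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)
```

For a binary label coded as 0/1, the output of `boosting_predict` can be thresholded at 0.5 to obtain a class; a classifier trained with a logistic loss would replace the residual computation but keep the same stagewise structure.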

4.3.3 Random Forest

The RF classifier is an ML algorithm used for classification. It is an ensemble method that combines many DTs to make predictions. Together, the randomized trees form the estimate given in Eqs. (20)–(21):

$${R}_{s}\left(x,{D}_{s}\right)={\hat{E}}_{\ominus}\left[{R}_{s}\left(x,\ominus ,{D}_{s}\right)\right]$$
(20)

where ⊝ denotes the randomizing variable and Ds represents the dataset:

$${R}_{s}\left(x,\ominus \right)=\left[\frac{\sum_{i=1}^{n}{y}_{i}{1}_{\left({x}_{i}\in {B}_{n}\left(x,\ominus \right)\right)}}{\sum_{i=1}^{n}{1}_{\left({x}_{i}\in {B}_{n}\left(x,\ominus \right)\right)}}\right]{1}_{{E}_{s}\left(x,\ominus \right)}$$
(21)

The event E can be expressed via Eq. (22):

$$E={\hat{E}}_{\ominus}\left[{R}_{s}\left(x,\ominus ,{D}_{s}\right)\right]$$
(22)

where Bn(x, ⊝) represents the rectangular cell of the randomized tree that contains x. The resulting estimate is given in Eq. (23):

$${\hat{E}}_{\ominus}\left[{R}_{s}\left(x,\ominus \right)\right]={\hat{E}}_{\ominus}\left[\frac{\sum_{i=1}^{n}{y}_{i}{1}_{\left({x}_{i}\in {B}_{n}\left(x,\ominus \right)\right)}}{\sum_{i=1}^{n}{1}_{\left({x}_{i}\in {B}_{n}\left(x,\ominus \right)\right)}}{1}_{{E}_{s}\left(x,\ominus \right)}\right]$$
(23)

In this method, a subset of attributes is randomly chosen, and each DT in the classifier is trained on these randomly sampled attributes. The random forest algorithm produces its prediction by collecting a vote from all the DTs. The prediction of the RF algorithm is given in Eq. (24):

$$\hat{y}\left(x\right)=\frac{1}{T}\sum_{i=1}^{T}{P}_{T}\left(x\right)$$
(24)

where T denotes the number of trees in the model and PT(x) is the prediction of the tree.
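The combination of random attribute sampling and voting in Eq. (24) can be sketched as follows. This is an illustrative simplification, not the implementation used in this work: each tree is reduced to a one-split stump, each stump is trained on a bootstrap sample (an assumption, since the resampling scheme is not detailed here), labels are assumed to be binary (0/1), and all function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Row = Sequence[float]
Stump = Tuple[int, float, int, int]  # (feature, threshold, label if <=, label if >)


def majority(labels: List[int]) -> int:
    """Most frequent binary label; ties go to class 1."""
    return int(2 * sum(labels) >= len(labels))


def fit_stump(rows: List[Row], labels: List[int], features: List[int]) -> Stump:
    """Pick the (feature, threshold) pair with the fewest training errors."""
    best_errors, best_stump = len(rows) + 1, (features[0], 0.0, 0, 0)
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            left_label = majority(left)
            right_label = majority(right) if right else left_label
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if errors < best_errors:
                best_errors = errors
                best_stump = (f, threshold, left_label, right_label)
    return best_stump


def fit_forest(rows: List[Row], labels: List[int], n_trees: int = 25,
               n_features: int = 1, seed: int = 0) -> List[Stump]:
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]               # bootstrap sample
        features = rng.sample(range(len(rows[0])), n_features)   # random attributes
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], features))
    return forest


def tree_predict(stump: Stump, row: Row) -> int:
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label


def forest_vote(forest: List[Stump], row: Row) -> float:
    """Average of the individual tree predictions, as in Eq. (24)."""
    return sum(tree_predict(s, row) for s in forest) / len(forest)


def forest_predict(forest: List[Stump], row: Row) -> int:
    return int(forest_vote(forest, row) >= 0.5)
```

Averaging 0/1 tree outputs and thresholding at 0.5 is equivalent to a majority vote, which is why Eq. (24) can serve both as a class-probability estimate and as a voting rule.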

4.3.4 Decision Tree (DT)

The decision tree is essentially a flowchart that starts at the root and uses the input features to branch out a tree with several decisions. Each leaf node in the DT indicates an outcome, whereas each internal node represents a decision formed on the basis of the attribute.

The nodes of the DT are split according to the criteria in Eqs. (25)–(27):

$$F\left({D}_{r},x\right)=H\left({D}_{r}\right)-\sum_{z=1}^{m}\frac{{N}_{z}}{{N}_{r}}H\left({D}_{z}\right)$$
(25)

where Dr and Dz are the root node and the zth node, Nr represents the number of samples in the root node, and Nz denotes the number of samples in the zth node. For a binary split, this becomes:

$$F\left({D}_{r},x\right)=H\left({D}_{r}\right)-\frac{{N}_{l1}}{{N}_{r}}H\left({D}_{l1}\right)-\frac{{N}_{r1}}{{N}_{r}}H\left({D}_{r1}\right)$$
(26)

where Dl1 and Dr1 are the left and right child nodes, and Nl1 and Nr1 are the numbers of samples in the left and right child nodes, respectively. The entropy of a node w is given in Eq. (27):

$${I}_{E}\left(w\right)=-\sum_{u=1}^{v}p\left(u|w\right){\text{log}}_{2}p\left(u|w\right)$$
(27)

where p(u|w) is the proportion of samples at node w that belong to class u, and v is the number of classes. The Gini index, IGINI, is given in Eq. (28):

$${I}_{\text{GINI}}\left(w\right)=1-\sum_{u=1}^{v}{p\left(u|w\right)}^{2}$$
(28)

The classification error can be expressed via Eq. (29):

$${I}_{\text{CE}}\left(w\right)=1-\text{max}\left\{p\left(u|w\right)\right\}$$
(29)
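As a minimal sketch of the impurity measures in Eqs. (27)–(29) and the gain in Eq. (25), the following Python functions compute each quantity from a list of class labels. The function names are illustrative and do not come from the original implementation; using entropy as the default impurity H is likewise an assumption.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def class_proportions(labels: Sequence[int]) -> List[float]:
    """Proportion of samples in the node that belong to each class."""
    return [count / len(labels) for count in Counter(labels).values()]


def entropy(labels: Sequence[int]) -> float:
    """Eq. (27): entropy of a node."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def gini(labels: Sequence[int]) -> float:
    """Eq. (28): Gini index of a node."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def classification_error(labels: Sequence[int]) -> float:
    """Eq. (29): classification error of a node."""
    return 1.0 - max(class_proportions(labels))


def information_gain(parent: Sequence[int],
                     children: Sequence[Sequence[int]],
                     impurity: Callable[[Sequence[int]], float] = entropy) -> float:
    """Eq. (25): parent impurity minus the size-weighted child impurities."""
    n = len(parent)
    return impurity(parent) - sum(len(c) / n * impurity(c) for c in children)
```

For a balanced binary node such as `[0, 0, 1, 1]`, the entropy is 1, and both the Gini index and the classification error are 0.5; splitting it into the pure children `[0, 0]` and `[1, 1]` gives an information gain of 1.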

Constructing a DT involves choosing, at each internal node, the feature that best divides the data on the basis of information gain. This process helps increase the prediction accuracy of the system while reducing its complexity. The prediction made via the DT technique is given in Eq. (30):

$$\hat{y}\left(x\right)=f\left(x;\ominus \right)$$
(30)

where f(x; ⊝) is the prediction function with parameters ⊝ and ŷ(x) is the predicted value.
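The construction process described in this subsection can be sketched as a short recursive procedure. The code is a simplified illustration under stated assumptions: numeric features, binary threshold splits, entropy as the impurity measure, and a depth limit as the only stopping rule besides zero gain. The names `best_split`, `build_tree` and `predict`, and the tuple encoding of internal nodes, are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature, threshold) with the highest information gain, or None."""
    best = None
    parent = entropy(labels)
    for f in range(len(rows[0])):
        # The largest value is skipped so that the right child is never empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = parent - weighted
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best


def build_tree(rows, labels, max_depth=3):
    """Internal nodes are (feature, threshold, left, right); leaves are class labels."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None or split[0] <= 0.0:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    _, f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return (
        f,
        t,
        build_tree([rows[i] for i in left], [labels[i] for i in left], max_depth - 1),
        build_tree([rows[i] for i in right], [labels[i] for i in right], max_depth - 1),
    )


def predict(tree, row):
    """Follow the decisions from the root until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree
```

The `predict` function mirrors the flowchart view of the DT: each internal node tests one attribute, and the label stored at the leaf that is reached is the predicted value.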

4.3.5 Evaluation and Validation

The evaluation of ML models is an important step in the development phase. It assesses a model’s effectiveness and determines whether the model operates at the appropriate level of accuracy and dependability. Metrics such as accuracy, precision and recall help to estimate the model’s efficacy, and evaluating several models is useful for choosing the one that best satisfies a specific need. By assessing the model’s performance via Eqs. (31)–(33), we can identify problem areas and optimize the hyperparameters, architecture, or other elements that can enhance the model’s performance:

$$\text{Precision}=\frac{\text{True Positive}}{\text{True Positive}+\text{False Positive}}$$
(31)
$$\text{Recall}=\frac{\text{True Positive}}{\text{True Positive}+\text{False Negative}}$$
(32)
$$\text{F1 Score}=\frac{2\times \text{Precision}\times \text{Recall}}{\text{Precision}+\text{Recall}}$$
(33)
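Eqs. (31)–(33) can be computed directly from the entries of a binary confusion matrix, as in the following sketch. The function names and the convention of returning 0 when a denominator is zero are assumptions made for illustration.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> Dict[str, int]:
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int],
                        positive: int = 1) -> Tuple[float, float, float]:
    """Precision (Eq. 31), recall (Eq. 32) and F1 score (Eq. 33)."""
    c = confusion_counts(y_true, y_pred, positive)
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    precision = c["tp"] / predicted_pos if predicted_pos else 0.0
    recall = c["tp"] / actual_pos if actual_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

As a worked example, with true labels `[1, 1, 1, 0, 0, 0, 0, 1]` and predictions `[1, 1, 0, 0, 0, 1, 0, 1]` there are 3 true positives, 1 false positive and 1 false negative, so precision, recall and the F1 score are all 0.75.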

The random oversampling technique can result in overfitting if it is not used carefully. This is because the duplicate instances in the minority class can cause the classifier to be too biased toward the minority class, which results in poor performance on new data. To ensure that the classifier generalizes successfully to new data, it is crucial to combine random oversampling with additional approaches such as cross-validation and regularization.
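Random oversampling itself is simple to express in code. The sketch below duplicates randomly chosen minority-class records until every class is as frequent as the largest one; the function name, the fixed seed and the balancing target are assumptions made for illustration.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def random_oversample(rows: Sequence, labels: Sequence[int],
                      seed: int = 0) -> Tuple[List, List[int]]:
    """Duplicate minority-class records at random until all classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == label]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels
```

When oversampling is combined with cross-validation, it is generally safer to apply it to the training folds only, so that duplicated records do not appear in both the training and the validation data.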

The tenfold cross-validation approach is a powerful and efficient way to evaluate the performance of ML models. It is preferred because it reduces bias, makes systematic use of the data, provides a more accurate estimate of performance, and gives a better indication of how well the model generalizes to new data.

5 Performance Analysis

The proposed model was tested with 20% and 30% of each dataset held out as test data. Furthermore, two feature extraction algorithms, namely principal component analysis (PCA) and the Boruta algorithm, were introduced. The performance of the ML models was examined individually under each condition via a confusion matrix. Table 3 shows that, with 30% of the PIDD data used for testing, the random forest (RF) algorithm achieves an accuracy of 89% with the PCA feature selection method and 91% with the Boruta feature selection method. Similarly, with 30% of the BRFSS data used for testing, the RF reaches 85% accuracy with PCA and 90% with Boruta.

Table 3 30% testing data—performance analysis without feature extraction

Tables 3 and 4 describe the performance of the ML models on both the PIDD and BRFSS datasets without feature extraction techniques for training:testing ratios of 70:30 and 80:20, respectively. The tables show that none of the ML models achieved better accuracy without feature extraction.

Table 4 20% testing data—performance analysis without feature extraction

According to Tables 5 and 6, the Boruta feature selection algorithm provides greater accuracy than the PCA feature selection method. In addition, the maximum accuracies of 94% for the PIDD dataset and 92% for the BRFSS dataset were attained via the random forest classifier. With 20% testing data for both datasets, the performance of the ML models improved, reaching 94% for PIDD (Fig. 5) and 92% for BRFSS (Fig. 6).

Table 5 30% testing data—performance analysis with feature extraction
Table 6 20% testing data—performance analysis with feature extraction
Fig. 5 Performance on the PIMA dataset
Fig. 6 Performance on the BRFSS dataset

The performance of the ML models is better when the training:testing ratio is 80:20. When the data are of good quality, an ML model learns from them easily and achieves good accuracy even with 20% testing data. Compared with the 70:30 ratio, the 80:20 ratio gives the models more training data from which to learn the underlying patterns. The best performance is attained by combining the Boruta feature selection algorithm with the RF model for both the PIDD and the BRFSS datasets.

This is because the Boruta feature selection method selects the most important features from the dataset, and the RF model captures the influential features and their complex associations with diabetes classification. This results in good classification accuracy under the tenfold cross-validation technique.

To assess the performance of the ML models, a tenfold cross-validation approach [35] is also included in the proposed work. The data are initially split into 10 equal folds, after which the ML algorithm is trained 10 times, each time holding out a different fold for validation and training on the remaining nine. The performance estimate obtained with this method is more reliable than that of a single train-test split [36].

The model’s effectiveness is evaluated by averaging the results of all ten iterations. On the basis of the tenfold cross-validation approach [37], the performance of the RF classifier model with the Boruta feature extraction method (Tables 7 and 8) [38] was validated. The results indicate that the RF classifier performs better when integrated with the Boruta feature extraction approach.
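The tenfold procedure described above can be sketched as follows. The sketch assumes a plain shuffled split without stratification and uses accuracy as the score; `fit` and `predict` are placeholder callables supplied by the user, and all names are hypothetical rather than taken from the original implementation.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle the sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows: Sequence, labels: Sequence[int],
                   fit: Callable, predict: Callable,
                   k: int = 10, seed: int = 0) -> Tuple[float, List[float]]:
    """Train k times, each time validating on a different held-out fold."""
    folds = k_fold_indices(len(rows), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = fit([rows[j] for j in train_idx], [labels[j] for j in train_idx])
        correct = sum(predict(model, rows[j]) == labels[j] for j in test_idx)
        scores.append(correct / len(test_idx))
    # The reported performance is the average over the k validation folds.
    return mean(scores), scores
```

Any classifier can be plugged in through `fit(train_rows, train_labels)` and `predict(model, row)`; any resampling step, such as oversampling, would be applied inside `fit` so that it only ever sees the training folds.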

Table 7 Ten-fold cross-validation results for PIDD
Table 8 Ten-fold cross-validation results for the BRFSS

Table 9 shows a comparative analysis of various existing methods and our proposed model on the PIMA dataset. The table shows that the proposed model is superior to the other diabetes ML models considered in the analysis. Because the model has been evaluated and validated, it could be used in practice to help physicians classify diabetes.

Table 9 Comparative analysis of the proposed approach with state-of-the-art approaches

6 Conclusion and Future Scope

In this article, we propose a supporting detection system based on an analysis of four ML algorithms on two separate datasets, in which data cleaning plays a major role in classification. The quality of the datasets was enhanced by missing data imputation, outlier detection and random oversampling. Data balancing was a particular concern, and the random oversampling technique was introduced to balance both datasets. Multiple evaluation metrics, including accuracy, recall and the F1 score, were compared to examine the performance of the ML methods. The classification results indicate that the random forest algorithm performs best and offers the most accurate classification. Some of the other techniques used in this work also produce better results than methods currently available in the literature. The primary aim of this research is to assist diabetologists in developing an accurate treatment regimen for patients who are diabetic. Because of its high precision and its support for quick diagnosis and treatment, this work may pave the way for the creation of a digital healthcare system for diabetes patients. The study could be extended in several directions, such as the diagnosis of diabetes via deep learning and hybrid techniques and the creation of an Android app to assist people in determining whether they have diabetes. Other promising options for improving performance include the use of clustering algorithms or genetic techniques as the computational model.

This research proposes an IoMT application for e-diagnosis technology for diabetes classification. Moreover, this model also contributes to the remote care and observation of patients with diabetes. Implementing the IoMT can simplify the process of collecting and evaluating data. In the future, novel automated and computerized methods with IoMT can be created to improve the classification of diabetes and other chronic diseases.

Future work could include testing more sophisticated data preprocessing methods, applying the model to different diseases, or integrating deep learning techniques to increase performance. It should also examine the challenges of deploying these models in real-world environments, such as data privacy concerns, model interpretability, and scalability across different healthcare systems, and propose solutions to them. Addressing these practical considerations would make the research more comprehensive and valuable for both academic and practical applications.