Abstract
The utility of DNA sequencing in the diagnosis and prognosis of diseases is vital for assessing the risk of genetic disorders, particularly for asymptomatic individuals with a genetic predisposition. Such diagnostic approaches are integral in guiding health and lifestyle decisions and in preparing families with the foreknowledge needed to anticipate potential genetic abnormalities. The present study explores a define-by-run deep learning (DL) model optimized using the Tree-structured Parzen Estimator algorithm to enhance the precision of genetic diagnostic tools. Unlike conventional models, the define-by-run model bolsters accuracy through dynamic adaptation to the data during the learning process and iterative optimization of critical hyperparameters, such as layer count, neuron count per layer, learning rate, and batch size. The model was trained and evaluated on a diverse dataset comprising DNA sequences from two distinct groups: patients diagnosed with breast cancer and a control group of healthy individuals. It showcased remarkable performance, with accuracy, precision, recall, F1-score, and area under the curve reaching 0.871, 0.872, 0.871, 0.872, and 0.95, respectively, outperforming previous models. These findings underscore the significant potential of DL techniques to improve the accuracy of disease diagnosis and prognosis through DNA sequencing, indicating substantial advancements in personalized medicine and genetic counseling. Collectively, the findings of this investigation suggest that DL holds transformative potential for the diagnosis and management of genetic disorders.
1 Introduction
As personalized medicine expands, genetic testing has been progressively recognized as a cornerstone of contemporary healthcare practices [1]. Among these technologies, DNA sequencing, the process of determining the precise order of nucleotides within a DNA molecule, is particularly interesting [2]. This technology allows for an in-depth understanding of genetic makeup and of the roles specific genes or genetic variants may play in health and disease, and it presents an unrivaled opportunity to evaluate an individual's risk of developing genetic disorders [3]. While the individuals concerned may not manifest symptoms, they may have inherited a genetic predisposition towards certain conditions, making DNA sequencing especially critical for this demographic [4]. Traditional diagnostic methodologies often demonstrate limitations, particularly in their capacity to deliver consistent accuracy and predictive power [5]. Traditional diagnostic methods encompass a range of medical testing procedures used to identify diseases and conditions, including physical examinations, imaging tests such as X-rays or MRIs, laboratory tests such as blood and urine tests, and biopsies. While these methods are fundamental and effective for diagnosing a wide range of diseases, they have limitations when predicting genetic disorders or characterizing an individual's genetic predisposition to certain conditions. Unlike DNA sequencing, traditional diagnostic methods do not provide deep insight into an individual's genetic makeup and may lack the predictive power to ascertain the risk of genetic disorders at an early or pre-symptomatic stage. In this light, the emerging field of personalized medicine, fueled by advances in genetic testing and deep learning (DL), aims to transcend these limitations by offering a more precise, proactive, and personalized approach to disease diagnosis, prediction, and management. Especially in this emerging field, genetics has become an integral component of modern healthcare practice [6].
Genetic testing is at the core of these advancements, a technique that has changed our ability to predict, identify, and treat a vast array of diseases [7]. The development of DNA sequencing, which determines the exact sequence of nucleotides in a DNA molecule, has been of immense assistance to genetic testing [8]. Using this technology, medical practitioners may know an individual's genetic makeup and forecast their likelihood of developing certain genetic diseases [9]. Despite the significance of these breakthroughs, conventional diagnostic procedures struggle to consistently provide high levels of accuracy and predictive power, signaling the need for a new technique [10].
DL, a sophisticated branch of machine learning (ML) [11,12], is rapidly gaining traction across various scientific domains, including genomics and personalized medicine [13]. The ability of DL algorithms to harness vast, diverse datasets enables them to uncover complex patterns and correlations that often remain undetected by conventional statistical methodologies [14,15]. Numerous genomic researchers have used DL effectively, identifying unique genetic variations linked to many disorders [16]. However, the typical training of DL models requires preset hyperparameters, which may limit the models' adaptability to the data. The paragraphs below summarize the most recent published research on ML-based analysis of DNA sequences for diagnosis, together with its limitations.
Index | DNA sequence | Class
---|---|---
0 | TTAAAAGCAGGTTATATAGGCTAAATAGAACTAATCATTGTTTTAG… | 0 |
1 | TTAAAAGCAGGTTATATAGGCTAAATAGAACTAATCATTGTTTTAG… | 0 |
2 | TTAAAAGCAGGTTATATAGGCTAAATAGAACTAATCATTGTTTTAG… | 0 |
3 | TCAATGACTTTCTAGTAACTCAGCAGCATCTCAGGGCCAAAAATTT… | 0 |
4 | TCATAATGCTTGCTCTGATAGGAAAATGAGATCTACTGTTTTCCTT… | 0 |
… | … | … |
22,582 | AATCCCCAGTGTTGGAGGTAGGGCCCAGTGGGAGGTGTCTGAATCA… | 1 |
22,583 | TACAAAGAAATATCTGAGACTGGGTAATTTAGAAAGAAGAGAGGAT… | 1 |
22,584 | TTCTCACACTGCTATAAAGAAATACCTGAGACTGGGTAATTTATTA… | 1 |
22,585 | GGCTGTTCTTACTCTGCTATAAAGAAATACTGGAGAATGGGTAATT… | 1 |
22,586 | AAATATCTGACACTGGGTAATTTATAAAGAAAAGATGTTTAATTTG… | 1 |
Recent works in the domain of genetic testing and DNA sequence analysis have showcased a variety of computational and ML methodologies. For instance, a study by Ali et al. [17] proposed a computational method for accurately identifying DNA-binding proteins using a combination of split amino acid composition and a position-specific scoring matrix for feature extraction. Although effective, this model can become computationally intensive and slow for large-scale applications due to its use of several feature extraction and selection methods. On the other hand, research by Alakus and Baykara [18] employed a DL model, specifically a Bi-directional Long Short-Term Memory (BiLSTM) model, for DNA sequence classification. However, relying solely on the BiLSTM model and raw DNA sequences, without additional feature extraction operations, might limit performance.
Furthermore, a novel DL model named LegNet was introduced by Penzar et al. [19], showcasing its prowess in predicting promoter expression from DNA sequences. Despite its effectiveness, the training of this model is computationally intensive, requiring specific hardware and software configurations. In a different stride, the paper by Zhang et al. [20] unveiled FunGeneTyper, a DL framework aimed at protein-coding gene function prediction, yet lacked a detailed discussion on computational efficiency. Finally, research by Gunasekaran et al. [21] explored the use of various DL models like convolutional neural network (CNN), CNN-LSTM, and CNN-BiLSTM for DNA sequence classification. However, the manual tuning of hyperparameters in these models posed significant limitations.
As these studies highlight, the challenges posed by manual hyperparameter tuning underscore the need for more efficient, automated optimization methods such as grid search, random search, and Bayesian optimization, and they provide a segue into the define-by-run DL methodology employed in this research.
In summary, the previous works rely on manual hyperparameter tuning. Manual tuning of hyperparameters in DL models has the following limitations:
Manual tuning involves a considerable amount of time and effort. Each set of hyperparameters must be separately tested, with the process repeated until satisfactory results are obtained [22].
Manual tuning requires a deep understanding of the model and how each hyperparameter affects it, which can be hard for people who do not have much experience in the field [23].
No guarantee of an optimal solution; even with much time and effort, discovering the optimal set of hyperparameters is not certain. When the search space is large, it is easy to miss the best configuration [24].
Manual tuning is not scalable for models with many hyperparameters. As the number of hyperparameters increases, the tuning procedure becomes exponentially more difficult [25].
Manual tuning can induce bias, as the tuner's prior experiences or presumptions can impact decisions [26].
When manually tuning hyperparameters, there is a potential of overfitting the model to the training data, which might result in poor generalization performance on unseen data [27].
Automated methods like grid search, random search, and Bayesian optimization aim to overcome these limitations by systematically exploring the hyperparameter space and finding the optimal set in a more efficient and less biased manner [28].
A more adaptable method, define-by-run DL, has been developed [29]. Define-by-run models are characterized by their ability to dynamically and iteratively optimize model hyperparameters, which gives them a significant edge over traditional models in terms of performance and adaptability [25]. This approach allows the model to tune its hyperparameters repeatedly and dynamically based on the data, offering an advantage over conventional models. These hyperparameters include the number of layers, the number of neurons in each layer, the learning rate, and the batch size, among others. This adaptability enables the model to learn from and adapt to the data more effectively, hence enhancing the model's performance and precision.
This research employs a define-by-run DL model reinforced by applying the Tree-structured Parzen Estimator (TPE) algorithm to refine the application of DNA sequencing in disease diagnosis and prediction [30]. The TPE algorithm addresses hyperparameter optimization by building a probabilistic model of the objective function's conditional distributions, which enables more efficient hyperparameter space searches. This strategy has significantly enhanced the optimization procedure and overall performance of DL models. The model's distinguishing feature is its capacity for iterative optimization of critical hyperparameters throughout the learning process, enabling it to adjust dynamically to the data, thereby ensuring superior accuracy [31]. By leveraging a diverse dataset comprising DNA sequences from control and patient cohorts, the study aims to ascertain the efficacy of this innovative approach in enhancing the precision of disease prediction and diagnosis [32]. This research strives to contribute to the ongoing discourse on the transformative potential of DL in genetic disorder diagnosis and management [33].
The pivotal dataset encompasses DNA sequences from two disparate cohorts: individuals previously diagnosed with breast cancer and a control group of healthy subjects. The goal is to scrutinize the genetic distinctions, potentially revealing genetic markers or mutations correlated with, or causative of, breast cancer. The DL model is architected to adapt dynamically during learning, iteratively optimizing critical hyperparameters to enhance its diagnostic acumen. The significance of this endeavor is profound, as early detection of breast cancer is paramount in elevating the success rate of therapeutic interventions and improving patient prognoses. By leveraging DNA sequencing to unveil the genetic underpinnings of breast cancer, this investigative effort aims not only to augment diagnostic precision but also to furnish invaluable foresight, enabling timely medical and lifestyle interventions. The predictive prowess of this model, as demonstrated by its performance metrics, represents a substantial stride towards the early detection of breast cancer, potentially contributing to the broader objectives of personalized medicine and genetic counseling in the oncology domain.
The main contribution of this study can be summarized as follows:
To the best of our knowledge, this is the first define-by-run DL model created to categorize DNA sequences.
Applying the define-by-run method in genomics provides a valuable case study for using this approach in other fields. It contributes to the broader field of ML by demonstrating the effectiveness of dynamic, data-driven model optimization.
The proposed define-by-run method's ability to dynamically adjust to data makes it a scalable solution for DNA sequence classification. As genomic datasets grow in size and complexity, this method's scalability becomes increasingly important.
This work is organized as follows: Section 2 introduces the dataset and the premise underlying the proposed approach; Section 3 presents the experimental results; Section 4 discusses them and compares them with previous work; and Section 5 concludes the study.
2 Materials and methods
Algorithm 1 outlines a DL method for DNA sequence classification. It begins with the loading and preprocessing of DNA sequences into numerical vectors. A DL model is then defined and trained using a TPE for hyperparameter optimization. The motivation behind employing a “define-by-run” DL approach emanates from its inherent capability to dynamically and iteratively optimize critical model hyperparameters during the learning process. This dynamic adaptation is envisioned to enhance the model's accuracy and predictive power, crucial for the precise diagnosis and prognosis of breast cancer through DNA sequencing.
Furthermore, unlike conventional DL models, which necessitate preset hyperparameters, the “define-by-run” DL model allows for a more flexible, data-driven optimization of hyperparameters such as layer count, neuron count per layer, learning rate, and batch size. This adaptability is anticipated to enable the model to learn from and adapt to the data more effectively, amplifying its performance and precision in identifying genetic markers associated with breast cancer. Finally, in the context of DNA sequencing, the “define-by-run” DL model, reinforced by the TPE algorithm, aims to refine the application of DNA sequencing in disease diagnosis and prediction. This approach potentially transcends the limitations of manual hyperparameter tuning, which is often computationally intensive and prone to biases, thus offering a more efficient and robust method for analyzing complex genomic data pertinent to breast cancer.
To evaluate the proposed method, various metrics assess the model's performance on a held-out test set. This process results in a model that differentiates between control and patient DNA sequences. Figure 1 depicts the workflow of the proposed method, while Algorithm 1 details DNA sequence classification with the define-by-run technique.
Algorithm 1: DNA sequence classification with define-by-run
Step | Operation | Details
---|---|---
1 | Load dataset | Fetch DNA sequences and corresponding labels (control or patient) from genomic databases
2 | Preprocess the DNA sequences | a. Break down each DNA sequence into 3-mers. b. Perform count vectorization on the 3-mers to generate numerical vectors
3 | Split the dataset | Partition the dataset into a training set (80%) and a test set (20%)
4 | Define the DL model | a. Specify model structure (layers, neurons per layer). b. Define hyperparameters (learning rate, batch size)
5 | Initialize TPE algorithm | a. Specify the hyperparameters to be optimized. b. Define the range of possible values for each hyperparameter
6 | Train the DL model | a. Employ the define-by-run method. b. Run TPE. c. Use the training set to train the model. d. Iteratively refine the model's structure and hyperparameters based on model performance
7 | Evaluate model performance on the test set | a. Compute accuracy, precision, recall, and F1-score. b. Generate a confusion matrix
2.1 Data acquisition
This research's initial phase involved gathering a comprehensive dataset of DNA sequences from both control and breast cancer patient groups, including individuals diagnosed with, or having a familial history of, genetic disorders and those without such a history. The dataset, sourced from reputable genomics databases, ensured the validity of the sequences and the consistency of the associated metadata. Additionally, DNA/genomic sequences were collected from the National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov), a public database of nucleotide sequences. The DNA sequence data were stored in FASTA files and ranged from 8 to 37,970 nucleotides in length. In terms of diversity, the dataset showcases a wide range of sequence lengths, which reflects the varied genetic information it contains. This diversity is crucial as it provides a rich source of genomic data for the DL model to learn from, thereby enhancing the model's capacity to accurately classify DNA sequences and identify genetic markers associated with breast cancer.
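As a concrete illustration, the minimal Python sketch below shows one way such FASTA files could be read into a single labelled table. The file names, the Biopython dependency, and the 0/1 label coding are assumptions for illustration only, not the exact acquisition pipeline used in this study.

```python
# Hypothetical sketch: load control and patient FASTA files into one labelled DataFrame.
# File names and the 0 = control / 1 = patient coding are illustrative assumptions.
import pandas as pd
from Bio import SeqIO  # Biopython

def load_fasta(path, label):
    # Read every record from one FASTA file and attach its class label.
    records = SeqIO.parse(path, "fasta")
    return pd.DataFrame(
        {"DNA sequence": [str(r.seq).upper() for r in records], "Class": label}
    )

df = pd.concat(
    [load_fasta("control.fasta", 0), load_fasta("patient.fasta", 1)],
    ignore_index=True,
)
print(df.shape, df["Class"].value_counts().to_dict())
```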
The distribution of each class and the number of samples are shown in Figure 2. Sample DNA sequences from the collection are included in Table 1, along with class labels.
Ethical and privacy concerns are paramount in harnessing DNA sequencing data for disease prediction. These encompass informed consent, data confidentiality, and the potential for genetic discrimination. This study addressed these considerations by adhering to stringent data privacy protocols, anonymizing data to protect individual identities, and ensuring the ethical handling and secure storage of sensitive genomic information. Such measures uphold the ethical integrity of the study while safeguarding the privacy and rights of the individuals whose data were utilized for advancing the understanding and prediction of genetic diseases.
Using a diverse dataset, comprising DNA sequences from breast cancer patients and healthy control groups, is critical for evaluating the model’s performance. Such diversity ensures the model's robustness and generalizability across various genetic backgrounds, aiding in accurately identifying genetic markers associated with breast cancer. It enhances the model's diagnostic accuracy by differentiating cancerous and non-cancerous genetic patterns better. Including a control group provides a baseline for validation and benchmarking, representing real-world population diversity, which is essential for inclusive and unbiased findings. Through a comparative analysis, the model can develop nuanced predictive models, potentially leading to new insights regarding breast cancer's genetic basis, thereby improving the overall effectiveness of the model in real-world clinical settings.
2.2 Pre-processing
The DNA sequences were preprocessed before applying the DL model to ensure they were in a suitable format. The initial step in preprocessing involved the creation of k-mers from the sequences. A k-mer is a substring of length k nucleotides extracted from a longer sequence. In this study, we focused on 3-mers, otherwise known as "triplets," which prior literature has shown to carry significant biological relevance, including in protein folding and function determination. Generating 3-mers involved breaking down each DNA sequence into all possible substrings of length 3. For instance, the sequence "AGTCGA" would be transformed into the 3-mers "AGT," "GTC," "TCG," and "CGA." This approach was applied to all sequences in the dataset. The choice of 3 as the value of k captures enough complexity in the sequence data while keeping the computational burden manageable.
Following the generation of 3-mers, count vectorization was applied. This step converted the lists of 3-mers into vectors that the DL model can understand. Count vectorization creates a “dictionary” of unique 3-mers across all sequences. Each sequence was then represented as a vector, where each element in the vector corresponds to the count of a particular 3-mer from the “dictionary” in the sequence. For instance, if the “dictionary” of 3-mers is “AGT,” “GTC,” “TCG,” “CGA,” “GAT,” a sequence with the 3-mers [“AGT,” “GTC,” “GTC”] would be represented as the vector [1, 2, 0, 0, 0], reflecting the counts of each 3-mer from the “dictionary.” This transformation allowed the DL model to process the sequences effectively, mapping complex biological information into a numerical format that can be learned from, thereby enhancing the model's predictive capacity.
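The preprocessing described above can be sketched as follows. The helper name `to_kmers` and the scikit-learn `CountVectorizer` configuration are illustrative assumptions consistent with the 3-mer and count-vectorization description, not the authors' exact code.

```python
# Minimal sketch of 3-mer extraction and count vectorization (assumed implementation).
from sklearn.feature_extraction.text import CountVectorizer

def to_kmers(sequence, k=3):
    # "AGTCGA" -> ["AGT", "GTC", "TCG", "CGA"]
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

sequences = ["AGTCGA", "TTAAAAGCAG"]                  # toy sequences for illustration
documents = [" ".join(to_kmers(s)) for s in sequences]

# Each column of X counts one 3-mer from the shared vocabulary (the "dictionary").
vectorizer = CountVectorizer(lowercase=False, token_pattern=r"\b\w+\b")
X = vectorizer.fit_transform(documents)               # shape: (n_sequences, n_unique_3mers)
print(vectorizer.get_feature_names_out())
print(X.toarray())
```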
2.3 Define-by-run DL model
The define-by-run approach is a deep learning technique that differs from the more commonly used define-and-run approach. In the define-and-run method, the network structure, including the number of layers and the number of neurons in each layer, is predetermined before the model is run. Once the model runs, its structure remains fixed and unchanged during the learning process.

With the define-by-run method, by contrast, the model's structure is not fully determined prior to training. Instead, during training, the model can dynamically adapt its structure and adjust its hyperparameters based on what it learns from the data and the performance indicators being monitored. This dynamic adaptation and optimization of hyperparameters during training can yield more reliable models that better fit the training dataset. The approach also excels at handling high-dimensional data, making it well suited to complex tasks such as DNA sequence classification.
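To make the contrast concrete, the short sketch below is an assumed, simplified illustration using Keras and an Optuna-style `trial` object: the network's depth and width are decided while the model is being built rather than fixed in advance. It is not the exact architecture used in this study.

```python
# Illustrative define-by-run construction: the architecture is decided at build time,
# driven by the trial's suggestions (simplified, assumed example).
import tensorflow as tf

def build_model(trial, input_dim, n_classes=2):
    model = tf.keras.Sequential([tf.keras.layers.Input(shape=(input_dim,))])
    n_layers = trial.suggest_int("n_layers", 1, 3)        # depth chosen at run time
    for i in range(n_layers):
        units = trial.suggest_int(f"units_{i}", 16, 32)   # width chosen per layer
        model.add(tf.keras.layers.Dense(units, activation="relu"))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    return model
```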
2.4 TPE algorithm
The TPE algorithm is a sequential model-based optimization algorithm for hyperparameter tuning. Hyperparameters are model parameters that cannot be learned from the data and must be set before training; they include the number of neural-network layers, the number of neurons in each layer, the learning rate, and the batch size. TPE surpasses traditional grid search and random search approaches for hyperparameter optimization. It builds a probabilistic model relating hyperparameters to the probability of achieving a given objective-function score and uses this model to propose the next set of hyperparameters to evaluate. TPE constructs the probabilistic model from two distributions: one for hyperparameters that produce good objective-function values and another for those that produce poor values. As the optimization progresses, the algorithm adaptively refines the hyperparameter values, resulting in a more efficient search of the hyperparameter space and faster convergence to optimal hyperparameters.

This study utilized TPE to optimize the define-by-run DL model's hyperparameters during training, improving the model's performance on the DNA sequence classification problem. The key optimized hyperparameters include the number of layers, the number of neurons per layer, the learning rate, and the batch size, among others. This dynamic optimization enhanced the model's accuracy, predictive power, and overall performance in identifying genetic markers associated with breast cancer. The define-by-run approach allows the model's structure and hyperparameters to adapt dynamically based on the data and the monitored performance indicators, which pairs naturally with the adaptive nature of the TPE algorithm: through iterative training sessions, TPE proposed new sets of hyperparameters, each proposal was evaluated, and the feedback was used to refine TPE's probabilistic model, progressively moving towards optimized hyperparameters.

Finally, by employing TPE, our methodology transcends the limitations of manual hyperparameter tuning, which can be computationally intensive and prone to bias. TPE's ability to navigate the hyperparameter space efficiently accelerates convergence to optimal hyperparameters, thereby improving the model's performance on the DNA sequence classification task and supporting the overall goal of precisely diagnosing and prognosticating breast cancer through DNA sequencing. Algorithm 2 details the steps of define-by-run DL with TPE hyperparameter optimization, and an illustrative code sketch follows the algorithm.
Algorithm 2: Define-by-Run DL with TPE hyperparameter optimization
Step | Operation | Details |
---|---|---|
1 | Define initial parameters | Define initial model parameters and hyperparameters |
2 | Initialize TPE algorithm | a. Specify the hyperparameters to be optimized. b. Define the range of possible values for each hyperparameter |
3 | Set iteration number | Set the number of define-by-run iterations (epochs) |
4 | Iterative process | For each iteration: a. Generate new hyperparameters using TPE. b. Build a new model instance with current hyperparameters. c. Train the model on training data. d. Evaluate the model on validation data, get validation loss. e. Provide validation loss to TPE as a result of current hyperparameters. f. Update the TPE model with the new result |
5 | Hyperparameter selection | After all iterations, choose the hyperparameters that yielded the lowest validation loss |
6 | Rebuild and retrain | Rebuild and retrain the model using the optimal hyperparameters on the entire dataset (training and validation data) |
7 | Evaluate optimized model | Evaluate the optimized model's performance on the test set |
8 | Return results | Return the optimized model, performance metrics, and optimal hyperparameters |
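A condensed Optuna sketch of the loop in Algorithm 2 is given below. The objective body is an assumed simplification: it presumes a `build_model(trial, ...)` helper like the one sketched in Section 2.3, one-hot encoded labels, and pre-split training/validation arrays (`X_train`, `y_train`, `X_val`, `y_val`), none of which are the authors' exact training script.

```python
# Assumed sketch of Algorithm 2 with Optuna's TPE sampler: each trial builds,
# trains, and scores a fresh model, and TPE proposes the next configuration.
import optuna
import tensorflow as tf

def objective(trial):
    # Steps 4b-4e: build a model for this trial, train it, report validation loss.
    model = build_model(trial, input_dim=X_train.shape[1])   # define-by-run construction
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                        epochs=10, batch_size=batch_size, verbose=0)
    return min(history.history["val_loss"])                  # feedback returned to TPE

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50)                        # 50 TPE iterations
print(study.best_params)        # Step 5: hyperparameters with the lowest validation loss
# Step 6: rebuild and retrain a final model with study.best_params on the full data.
```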
2.4.1 Hyperparameter optimization
To create the best-performing model for our experiment, we optimized the following hyperparameters (an illustrative code sketch of the resulting search space is given after this list):
Number of hidden layers: The number of hidden convolutional layers in the CNN was optimized. Increasing this number may lengthen training and risk overfitting, while decreasing it too far may produce underfitting.
Number of kernels in each hidden layer: The number of kernels in each hidden layer was also optimized. As with the number of layers, increasing the number of kernels can lengthen training times and increase the chance of overfitting, while decreasing it can result in underfitting.
Kernel size: Another parameter that was optimized was the size of the kernels utilized in the CNN layers.
Number of dense layers: The number of dense layers in the network was also optimized. Changing this number has the same effect as changing the number of hidden layers.
Units within each dense layer: Each layer's number of neurons was also optimized.
Activation function for every layer: The activation function in each neural network layer introduces nonlinearity, enabling the model to learn complex patterns. During optimization, the choice of activation function, such as Rectified Linear Unit (ReLU) or hyperbolic tangent (tanh), significantly impacts the model's performance. The optimal choice depends on the problem's specifics and data characteristics.
Stride: The stride size in a CNN determines the step size of the convolution operation. A smaller stride captures more detailed features but increases computational complexity. Conversely, a larger stride reduces computational requirements but may miss some details. Optimizing the stride size therefore balances detail capture and computational efficiency.
Optimizer: Either Adaptive Moment Estimation (Adam) or Stochastic Gradient Descent (SGD) was chosen to minimize the loss function. This choice can considerably impact the model's performance and rate of convergence.
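The bullet points above translate into a search space like the hedged sketch below. The ranges mirror Table 2, while the exact layer arrangement, padding, and the reshaping of count vectors into a `(length, 1)` input are illustrative assumptions rather than the authors' exact model.

```python
# Assumed define-by-run CNN whose search space follows the list above / Table 2.
import tensorflow as tf

def build_cnn(trial, input_len, n_classes=2):
    model = tf.keras.Sequential([tf.keras.layers.Input(shape=(input_len, 1))])
    for i in range(trial.suggest_int("n_conv_layers", 1, 3)):
        model.add(tf.keras.layers.Conv1D(
            filters=trial.suggest_categorical(f"kernels_{i}", [16, 32, 64]),
            kernel_size=trial.suggest_categorical(f"kernel_size_{i}", [3, 5, 7]),
            strides=trial.suggest_int(f"stride_{i}", 1, 2),
            activation=trial.suggest_categorical(f"activation_{i}", ["relu", "tanh"]),
            padding="same"))
    model.add(tf.keras.layers.Flatten())
    for j in range(trial.suggest_int("n_dense_layers", 1, 2)):
        model.add(tf.keras.layers.Dense(
            trial.suggest_int(f"dense_units_{j}", 16, 32), activation="relu"))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    optimizer = trial.suggest_categorical("optimizer", ["Adam", "SGD"])
    model.compile(optimizer=optimizer, loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```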
In our work, we also utilized the reduce-learning-rate and early-stopping callbacks to enhance the efficiency and performance of our DL model.
2.5 Callbacks
The reduce-learning-rate callback was used to adjust the learning rate dynamically during training. When the model's performance on the validation set stopped improving for a certain number of epochs (a period we defined as the "patience"), the learning rate was automatically reduced, which helped the model fine-tune its parameters and potentially escape local minima in the loss landscape, leading to improved performance. In addition, the early-stopping callback was employed to prevent unnecessary computation and potential overfitting. Training was automatically stopped if the model's performance on the validation set did not improve for a specified number of epochs (the "patience"), which ensured that we did not continue training past the point of optimal performance, saving computational resources and preventing the model from overfitting the training data.
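A minimal Keras sketch of these two callbacks follows; the monitored quantity is an assumption, while the factor of 0.5 and the patience values of 1 and 2 match the settings reported in Section 3.

```python
# Illustrative ReduceLROnPlateau and EarlyStopping configuration (assumed monitor).
import tensorflow as tf

callbacks = [
    # Halve the learning rate when validation loss stalls for 1 epoch (patience = 1).
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=1),
    # Stop training when validation loss fails to improve for 2 epochs (patience = 2).
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2),
]
# Typical usage (model and data assumed to exist):
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=50, callbacks=callbacks)
```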
By using these callbacks, we were able to make the training process more adaptive and efficient, which likely contributed to the high performance of our model.
2.6 Model evaluation and performance metrics
Following training, the performance of the DL model is evaluated using a set of common metrics for classification problems: accuracy, precision, recall, F1-score, and the confusion matrix. These metrics provide different perspectives on the model's performance and are based on the concepts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). These values are often summarized in a confusion matrix, a table layout that makes the algorithm's performance easy to inspect.
2.6.1 Accuracy
Accuracy is the ratio of correctly predicted observations (both positive and negative) to the total number of observations. It measures the model's overall ability to predict both classes correctly, but it can be misleading if the classes are imbalanced. It is calculated as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
2.6.2 Precision
Precision, or positive predictive value, measures the proportion of positive identifications that were actually correct. It is especially important when the cost of a FP is high. It is calculated as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$
2.6.3 Recall
Recall, or sensitivity, measures the proportion of actual positives that were identified correctly. It is especially important when the cost of a FN is high. It is calculated as follows:

$$\text{Recall} = \frac{TP}{TP + FN}$$
2.6.4 F1-score
The F1-score is a measure that combines precision and recall. As the harmonic mean of the two, it gives equal weight to both metrics. The F1-score reaches its best value at 1 (perfect precision and recall) and its worst at 0, and it is a good metric when the data have an imbalanced class distribution. It is calculated as follows:

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
2.6.5 Area under the curve (AUC)
AUC of the receiver operating characteristic (ROC) is a common performance metric used in ML, particularly for binary classification problems. The ROC curve is a plot that displays the TP rate against the FP rate at various threshold settings. The AUC measures the entire two-dimensional area underneath the ROC curve from (0,0) to (1,1).
The AUC value ranges from 0 to 1. A model whose predictions are 100% correct has an AUC of 1. Conversely, a model whose predictions are 100% wrong has an AUC of 0. An AUC of 0.5 is akin to random guessing (for instance, a coin flip).
The dataset was divided into a training set (80% of the data) and a test set (20% of the data). The model was trained on the training set, and its performance was evaluated on the test set using these metrics. The confusion matrix was generated based on the test set predictions to provide an overall view of how well the model could predict each class and where the misclassifications occurred (all the results are presented for the test data).
These metrics were chosen because they offer different perspectives on the model's performance. For instance, precision and recall are important for understanding the model's behavior in terms of FPs and FNs, which is crucial in medical diagnostic applications where misclassifications can have significant implications. On the other hand, the AUC-ROC provides a holistic view of the model's discriminatory power. Collectively, these metrics thoroughly evaluate the model’s performance, ensuring that it accurately and reliably identifies genetic markers associated with breast cancer.
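For concreteness, the metrics described in this section can be computed with scikit-learn as in the sketch below; the toy label and score values are invented purely for illustration, and in practice they come from the held-out test set.

```python
# Assumed evaluation sketch: standard scikit-learn metrics on the test-set predictions.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_test  = [0, 0, 1, 1]          # true labels (toy values; 1 = positive/patient class)
y_pred  = [0, 1, 1, 1]          # predicted labels
y_score = [0.1, 0.6, 0.8, 0.9]  # predicted probability of the positive class

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1-score :", f1_score(y_test, y_pred))
print("auc      :", roc_auc_score(y_test, y_score))
print(confusion_matrix(y_test, y_pred))  # rows: true class, columns: predicted class
```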
2.7 Loss function
Categorical cross-entropy is a loss function used in multi-class classification tasks, where each example belongs to one of many possible categories. It quantifies the difference between two probability distributions: the predicted distribution from the model and the actual distribution. The loss for a single data point is calculated as follows:

$$L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

Here, $C$ is the number of classes, $y_i$ is the true probability of class $i$ (1 for the correct class and 0 otherwise under one-hot encoding), and $\hat{y}_i$ is the predicted probability of class $i$.
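As a small worked example (with arbitrarily chosen values for the two-class case here): for a patient sample with true distribution $y = [0, 1]$ and prediction $\hat{y} = [0.2, 0.8]$, the loss is $-(0 \cdot \log 0.2 + 1 \cdot \log 0.8) = -\log 0.8 \approx 0.223$, whereas a confidently wrong prediction $\hat{y} = [0.9, 0.1]$ incurs $-\log 0.1 \approx 2.303$, showing how the loss penalizes confident misclassifications far more heavily.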
3 Results
This experiment established an initial learning rate of 0.001 for all weights. If the validation loss does not improve after one epoch, the learning rate is decreased by a factor of 0.5 using the reduce-learning-rate callback with a patience of one. If the validation loss has not improved after two epochs, early stopping with a patience of two is applied to reduce overfitting. The tests were conducted with a 1.7 GHz Intel Core i7 processor, 16 GB of RAM, Windows 11 Professional 64-bit, and a 4 GB NVIDIA GeForce graphics card. The models were developed with Python, the TensorFlow library, and the Optuna package, which provides the optimization algorithm.
The TPE algorithm optimizes the hyperparameters of the CNN model using the define-by-run method. The hyperparameters that allowed the model to achieve the best performance were estimated from the optimization results. The optimal hyperparameters are listed in Table 2, along with the hyperparameter search space. After the hyperparameters were randomly initialized, TPE ran for 50 iterations, evaluating candidate hyperparameters against the model's recall. The optimal CNN hyperparameters are those for which the classifier achieves the highest recall. The recall convergence curve over the iterations of this experiment is depicted in Figure 3.
Hyperparameters | Search space | Optimal value
---|---|---
Number of CNN layers | 1–3 | 3
Number of kernels in each CNN layer | [16, 32, 64] | 64 (all CNN layers)
Kernel size in each CNN layer | [3, 5, 7] | [7, 7, 5] (per CNN layer, respectively)
Number of dense layers | 1–2 | 2
Number of units in each dense layer | 16–32 | [26, 27] (per dense layer, respectively)
Stride | 1–2 | [1, 2, 2] (per CNN layer, respectively)
Activation function | [ReLU, tanh] | [tanh, ReLU, tanh] (per CNN layer, respectively)
Optimizer | [Adam, SGD] | Adam
The recall iteration convergence plot provides a visual representation of how the recall score of our DL model evolved during the optimization process. The x-axis represents the number of iterations, and the y-axis represents the recall score; each point corresponds to the recall score of the model at a particular iteration. At the beginning of the optimization process, the recall score varies significantly, which is expected as the TPE algorithm explores different sets of hyperparameters to understand the hyperparameter space better. As the iterations progress, the recall score starts to converge, indicating that the TPE algorithm is beginning to exploit the regions of the hyperparameter space that it has found to yield high recall scores. The fluctuations in the recall score become less pronounced, and the algorithm starts to fine-tune the hyperparameters within these promising regions. By the end of the optimization process, the recall score has converged to a relatively stable value, suggesting that the TPE algorithm has successfully identified a set of hyperparameters that yields a high recall score. The final model, trained with these optimized hyperparameters, is expected to perform well on unseen data, particularly in its ability to correctly identify positive cases, as indicated by the high recall score.
Figure 4 shows the importance of each hyperparameter. The most important hyperparameter is the kernel size, followed by the number of convolutional layers.
Table 3 presents the performance metrics of the optimal model obtained with the define-by-run technique, along with the number of training epochs and the training duration per epoch (results computed on the test data).
Accuracy | Recall | Precision | F1-score | AUC | Loss | No. of epochs | Time per epoch in s |
---|---|---|---|---|---|---|---|
0.871 | 0.871 | 0.872 | 0.872 | 0.95 | 0.30 | 10 | 5 |
4 Discussion
Table 4 demonstrates that an accuracy of 0.871 for classifying DNA sequences into two categories – patient and control – is quite high, especially considering the intricacy of genomic data. This suggests that the model correctly determines whether a DNA sequence belongs to a patient (someone with the disease) or a control (someone without the disease) approximately 87.1% of the time. The performance measurements of the optimal model obtained with the define-by-run method support this conclusion.
Method | Accuracy | Recall | Precision | F1-score | AUC | Loss | No. of epochs | Time per epoch in s |
---|---|---|---|---|---|---|---|---|
[17] | 0.862 | 0.84 | 0.832 | 0.82 | 0.937 | 0.49 | 10 | 16 |
[18] | 0.85 | 0.856 | 0.854 | 0.85 | 0.94 | 0.45 | 10 | 13 |
[19] | 0.843 | 0.834 | 0.833 | 0.835 | 0.93 | 0.56 | 10 | 8 |
[20] | 0.861 | 0.859 | 0.862 | 0.858 | 0.95 | 0.35 | 10 | 12 |
[21] | 0.852 | 0.848 | 0.861 | 0.856 | 0.945 | 0.56 | 10 | 10 |
Ours | 0.871 | 0.871 | 0.872 | 0.872 | 0.95 | 0.30 | 10 | 5 |
The precision of the model was determined to be 0.872, indicating that, of all the sequences the model predicted as patient, 87.2% were indeed from patients. This high precision emphasizes the model's ability to accurately identify patient sequences and minimize FPs, which is critical in avoiding unnecessary alarm or treatment.
The recall, or the model's sensitivity, was computed as 0.871, demonstrating that the model successfully detected 87.1% of all the actual patient sequences in the dataset. Such sensitivity is pivotal in medical diagnostics, where overlooking a patient's condition can lead to delayed or absent treatment, potentially resulting in severe consequences.
The F1-score, calculated as the harmonic mean of precision and recall, was determined to be 0.872. This score measures the model's balance between precision (minimizing FPs) and recall (minimizing FNs) and is particularly important in contexts where both these aspects are equally critical. An F1-score of 0.872 confirms that the model achieves a good equilibrium between identifying as many patients as possible and maintaining a low rate of false alarms.
An AUC of 0.95 indicates an excellent model. It means there is a 95% chance that the model will be able to distinguish between positive and negative class instances. This high AUC score suggests that the model has a high separability measure and can make highly accurate predictions.
The confusion matrix in Figure 5 offered another perspective to evaluate the model. It showed that most predictions were on the diagonal line, corresponding to correct classifications. This model generated fewer off-diagonal elements, which correspond to misclassifications, compared to previous models, validating the model's robustness.
The use of a define-by-run DL technique in combination with the TPE algorithm for hyperparameter optimization is responsible for the significantly high performance of this model. The model dynamically adapted to the training data, optimizing the hyperparameters during the learning process and improving performance. These promising results highlight the potential of using advanced DL techniques in genetic disorder prediction and diagnosis.
Based on the findings of this investigation, the role of DL and AI in genetic disorder diagnosis and management is poised for significant expansion. The define-by-run DL model's demonstrated ability to effectively classify DNA sequences lays a solid foundation for more sophisticated, real-time genetic analysis and personalized medicine. As the technology matures and integrates with clinical workflows, it will likely facilitate quicker, more accurate diagnoses and tailor treatments to individuals’ genetic profiles, potentially transforming the prognosis for many genetic disorders. Additionally, as data volumes grow and algorithms improve, we might evolve towards more holistic, AI-driven approaches that can analyze a wide spectrum of genomic, clinical, and environmental data to provide a more nuanced understanding of genetic disorders and their management. This progression would enhance personalized medicine and contribute to the broader scientific understanding of genetic diseases, potentially uncovering novel therapeutic targets and treatment strategies.
4.1 Comparison with the previous works
For a balanced comparison, using the same dataset and computational setup is crucial when comparing our work with previous studies. However, each previous study used a unique dataset, none of which is publicly accessible. To overcome this, we recreated the models proposed in the previous studies and trained them on our dataset, ensuring a consistent basis for comparison across all models. The comparison results are illustrated in Table 4, which shows that our proposed method outperformed the previous methods.
5 Conclusion
Integrating DL and genetic sequencing promises to transform the future of disease diagnosis and prediction. Our study successfully employed a define-by-run DL model, enhanced by TPE hyperparameter optimization, to analyze DNA sequences for disease prediction. The resulting model surpassed previous approaches in accuracy, precision, recall, and F1-score, suggesting that our method can significantly improve the ability to predict genetic disorders.
Notably, the dynamic and iterative nature of the define-by-run approach, coupled with the robustness of TPE for hyperparameter optimization, demonstrates a considerable potential for enhancing DL model performance in genomics and beyond. The remarkable performance of the DL model underscores the potential of leveraging deep learning for precise genetic disease diagnosis, aiding in personalized medicine through early detection and tailored treatment plans, thereby contributing to improved patient outcomes.
As we advance into the era of personalized medicine, these findings underline the promise of DL in harnessing the wealth of information embedded within our DNA. One of the primary challenges in genomic analysis is the quality and availability of genomic data. Ensuring that the data are well curated, accurately labeled, and sufficiently large to train robust models is essential. This challenge can be addressed by sourcing data from reputable genomic databases or consortia and applying rigorous data preprocessing and cleaning techniques to ensure data quality. Furthermore, the computational demands of DL models, especially when applied to large genomic datasets, can be significant; in this work, they were mitigated by employing efficient hardware resources. Further research will benefit from exploring more comprehensive datasets, integrating diverse types of genetic and non-genetic data, and adapting our approach to other domains of predictive analytics. Our study contributes significantly to this exciting journey, bringing us closer to unlocking the full potential of our genetic code.
Future studies could explore integrating additional genetic and non-genetic data types, further refining the model’s predictive capability. Also, given the dynamic nature of both the define-by-run approach and TPE optimization, the research could investigate the potential benefits of these methods in other fields of predictive analytics. On the other hand, the research needs to address some limitations, such as incorporating model interpretability techniques, using larger and more diverse datasets, and exploring other methods for hyperparameter optimization.
Acknowledgements
The authors would like to thank the University of Baghdad for their support in achieving this work.
Funding information: This research is not funded by any institution.

Author contributions: Conceptualization: A.T.H.A. and A.J.D.; methodology: R.K.M.; investigation: R.K.M.; resources: A.T.H.A.; data curation: R.K.M.; writing – original draft preparation: R.K.M.; writing – review and editing: A.J.D.; visualization: A.J.D.; supervision: A.T.H.A. and A.J.D.; project administration: R.K.M. All authors have read and agreed to the published version of the manuscript.

Conflict of interest: The authors declare no conflict of interest.

Data availability statement: The dataset used in this work is available upon request to the corresponding author.
References
[1] Vogenberg FR, Isaacson Barash C, Pursel M. Personalized medicine: Part 1: Evolution and development into theranostics. Pharm Ther. 2010;35(10):560–76.
[2] Goodwin S, McPherson JD, McCombie WR. Coming of age: Ten years of next-generation sequencing technologies. Nat Rev Genet. 2016;17(6):333–51. doi:10.1038/nrg.2016.49.
[3] Metzker ML. Sequencing technologies – the next generation. Nat Rev Genet. 2010;11(1):31–46. doi:10.1038/nrg2626.
[4] Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat Genet. 2018;50(9):1219–24. doi:10.1038/s41588-018-0183-z.
[5] Dewey FE, Grove ME, Pan C, Goldstein BA, Bernstein JA, Chaib H, et al. Clinical interpretation and implications of whole-genome sequencing. JAMA. 2014;311(10):1035–45. doi:10.1001/jama.2014.1717.
[6] Hamburg MA, Collins FS. The path to personalized medicine. N Engl J Med. 2010;363(4):301–4. doi:10.1056/NEJMp1006304. Erratum in: N Engl J Med. 2010;363(11):1092.
[7] Manolio TA, Chisholm RL, Ozenberger B, Roden DM, Williams MS, Wilson R, et al. Implementing genomic medicine in the clinic: The future is here. Genet Med. 2013;15(4):258–67. doi:10.1038/gim.2012.157.
[8] Mardis ER. DNA sequencing technologies: 2006–2016. Nat Protoc. 2017;12(2):213–8. doi:10.1038/nprot.2016.182.
[9] Green RC, Berg JS, Grody WW, Kalia SS, Korf BR, Martin CL, et al.; American College of Medical Genetics and Genomics. ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet Med. 2013;15(7):565–74. doi:10.1038/gim.2013.73.
[10] Biesecker LG, Green RC. Diagnostic clinical genome and exome sequencing. N Engl J Med. 2014;370(25):2418–25. doi:10.1056/NEJMra1312543.
[11] Al-Janabi MIH, Alheeti KMA, Alaloosy AAKA. Detecting malicious behaviour for SANET based on artificial intelligence algorithms. International Conference on Information and Communication Technologies (ICICT), Basrah, Iraq; 2021. p. 185–90. doi:10.1109/ICICT52195.2021.9568475.
[12] Al-Janabi AIA, Al-Janabi STSF, Al-Khateeb B. Image classification using convolution neural network-based hash encoding and particle swarm optimization. 2020 International Conference on Data Analytics for Business and Industry (ICDABI), Sakheer, Bahrain; 2020. p. 1–5. doi:10.1109/ICDABI51230.2020.9325655.
[13] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44. doi:10.1038/nature14539.
[14] Rawi AA, Elbashir MK, Ahmed AM. Classification of 27 heart abnormalities using 12-lead ECG signals with combined deep learning techniques. Bull Electr Eng Inform. 2023;12:2220–34. doi:10.11591/beei.v12i4.4668.
[15] Yang B, Liu F, Ren C, Ouyang Z, Xie Z, Bo X, et al. BiRen: Predicting enhancers with a deep-learning-based model using the DNA sequence alone. Bioinformatics. 2017;33(13):1930–6. doi:10.1093/bioinformatics/btx105.
[16] Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. A primer on deep learning in genomics. Nat Genet. 2019;51(1):12–8. doi:10.1038/s41588-018-0295-5.
[17] Ali F, Kabir M, Arif M, Swati ZNK, Ullah Khan Z, Ullah M, et al. DBPPred-PDSD: Machine learning approach for prediction of DNA-binding proteins using Discrete Wavelet Transform and optimized integrated features space. Chemom Intell Lab Syst. 2018;182:21–30. doi:10.1016/j.chemolab.2018.08.013.
[18] Alakus TB, Baykara M. Comparison of monkeypox and wart DNA sequences with deep learning model. Appl Sci. 2022;12(20):10216. doi:10.3390/app122010216.
[19] Penzar D, Nogina D, Noskova E, Zinkevich A, Meshcheryakov G, Lando A, et al. LegNet: A best-in-class deep learning model for short DNA regulatory regions. Bioinformatics. 2023;39(8):btad457. doi:10.1093/bioinformatics/btad457.
[20] Zhang G, Wang H, Zhang Z, Zhang L, Guo G, Yang J, et al. Ultra-accurate classification and discovery of functional protein-coding genes from microbiomes using FunGeneTyper: An expandable deep learning-based framework. bioRxiv; 2022. doi:10.1101/2022.12.28.522150.
[21] Gunasekaran H, Ramalakshmi K, Rex Macedo Arokiaraj A, Deepa Kanmani S, Venkatesan C, Suresh Gnana Dhas C. Analysis of DNA sequence classification using CNN and hybrid models. Comput Math Methods Med. 2021;2021:1835056. doi:10.1155/2021/1835056.
[22] Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res. 2012;13:281–305.
[23] Hutter F, Hoos HH, Leyton-Brown K. Sequential model-based optimization for general algorithm configuration. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); 2011. doi:10.1007/978-3-642-25566-3_40.
[24] Snoek J, Larochelle H, Adams RP. Practical Bayesian optimization of machine learning algorithms. Proc NIPS; 2012. p. 1–9. https://namhoonlee.github.io/courses/optml/rg/group-12.pdf.
[25] Bergstra J, Yamins D, Cox DD. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In: 30th International Conference on Machine Learning (ICML 2013); 2013. http://proceedings.mlr.press/v28/bergstra13.pdf.
[26] Li L, Jamieson K, DeSalvo G, Rostamizadeh A, Talwalkar A. Hyperband: A novel bandit-based approach to hyperparameter optimization. J Mach Learn Res. 2018;18:1–52. https://arxiv.org/abs/1603.06560.
[27] Cawley GC, Talbot NLC. On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res. 2010;11:2079–107. https://www.jmlr.org/papers/volume11/cawley10a/cawley10a.pdf.
[28] Shahriari B, Swersky K, Wang Z, Adams RP, De Freitas N. Taking the human out of the loop: A review of Bayesian optimization. Proc IEEE. 2016;104(1):148–75. doi:10.1109/JPROC.2015.2494218.
[29] Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: A next-generation hyperparameter optimization framework. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2019. doi:10.1145/3292500.3330701.
[30] Bergstra J, Komer B, Eliasmith C, Yamins D, Cox DD. Hyperopt: A Python library for model selection and hyperparameter optimization. Comput Sci Discov. 2015;8:014008. doi:10.1088/1749-4699/8/1/014008.
[31] Yamins DL, DiCarlo JJ. Using goal-driven deep learning models to understand sensory cortex. Nat Neurosci. 2016;19:356–65. doi:10.1038/nn.4244.
[32] Angermueller C, Pärnamaa T, Parts L, Stegle O. Deep learning for computational biology. Mol Syst Biol. 2016;12:878. doi:10.15252/msb.20156651.
[33] Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25:44–56. doi:10.1038/s41591-018-0300-7.
© 2023 the author(s), published by De Gruyter
This work is licensed under the Creative Commons Attribution 4.0 International License.