Article

Optimization of Deep Neural Networks Using a Micro Genetic Algorithm

by Ricardo Landa 1,*, David Tovias-Alanis 1 and Gregorio Toscano 2
1 Tamaulipas Campus, Cinvestav, Cd. Victoria 87130, Mexico
2 Department of Electrical Engineering and Computer Science, The Catholic University of America, Washington, DC 20064, USA
* Author to whom correspondence should be addressed.
AI 2024, 5(4), 2651-2679; https://doi.org/10.3390/ai5040127
Submission received: 27 October 2024 / Revised: 14 November 2024 / Accepted: 18 November 2024 / Published: 2 December 2024
(This article belongs to the Section AI Systems: Theory and Applications)
Figure 1. General scheme of the proposed method. Top: CNN model trained on a source domain. Center: Transfer learning of the pre-trained model parameters and tuning of the model weights of a DNN (FC layers) using the μGA-DNN algorithm. Bottom: Schematic illustrating the operation of the proposed method.
Figure 2. Example of an FC layer architecture: The input layer is related to the $d$ features automatically extracted by the convolutional layers of the CNN model. In this architecture, there are $m$ hidden layers, where $L_1, L_2, \ldots, L_m$ indicate the number of neurons in each. Finally, the output layer provides a response (prediction) $z_i$, with $i = 1, \ldots, c$, for each of the $c$ classes of the input dataset.
Figure 3. Example of an FC layer architecture. The input layer is related to the features of the dataset. Thus, this scheme assumes that the problem has a dimensionality of $d = 2048$. Additionally, it has $m = 2$ hidden layers, each with 512 neurons, and the output layer has $c$ neurons, where $c$ is the number of classes of the problem.
Figure 4. Example of a chromosome using the binary representation μGA-1, which is composed of three binary blocks representing the first hidden layer ($L_1$), the second hidden layer ($L_2$), and the learning rate ($l_r$).
Figure 5. Example of a chromosome that uses the binary representation μGA-FS, which is composed of 2051 binary blocks. The first two blocks consist of 9 bits each, the third of 17 bits, and the last 2048 blocks each consist of 1 bit.
Figure 6. Example of a chromosome using the binary representation μGA-MRMR, which is composed of four binary blocks.
Figure 7. Results of the ACC indicator obtained by each method are shown using boxplots. The best values are indicated in bold. The mean of ACC and the $p$-value of the Wilcoxon rank sum test are shown at the top of each plot.
Figure 8. Results of the mean FSR obtained by each method. The top part shows the $p$-value of the Wilcoxon rank sum test obtained by comparing each method with $F^R$ (the best variant of the proposed method in terms of this indicator). In bold, $p < 0.05$.
Figure 9. Results of the mean HNR obtained by each method. The top part shows the $p$-value of the Wilcoxon rank sum test obtained by comparing each method with $F^R$ (the best variant of the proposed method in terms of this indicator). In bold, $p < 0.05$.
Figure 10. Results of the average MC obtained by each method. The best values are indicated in bold.
Figure 11. Results of the average number of evaluations of the objective function obtained by each method. The best values are indicated in bold.
Figure 12. Results of the average runtime (in seconds) of each method.
Figure 13. Comparison of classification accuracy for the proposed models and the complete reference model.

Abstract

This work proposes the use of a micro genetic algorithm to optimize the architecture of fully connected layers in convolutional neural networks, with the aim of reducing model complexity without sacrificing performance. Our approach applies the paradigm of transfer learning, enabling training without the need for extensive datasets. A micro genetic algorithm requires fewer computational resources due to its reduced population size, while still preserving a substantial degree of the search capabilities found in algorithms with larger populations. By exploring different representations and objective functions, including classification accuracy, hidden neuron ratio, and minimum redundancy and maximum relevance for feature selection, eight algorithmic variants were developed, six of which perform both hidden-layer reduction and feature selection. Experimental results indicate that the proposed algorithm effectively reduces the architecture of the fully connected layers in the convolutional neural network. The variant achieving the best reduction used only 44% of the convolutional features in the input layer and only 9.7% of the neurons in the hidden layers, without a statistically significant loss of classification accuracy when compared to a network model based on a full reference architecture and a representative method from the literature.

1. Introduction

Machine learning is a branch of artificial intelligence dedicated to the development of specialized algorithms for constructing complex mathematical models from large volumes of data [1].
Traditionally, feature extraction for various machine learning tasks has been performed through human engineering, where a designer manually identifies a set of descriptors for a specific problem (hand-crafted features). However, in the paradigm known as deep learning, the set of descriptors is automatically learned. Thus, deep learning is a branch of machine learning in which a computer learns through hierarchies of concepts at distinct levels of abstraction. This approach reduces the need for domain experts to manually extract attributes from datasets [2].
In recent years, deep learning models have achieved high performance on complex classification problems [3]. In particular, the convolutional neural network (CNN) represents the most widely used deep learning model to address computer vision tasks. Examples of such applications include object detection, semantic segmentation, and medical image classification, among others [4,5,6].
A CNN is an artificial neural network (ANN) designed to process data with a grid-like topology (e.g., images). This type of network is composed of two main parts: the first is a feature (attribute) extractor that performs convolution and pooling operations on the input data. The second part consists of one or more fully connected (FC) layers, in which neurons connect to all neurons in the previous layer to solve a supervised learning problem (e.g., classification).
Learning in CNNs requires a massive amount of training data, which poses several challenges. First, training a CNN architecture involves a high computational cost [7]. Additionally, data collection problems can also arise, which is a limitation for various real-world applications where a sufficiently large training dataset is often not available [8].

1.1. Transfer Learning

The paradigm known as transfer learning (TL) has been introduced in recent years to address these challenges. The main idea is that a model obtained to solve a classification problem in one domain can be applied to a different but related classification problem. In the context of CNN training, TL transfers the network parameters from the source domain to the target domain. This approach is effective in reducing the risk of overfitting in a CNN model trained on a small dataset [9]. It seeks to improve the CNN feature extractor in the target domain, such that only the parameters related to the FC layers need to be tuned. These parameters can be organized in two levels: the weights of the network connections and the parameters that refer to the model design, known as hyperparameters [10].
The optimal adjustment of the number of layers and neurons in each layer of the FC part of a CNN remains an open problem in the research field. Therefore, this task is commonly posed as an optimization problem, where the aim is to maximize the classification accuracy and minimize the complexity of the network architecture. The goal is to obtain a simpler network topology that enhances classification accuracy and reduces the computational cost in the training process [11].
The number of neurons in the input layer of the FC part of a CNN model corresponds to the number of attributes obtained from the convolutional layers. Feature selection (FS) seeks to reduce the space of input variables by selecting a subset of attributes that offer a better description of the problem. Thus, the objective is to maximize classification accuracy while minimizing the complexity of the network architecture by reducing the number of neurons in the input layer. Therefore, several methods have been proposed to adjust the number of neurons in this layer, performing the FS task to select those attributes that allow increasing the classification accuracy of the model [12].
Techniques designed to reduce the number of parameters in the architecture of a CNN, both in the convolutional layers and in the FC-type layers, are called pruning algorithms. These algorithms seek to reduce the complexity of the model by minimizing the total number of layers in the network and the number of neurons in each layer while attempting to avoid negative impacts on classification accuracy [13].

1.2. Optimization Algorithms

Genetic algorithms (GAs) are meta-heuristics inspired by Darwin’s theory of evolution, in which natural selection plays a crucial role. In GAs, this process is conceptualized as an optimization task. Given an environment (optimization problem) with a population of individuals (candidate solutions) competing to survive and reproduce, each individual is assigned a fitness value (its evaluation in the objective function). The higher the fitness, the greater the probability of survival and reproduction. Based on their fitness values, a selection of parents is made, and crossover and mutation are applied to generate offspring. These offspring then compete with their parents to become part of the population of the next generation [14].
The GA starts with a randomly generated population, where the population size is user-determined, typically ranging from 50 to 200 individuals. The fitness of each element of the population is evaluated, and new solutions are generated through the application of crossover and mutation. After applying these genetic operators, it is common to use a simple elitism scheme to preserve the best solution from each generation.
GAs have been shown to outperform algorithms based on a single solution, such as gradient-based methods, particularly when applied to challenging problems involving local optima or large plateaus. That is why, in specialized literature, there is an increasing use of GAs to perform pruning tasks in both the convolutional and FC layers of CNNs.
However, GAs are computationally intensive, mainly because of the large populations to be evaluated at each generation. The high computational cost of evaluating a large population that encodes the hyperparameters of the convolutional layers has led to the use of GA-based pruning methods that transfer pre-trained parameters more efficiently [8]. Additionally, there have been proposals to keep the pre-trained model (i.e., the convolutional layers) intact and prune only the FC layers with a GA, as in [13], which also considers the optimization of the FS task.
Despite these advancements, the use of GAs remains computationally expensive. In addition to populations of up to 200 individuals, they encode variables from 3 to 12,416 dimensions [8,15]. For this reason, we propose a method based on a micro genetic algorithm (μGA) that uses only four individuals in the population [16]. Some theoretical studies suggest that a small population of even three or four individuals is sufficient to achieve convergence in an optimization problem [17]. However, practical implementations of this idea are scarce, with the μGA being one of the most notable approaches. By using this algorithm, this proposal seeks to balance the global exploration capabilities of traditional GAs with the reduced computational burden of gradient-based numerical methods.
Our approach involves using a μGA to simultaneously optimize both the performance (by adjusting the learning rate) and the architecture of the FC layers (by adjusting the number of input and hidden neurons) of a CNN. The primary indicators used are classification accuracy and complexity of the network architecture. Reducing the number of input neurons in the FC part of a CNN involves not only determining how many to include but also identifying which input features are most relevant for the model (i.e., performing the FS task). All of this is performed within the framework of the recently proposed TL paradigm.

2. State of the Art

A significant amount of research has been conducted on optimizing the architecture of traditional ANNs (i.e., networks that do not use convolutional layers). Some approaches focus on designing the input layer [15,18,19], others on the hidden layer [20,21,22], and some on optimizing both [23,24]. However, these methods do not involve tuning weights using GAs, since they primarily concentrate on the optimization of the network architecture and/or the adjustment of the parameters of the backpropagation-based training algorithms.
For CNN models, existing works are centered around TL, with approaches oriented to the design of convolutional layers [8,25,26,27,28]. These works address the tasks of optimizing network architecture and layer selection using GAs. However, optimizing the convolutional layers involves a considerably higher computational cost than optimizing the FC layers due to the large number of parameters.

2.1. Approaches with Transfer Learning That Optimize Fully Connected Layers

A recent trend in the literature is to optimize the FC layers while keeping the convolutional architecture intact from the source domain in TL. This approach is based on the assumption that the features extracted in the source domain are representative enough to support the effective design and training of FC layers in a given target domain. There are two main approaches in the literature using CNNs with TL that aim to optimize the FC layers.
On the one hand, Bibi et al. [29] proposed a content-based image retrieval method using a TL approach with a pre-trained CNN model, a variant of the standard GA, and an ANN model known as the extreme learning machine classifier.
On the other hand, Poyatos et al. [13] developed a method called EvoPruneDeepTL (evolutionary pruning model for deep neural networks based on transfer learning). This approach performs evolutionary pruning by using TL to import the weights of a pre-trained CNN and a steady-state GA that fine-tunes the architecture of the FC layers. Two optimization criteria were considered: maximizing classification accuracy and minimizing model complexity, which depends on the number of neurons in the hidden layers. However, the objective function only optimizes classification performance; model complexity is handled by a selection criterion that favors solutions with fewer neurons.
During the feature extraction stage, the weights of a pre-trained CNN model (ResNet50 [30]) are used. Additionally, two variants of the proposed method are presented, both using the same cost function. The first variant minimizes the number of neurons in the hidden layers, while the second performs pruning exclusively in the input layer, functioning as a feature selector.
The results demonstrated that the optimization process involving neuron pruning obtained better classification accuracy than pruning connections between neurons. Furthermore, this method achieved better performance in terms of both classification accuracy and model architectural complexity compared to the reference models adopted.
Our proposed approach also focuses on optimizing the FC layers using TL.

2.2. Limitations for Optimization Algorithms

Due to the large number of parameters in CNN models, training is only feasible with the use of GPUs. Depending on the architecture, a single run of the training algorithm can take days or even weeks [31]. As a result, methods that optimize the architecture of pre-trained CNN models are often limited to a small number of evaluations of the objective function. Alternatively, weight-tuning algorithms may be configured with a smaller training batch size and a limited number of epochs.
This limitation in the number of evaluations and configurations for the weight tuning algorithm can lead to issues such as lack of convergence, inadequate exploration of the search space, or, in general, lower-quality solutions. Therefore, it is crucial to find a balance between the quality of solutions and the computational burden in the evolutionary optimization process for these models.
Developing algorithms that accelerate the parameter training process of CNN models is of great relevance due to the significant time and computational resources required for optimization. In this paper, we propose the use of a μGA with a small population size. This approach allows for a larger number of generations and achieves reliable performance in search space exploration while keeping computational effort bounded.
Using the μGA in this context has several advantages, such as reducing the computational time and the number of objective function evaluations. By keeping the population small, the algorithm can focus on promising regions of the search space, allowing faster convergence to high-quality solutions. However, the risk of using small populations is stagnation or premature convergence. In this work, we adopted an elitist re-initialization mechanism that helps preserve the best solutions found so far and ensures that the algorithm keeps exploring the search space, thus preventing stagnation.
The approach proposed in this work seeks to balance computational efficiency and solution quality by optimizing the learning rate and the architecture in both input and hidden layers of CNN models. By combining the advantages of the μGA with the elitist re-initialization mechanism, we seek to obtain promising results in terms of performance and reduction in model complexity.
A comparison of previous works, together with our proposed approach, is summarized in Table 1. Several columns are informative for a full comparison, but we would like to highlight columns TL model, GA type Name, GA Pop., FS and HL, as they contain the novel features of our work. Our method, in the last line, proposes the use of TL (in contrast to the first eight approaches listed), together with a μGA with a reduced population size of just four individuals, applied to the optimization of both the input (performing the FS task) and hidden layers (in contrast to the previous approaches using TL) while maintaining model performance.

3. A Deep Neural Network Based on a Micro Genetic Algorithm (μGA-DNN)

This section introduces the proposed algorithm, which consists of a μGA with a population of four individuals to optimize a weighted objective function that considers two criteria: (1) maximizing the classification accuracy of a DNN model and (2) minimizing the complexity of the network architecture (FC layers of a CNN model).
To achieve this, three different representations in the binary domain have been designed to encode the individuals in the population, considering the number of hidden neurons, the learning rate for training the DNN model, and the features of the dataset to perform the FS task.
Additionally, four variants of the objective function were developed to guide the search process toward solutions that meet the mentioned optimization criteria. By combining the different representations with at least two objective functions, a total of eight variants of the proposed method were obtained. Figure 1 shows the general scheme of the proposed algorithm, named the Deep Neural Network based on a Micro Genetic Algorithm (μGA-DNN).
The μGA-DNN algorithm seeks to find a balance between classification accuracy and complexity of the network architecture, thus allowing it to obtain efficient and effective DNN models for classification tasks. By combining different representations and objective functions, the proposed algorithm is designed to adapt to various scenarios and provide high-quality solutions. In summary, the proposed μGA-DNN algorithm uses an evolutionary approach to optimize the architecture of the FC layers of a CNN model, the learning rate of the DNN model training algorithm, and the performance of the FS task.
In this work, we use a pre-trained CNN model, ResNet50 [30], which has been pre-trained on the ImageNet [32] dataset. This approach is based on the idea of TL, where the parameters of the pre-trained model are transferred and adapted to solve another problem, known as the target domain.
The μGA is applied to tune the architecture and parameters of the DNN training algorithm, tailoring them to the specific requirements of the target domain and thereby improving the model's performance on classification tasks. Once an optimized architecture has been obtained, the model is trained with the target domain dataset to adjust the DNN parameters to the specific characteristics of the problem. The performance of the trained model is then evaluated using an independent test dataset, which provides an estimate of the model's actual performance in the classification task.
The use of a μGA [16] promotes rapid convergence and is efficient in locating promising regions of the search space; because it works with very few individuals, evaluating the entire population is fast. However, small populations cannot maintain diversity over multiple generations. To address this, a mechanism that restarts the population is included when diversity is compromised. Additionally, a simple elitism scheme is used to preserve the best individual in the population. This approach helps prevent convergence to local minima, since the genes of the re-initialized individuals restore the diversity of the population.
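The restart-plus-elitism loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the binary tournament, the uniform crossover without mutation, the 5% Hamming-distance convergence threshold, and the names `micro_ga` and `new_individual` are all assumptions made for the example.

```python
import random

def micro_ga(fitness, n_bits, generations=200, pop_size=4, seed=0):
    """Maximize `fitness` over bit lists with a tiny population (sketch)."""
    rng = random.Random(seed)

    def new_individual():
        return [rng.randint(0, 1) for _ in range(n_bits)]

    population = [new_individual() for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Elitism: the best individual is copied unchanged into the next generation.
        offspring = [best[:]]
        while len(offspring) < pop_size:
            # Assumed operators: binary tournament selection and uniform crossover.
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            offspring.append(child)
        population = offspring
        best = max(population, key=fitness)
        # Assumed diversity test: restart when every individual is within
        # 5% Hamming distance of the best one.
        spread = max(sum(x != y for x, y in zip(ind, best)) for ind in population)
        if spread <= 0.05 * n_bits:
            # Restart: keep the elite, re-initialize the rest at random.
            population = [best[:]] + [new_individual() for _ in range(pop_size - 1)]
    return best
```

For instance, `micro_ga(sum, n_bits=20)` maximizes the number of ones in a 20-bit string. In the setting of this paper, each fitness call would involve training and testing a DNN, which is why keeping `pop_size` at four matters.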

3.1. Objective Functions

This section outlines the design of the four proposed objective functions.

3.1.1. Maximizing Classification Accuracy

The first objective function considers only the classification accuracy criterion of the DNN model, which is measured in terms of the metric ACC.
Definition 1 (Classification Accuracy (ACC)).
The hit rate that measures the classification performance of the network model on an independent test set of patterns [33]. The higher the accuracy, the better the performance of the model.
Therefore, the optimization problem is defined as
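$$
\begin{aligned}
\text{Maximize} \quad & f_1(q) = \mathrm{ACC} \\
\text{s.t.} \quad & q \in \Omega \\
& \min(L_i) \geq 25, \quad i = 1, \ldots, m
\end{aligned}
\tag{1}
$$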
where $f_1 \in [0, 1]$ is the objective function, $q$ is the binary vector encoding a DNN architecture and the learning rate of the model training algorithm, and $\Omega$ is the set of feasible solutions. The condition $\min(L_i) \geq 25$ sets the minimum number of neurons that the optimized architecture can have in the $i$-th hidden layer, denoted $L_i$. In this work, a minimum limit of 25 neurons is considered, which was established empirically.
In this work, we consider a reference architecture with $m = 2$ hidden layers to minimize the complexity of the DNN architecture. $L_i$ represents the total number of neurons in the $i$-th hidden layer, with $i = 1, \ldots, m$ (Figure 2).
Since the optimization criterion is maximizing classification accuracy (ACC), an additional mechanism is needed to address the minimization of the complexity of the network architecture in cases of a tie. To carry this out, a selection rule is introduced, inspired by a constrained optimization work [34], where the objective function is always the dominant criterion and the constraint violation is secondary. This rule is used both in the selection of parents for crossover and in the elitism mechanism of the μGA and is as follows:
Definition 2 (Selection rule S).
If the algorithm finds two or more solutions with identical classification accuracy, the solution with the fewest neurons in the hidden layers will be selected.
This rule S balances the search for solutions that maximize classification accuracy while minimizing the complexity of the network architecture. As a result, simpler and more efficient solutions can be found that provide a classification performance comparable to that of more complex architectures.
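A minimal sketch of this tie-breaking comparison is given below. The `Candidate` type, its field names, and the function `better` are hypothetical names introduced for illustration; exact floating-point equality is used for the tie test, which is an assumption about how "identical" accuracy is detected.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    fitness: float        # objective value, e.g. ACC for f1
    hidden_neurons: int   # total neurons over all hidden layers

def better(a: Candidate, b: Candidate) -> Candidate:
    """Return the preferred candidate under the selection rule S."""
    # The objective function is always the dominant criterion.
    if a.fitness != b.fitness:
        return a if a.fitness > b.fitness else b
    # Tie on the objective: prefer the architecture with fewer hidden neurons.
    return a if a.hidden_neurons <= b.hidden_neurons else b
```

The same comparison can serve both in parent selection and when choosing the elite individual, which matches the two places where the rule is said to be applied.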

3.1.2. Maximizing Classification Accuracy and Percentage of Hidden Neurons Removed

The second objective function seeks to maximize classification accuracy and minimize the architectural complexity of the DNN. This is achieved through a linear combination of ACC and the performance measure known as the hidden neuron ratio (HNR).
Definition 3 (Hidden Neuron Ratio (HNR)).
The proportion of hidden neurons relative to the maximum number of hidden neurons in the baseline architecture. A lower value of HNR implies that more hidden neurons have been pruned, which in turn can reduce model complexity and help prevent overfitting.
In this scenario, a second criterion in the case of a tie is unnecessary, since a reduction in the number of hidden neurons is explicitly considered in the objective function by including the term HNR. Thus, the optimization problem is formulated as
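$$
\begin{aligned}
\text{Maximize} \quad & f_2(q) = w_1 \cdot \mathrm{ACC} + w_2 \cdot (1 - \mathrm{HNR}) \\
\text{s.t.} \quad & q \in \Omega \\
& \min(L_i) \geq 25, \quad i = 1, \ldots, m
\end{aligned}
\tag{2}
$$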
where $w_1$ and $w_2$ are the relevance weights of each optimization criterion. In this study, we use $w_1 = w_2 = 0.5$, which indicates that both criteria, ACC and HNR, are equally important.
If $f_2 \to 1$, the DNN architecture encoded in the solution $q$ achieves high classification accuracy with a low percentage of hidden neurons. Conversely, $f_2 \to 0$ suggests poor performance in both aspects.
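As a worked illustration of the second objective, consider a hypothetical candidate with 128 and 64 hidden neurons and ACC = 0.90, measured against a baseline with two hidden layers of 512 neurons each. These numbers are chosen only for the example and are not experimental results; HNR is taken here as the total number of hidden neurons divided by the baseline total, which is an assumed reading of Definition 3.

```latex
\begin{align*}
\mathrm{HNR} &= \frac{128 + 64}{512 + 512} = \frac{192}{1024} = 0.1875, \\
f_2(q) &= 0.5 \cdot 0.90 + 0.5 \cdot (1 - 0.1875) = 0.45 + 0.40625 = 0.85625.
\end{align*}
```

With equal weights, a pruned architecture can therefore score well even if its accuracy is slightly below that of a larger network.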

3.1.3. Maximizing Classification Accuracy and MRMR Feature Selection Criteria

For this objective, we employ the minimum-redundancy and maximum-relevance indicator, widely used in FS.
Definition 4 (Minimum Redundancy and Maximum Relevance (MRMR)).
The criterion that measures the interdependence between attributes (redundancy) and their association with target variables or class labels (relevance) [35].
The concept of feature redundancy refers to determining the attributes that have a high correlation with each other, often computed using the Pearson correlation coefficient and denoted as W. Feature relevance refers to those attributes that have the highest influence on the variability of class labels in a given dataset, computed via point biserial correlation and denoted as V.
The original authors [35] proposed two ways to combine these criteria: $V - W$ and $V / W$. In this work, we adopt a normalized version of the $V - W$ formulation. Thus, a value near 0 means that the selected features have a high redundancy between each pair of attributes and a low relevance in relation to the class label variable. On the other hand, a value near 1 indicates that there is a low redundancy of information between the attributes and a high relevance of this information in terms of its link with the class label variable.
Thus, the third objective function considers maximizing both the classification accuracy (ACC) and the MRMR criterion:
$$
\begin{aligned}
\text{Maximize} \quad & f_3(q) = w_1 \cdot \mathrm{ACC} + w_3 \cdot \mathrm{MRMR} \\
\text{s.t.} \quad & q \in \Omega \\
& \min(L_i) \geq 25, \quad i = 1, \ldots, m
\end{aligned}
\tag{3}
$$
Here, $w_1 = w_3 = 0.5$, indicating that both ACC and MRMR are of equal importance in the optimization process.
Like the objective function $f_1$ (Equation (1)), $f_3$ does not explicitly consider the criterion for minimizing the complexity of the DNN architecture. Therefore, the selection rule described in Section 3.1.1 is also incorporated.
Thus, if $f_3 \to 1$, this indicates that the DNN architecture encoded in the solution $q$ achieves high performance in terms of ACC and selects a feature set that maximizes the MRMR criterion. On the other hand, if $f_3 \to 0$, it represents the opposite.

3.1.4. Maximizing Classification Accuracy, Percentage of Hidden Neurons Removed, and MRMR Criterion

The fourth objective function performs a linear combination of ACC, HNR, and MRMR to maximize the classification accuracy, minimize the complexity of the network architecture, and maximize the MRMR criterion simultaneously by selecting a subset of attributes. This optimization problem is defined as follows:
$$
\begin{aligned}
\text{Maximize} \quad & f_4(q) = w_1 \cdot \mathrm{ACC} + w_2 \cdot (1 - \mathrm{HNR}) + w_3 \cdot \mathrm{MRMR} \\
\text{s.t.} \quad & q \in \Omega \\
& \min(L_i) \geq 25, \quad i = 1, \ldots, m
\end{aligned}
\tag{4}
$$
In this work, we adopted equal weights, $w_1 = w_2 = w_3 = 0.\overline{33}$, so ACC, HNR, and MRMR are equally important.
Therefore, if $f_4 \to 1$, it means that the network architecture encoded in the binary vector $q$ achieves high performance in terms of classification accuracy, minimizes the number of hidden neurons, and selects a subset of features that maximize the MRMR criterion. Conversely, if $f_4 \to 0$, it means the opposite.
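Because all four objective functions use equal weights over the terms they include, they can be expressed by one helper. The sketch below is an illustrative reading of the four formulations, assuming the indicators ACC, HNR, and MRMR have already been computed and lie in [0, 1]; the function name and its keyword arguments are hypothetical.

```python
from typing import Optional

def weighted_objective(acc: float,
                       hnr: Optional[float] = None,
                       mrmr: Optional[float] = None) -> float:
    """Equal-weight objective covering f1 to f4.

    f1: acc only; f2: acc and hnr; f3: acc and mrmr; f4: all three.
    """
    terms = [acc]
    if hnr is not None:
        terms.append(1.0 - hnr)   # fewer hidden neurons gives a larger term
    if mrmr is not None:
        terms.append(mrmr)
    # Equal weights: 1, 0.5, or 1/3 depending on how many terms are present.
    return sum(terms) / len(terms)
```

For example, `weighted_objective(0.9, hnr=0.1875)` evaluates the second formulation, and passing `mrmr` as well evaluates the fourth.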

3.2. Representations of Solutions

In this paper, three representations are proposed to encode the solutions of the μGA-DNN algorithm. This problem is modeled as a search task in a binary space, with each solution encoding the learning rate of the training algorithm, the number of input neurons, and the number of hidden neurons in the network.
Figure 3 shows the network topology used as a reference, where the complexity is to be reduced. The architecture consists of 2048 neurons in the input layer and two hidden layers of 512 neurons each. The number of neurons in the input layer is due to the use of the pre-trained model, ResNet50 [30], which automatically extracts a total of 2048 descriptors from the dataset. The maximum number of neurons in the hidden layers was chosen according to one of the reference architectures [13].
Each of the proposed representations is used with at least two of the objective functions detailed in Section 3.1.

3.2.1. First Representation: μGA-1

In the first representation, the actual value of the learning rate and the number of neurons in both hidden layers are encoded. It should be noted that in this design, the encoding of the neurons in the input layer is not considered, so the FS task is not performed.
Therefore, the chromosome used in the μGA is a binary vector consisting of three blocks: the first block encodes the number of neurons in the first hidden layer, the second block encodes the number of neurons in the second hidden layer, and the third block encodes the learning rate used in the DNN model training algorithm.
Decoding is carried out using a linear mapping rule [36], as in all the following representations, where the desired precision $p_r$, together with the maximum $x^{(U)}$ and minimum $x^{(L)}$ values of the variable, determines the number of bits $n_b$ needed for each variable. Table 2 lists these values.
Figure 4 shows an example of a chromosome that uses the μGA-1 representation to encode a binary vector $q = [q_1, \ldots, q_{35}]$ with $q_i \in \{0, 1\}$ for $i = 1, \ldots, 35$. This representation generates a search space with $2^{35}$ possible solutions.
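The linear mapping rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `bits_needed` and `decode_block` are hypothetical, and the bit-count formula is the common textbook form, which may differ in detail from the one in [36]. With the bounds 25 and 512 (the constraint on $L_i$ and the reference architecture) and an assumed precision of 1, the sketch yields nine bits per hidden layer.

```python
import math


def bits_needed(x_min: float, x_max: float, precision: float) -> int:
    """Smallest bit count whose grid over [x_min, x_max] is at least as fine as `precision`."""
    levels = (x_max - x_min) / precision + 1
    return max(1, math.ceil(math.log2(levels)))


def decode_block(bits: list[int], x_min: float, x_max: float) -> float:
    """Map a binary block linearly onto the interval [x_min, x_max]."""
    integer = int("".join(str(b) for b in bits), 2)  # binary block -> integer
    return x_min + (x_max - x_min) * integer / (2 ** len(bits) - 1)
```

A decoded neuron count would still need to be rounded to an integer before building the network, since the mapping returns a real value.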

3.2.2. Second Representation: μGA-FS

In the second representation, the number of hidden neurons, the continuous value of the learning rate, and a binary string representing a subset of features are explicitly encoded, indicating that the optimization process includes the FS task.
Thus, the chromosome used in the μGA is a binary vector consisting of four blocks. The first two blocks encode the number of neurons in the two hidden layers, the third block encodes the value of the learning rate, and the fourth block explicitly encodes the selected features. The aim is to reduce the complexity of the network architecture in the input layer as well.
As a result, one bit represents the possible selection of each of the 2048 features extracted by the pre-trained ResNet50 model, thus adding 2048 binary dimensions to the problem. Therefore, in the last 2048 bits of the chromosome, a value of 1 at a given position means that the corresponding feature is selected, while a value of 0 means that it is discarded.
Table 3 shows the search space bounds for the blocks in this representation, along with the resulting number of bits, the minimum and maximum values, and the desired precision.
Figure 5 shows an example of a chromosome that uses the μGA-FS representation to encode a binary vector $q = [q_1, \ldots, q_{2083}]$. Thus, this representation has a search space of $2^{2083}$ solutions, which is considerably larger than the search space of μGA-1.
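The role of the feature block can be illustrated with a short Python sketch. The function name `select_features` and the list-of-rows data layout are assumptions made for the example; the paper does not specify how the mask is applied internally.

```python
def select_features(rows, mask):
    """Keep only the columns whose mask bit is 1.

    `rows` is a list of feature vectors; `mask` is the feature block of the
    chromosome (one bit per extracted feature).
    """
    keep = [j for j, bit in enumerate(mask) if bit == 1]
    return [[row[j] for j in keep] for row in rows]
```

The number of ones in the mask then determines the number of input neurons of the DNN built for that individual.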

3.2.3. Third Representation: μGA-MRMR

In the third representation, the number of hidden neurons, the continuous value of the learning rate, and the number of predictive variables selected from a feature ordering based on the MRMR criterion are encoded.
Therefore, the chromosome used in the μGA is a binary vector with four blocks. The first two blocks encode the number of neurons in the two hidden layers of the network, the third block encodes the learning rate value, and the fourth block encodes an integer value indicating the number of selected features.
Table 4 shows the search space bounds for the blocks in this representation, as well as the resulting number of bits, the minimum and maximum values, and the desired precision.
Figure 6 shows an example of a chromosome using the μGA-MRMR representation to encode a binary vector $q = [q_1, \ldots, q_{46}]$. Thus, this representation has a search space of $2^{46}$ solutions, which is larger than the search space of μGA-1. However, it is much smaller than the search space of the μGA-FS representation, even though μGA-MRMR also addresses the FS task with the help of an additional process (the MRMR-based feature ordering).
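In contrast to a per-feature mask, the ranking-based representation only needs the decoded count of features. The following Python sketch assumes that a ranking of feature indices (best first) has already been computed with the MRMR criterion; the MRMR computation itself is not shown, and the names `select_top_ranked`, `ranking`, and `b` are illustrative.

```python
def select_top_ranked(rows, ranking, b):
    """Keep the first `b` features of a precomputed ranking (best first)."""
    keep = ranking[:b]
    return [[row[j] for j in keep] for row in rows]
```

Because only the integer `b` is evolved, the chromosome grows by a handful of bits rather than by one bit per feature, which is the source of the much smaller search space.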

3.3. Naming Variants of μGA-DNN

From the different representations and objective functions, different variants are obtained. To name them, we use a subscript and a superscript. The variants of the proposed method that use the selection rule described in Section 3.1.1 are indicated with the subscript $S$; those that do not use this rule have no subscript.
The superscript refers to the FS task and takes the values W, H, and R. These indicate, respectively, that the FS task is addressed with a Wrapper (W) approach, which minimizes the classification error; a Hybrid (H) approach, based on the MRMR feature selection criterion and the classification error; or a Ranking (R) approach, which selects features from a ranking based on the MRMR criterion. Hence, the variants are $F$ and $F_S$ (these two do not perform the FS task), $F^W$, $F_S^W$, $F^H$, $F_S^H$, $F^R$, and $F_S^R$. Table 5 shows the relationship between the three representations and the four proposed objective functions, whose combination generates the eight variants of μGA-DNN.

3.4. Pseudocode of the Proposed Algorithm

Algorithm 1 shows the pseudocode of the proposed method. Four individuals are used in the population, each encoding the parameters of a DNN architecture and the learning rate of the algorithm for training the synaptic weights of the network. As input data, it receives the crossover probability, the number of generations, and the training and validation datasets necessary for the evaluation of the individual using a 5-fold cross-validation scheme.
Algorithm 1 μGA-DNN
Require: Crossover probability ($p_c$), number of generations ($g_{max}$), training data ($Z_{train}$) and validation data ($Z_{val}$), search range of the learning rate, search range of the number of nodes in the hidden layers
Ensure: Parameters of the best DNN individual $\lambda_{best}$ and its fitness value $f(\lambda_{best})$
 1: Initialize population of individuals randomly: $Q(0) = \{q_{0,0}, \ldots, q_{3,0}\}$
 2: Evaluate fitness of $Q(0)$: $\{f(q_{0,0}), \ldots, f(q_{3,0})\}$
 3: Identify the best individual of the current generation: $q_{best,0}$
 4: for $g = 1$ to $g_{max}$ do
 5:    Select $Q'$ from $Q(g-1)$ with binary tournament and criterion $S$ // Section 3.1.1
 6:    Apply two-point crossover with probability $p_c$ to $Q'$
 7:    Evaluate fitness of $Q'$: $\{f(q_{0,g}), \ldots, f(q_{3,g})\}$
 8:    Apply elitism strategy considering the criterion $S$ // Section 3.1.1
 9:    Compute homogeneity of $Q'$ with Hamming distance
10:    if $Q'$ is homogeneous then
11:       Reset population: $Q' = \{q_{0,g}, \ldots, q_{3,g}\}$
12:       Evaluate fitness of $Q'$: $\{f(q_{0,g}), \ldots, f(q_{3,g})\}$
13:       Apply elitism strategy considering the criterion $S$ // Section 3.1.1
14:    end if
15:    Get new population: $Q(g) \leftarrow Q'$
16:    Identify the best individual of the current generation: $q_{best,g}$
17: end for
18: Decode the best solution obtained $q_{best}$ to obtain $\lambda_{best}$
19: return $\lambda_{best}$ and $f(\lambda_{best})$
Additionally, μGA-DNN uses the selection rule $S$ described in Section 3.1.1 both to perform parent selection ($Q'$) and to apply the elitism operator. This is important because it favors DNN architectures with fewer trainable parameters.
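The control flow of Algorithm 1 can be sketched in Python as shown below. This is a simplified illustration under several assumptions, not the authors' implementation: the selection rule $S$ of Section 3.1.1 is replaced by a plain fitness comparison, the homogeneity test uses a hypothetical tolerance `tol` (maximum Hamming distance to the best individual), the restart keeps the best individual and re-randomizes the rest, and the fitness function is supplied by the caller instead of training a DNN.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def random_individual(n_bits, rng):
    return [rng.randint(0, 1) for _ in range(n_bits)]


def two_point_crossover(p1, p2, rng):
    """Swap the segment between two random cut points."""
    i, j = sorted(rng.sample(range(1, len(p1)), 2))
    return p1[:i] + p2[i:j] + p1[j:], p2[:i] + p1[i:j] + p2[j:]


def micro_ga(fitness, n_bits, generations, p_c=0.9, pop_size=4, tol=0, seed=0):
    rng = random.Random(seed)
    pop = [random_individual(n_bits, rng) for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Binary tournament: the fitter of two random individuals becomes a parent.
        parents = [max(rng.sample(pop, 2), key=fitness) for _ in range(pop_size)]
        offspring = []
        for a, b in zip(parents[0::2], parents[1::2]):
            if rng.random() < p_c:
                a, b = two_point_crossover(a, b, rng)
            offspring += [a[:], b[:]]
        # Elitism: the best individual found so far replaces the worst offspring.
        worst = min(range(pop_size), key=lambda k: fitness(offspring[k]))
        offspring[worst] = best[:]
        best = max(offspring, key=fitness)
        # Restart when every individual is within `tol` bits of the best one.
        if all(hamming(ind, best) <= tol for ind in offspring):
            offspring = [best[:]] + [random_individual(n_bits, rng)
                                     for _ in range(pop_size - 1)]
            best = max(offspring, key=fitness)
        pop = offspring
    return best, fitness(best)
```

For instance, `micro_ga(sum, n_bits=35, generations=60)` maximizes the number of ones in a 35-bit string. The sketch calls `fitness` repeatedly on the same individuals for brevity; when each evaluation involves training a network, the values would need to be cached.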
Algorithm 2 presents the evaluation process of a solution in the objective function. The different representations to build the architecture of a DNN (Section 3.2) are decoded in lines 2, 6 and 10. Likewise, lines 22, 24, 26 and 28 show the computation of the different objective functions described in Section 3.1.
Algorithm 2 Evaluating a solution $q$
Require: Binary individual $q = \{q_1, \ldots, q_B\}$, training data ($Z_{train}$) and validation data ($Z_{val}$) split into $k = 5$ folds, parameters of the reference DNN architecture ($L_1$ and $L_2$), DNN training algorithm configuration (optimizer, number of epochs and batch size), objective function weights vector ($w$), variant of the proposed method ($v$).
Ensure: Individual’s fitness: $f$
 1: if $v \in \{F_S, F\}$ then
 2:    Decode $q$ to obtain $\lambda = [L_1, L_2, lr]$
 3:    Build a DNN architecture from $L_1$ and $L_2$
 4:    $X_{train} = Z_{train}$, $X_{val} = Z_{val}$ // No feature selection is performed
 5: else if $v \in \{F_S^W, F^W, F_S^H, F^H\}$ then
 6:    Decode $q$ to obtain $\lambda = [L_1, L_2, lr, \eta]$
 7:    Build a DNN architecture from $L_1$, $L_2$ and $\eta$
 8:    Select the features of $Z_{train}$ and $Z_{val}$ with $\eta$ to obtain $X_{train}$ and $X_{val}$
 9: else if $v \in \{F_S^R, F^R\}$ then
10:    Decode $q$ to obtain $\lambda = [L_1, L_2, lr, b]$
11:    Build a DNN architecture from $L_1$, $L_2$ and $b$
12:    Select the features of $Z_{train}$ and $Z_{val}$ with $b$ to obtain $X_{train}$ and $X_{val}$
13: end if
14: for $i = 1$ to $k$ do
15:    Train the DNN model of the obtained architecture with $X_{train,i}$
16:    Validate the trained DNN model with $X_{val,i}$: $\mathrm{ACC}_{val}^{(i)}$
17: end for
18: $\mathrm{ACC} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{ACC}_{val}^{(i)}$
19: Compute HNR
20: Compute MRMR
21: if $v \in \{F_S, F_S^W, F_S^R\}$ then
22:    $f = \mathrm{ACC}$
23: else if $v \in \{F, F^W, F^R\}$ then
24:    $f = w_1 \cdot \mathrm{ACC} + w_2 \cdot (1 - \mathrm{HNR})$
25: else if $v = F_S^H$ then
26:    $f = w_1 \cdot \mathrm{ACC} + w_3 \cdot \mathrm{MRMR}$
27: else if $v = F^H$ then
28:    $f = w_1 \cdot \mathrm{ACC} + w_2 \cdot (1 - \mathrm{HNR}) + w_3 \cdot \mathrm{MRMR}$
29: end if
30: return $f$
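The final steps of Algorithm 2 (averaging the fold accuracies and choosing the objective by variant) can be sketched in Python as follows. The sketch assumes that the per-fold accuracies, HNR, and MRMR have already been computed and lie in $[0, 1]$, that the complexity term enters as $(1 - \mathrm{HNR})$ as in the objective $f_4$, and that the weights are equal in every combined objective. The variant labels are plain strings chosen for this example.

```python
from statistics import mean

ACC_ONLY = {"F_S", "F_S^W", "F_S^R"}   # variants whose fitness is accuracy alone
ACC_HNR = {"F", "F^W", "F^R"}          # variants that also reward fewer hidden neurons


def fitness(variant, fold_accuracies, hnr, mrmr):
    """Average the per-fold accuracies, then scalarize according to the variant."""
    acc = mean(fold_accuracies)
    if variant in ACC_ONLY:
        return acc
    if variant in ACC_HNR:
        return 0.5 * acc + 0.5 * (1.0 - hnr)
    if variant == "F_S^H":
        return 0.5 * acc + 0.5 * mrmr
    if variant == "F^H":
        return (acc + (1.0 - hnr) + mrmr) / 3.0
    raise ValueError(f"unknown variant: {variant}")
```

Keeping the dispatch in one place makes it easy to confirm that the variants with subscript $S$ never see HNR in their fitness; for those variants, architectural compactness is handled by the selection rule instead.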

4. Experimental Framework

This section describes the experiments performed, in which the performances of the eight variants of the proposed algorithm are compared against two reference approaches.

4.1. Datasets

We employed twelve datasets used to solve image classification problems. These datasets cover different domains and problem types, from synthetic data to X-ray images. These data were collected from various sources (some have been previously used in the literature). Table 6 shows the characteristics of the datasets.
In this work, we used the TL paradigm to import the pre-trained CNN model ResNet50 [30] to automatically extract features from image datasets. The source domain of the pre-trained model is the ImageNet dataset, which contains fourteen million images [32].

4.2. Reference Methods

This section presents the two benchmark approaches that were compared with μGA-DNN: (1) a state-of-the-art method called EvoPruneDeepTL [13] and (2) a conventional DNN architecture.

4.2.1. EvoPruneDeepTL Algorithm

EvoPruneDeepTL consists of a steady-state GA that aims to find the subset of features that maximizes the classification accuracy of a supervised learning model. The model is an ANN whose architecture consists of an input layer with $d = 2048$ neurons, $m = 1$ hidden FC layer with $L_1 = 512$ neurons, and an output layer with $c$ neurons. Additionally, this method uses the ResNet50 architecture to extract features from the datasets.
The binary representation of the solutions in the EvoPruneDeepTL algorithm allows exploring the search space for the features to be used in the ANN model. By using a binary string of length $d = 2048$, up to $2^{2048}$ combinations of features can be represented.
The objective function maximizes the classification performance of the model in terms of accuracy (ACC).
Table 7 presents the parameters used in the EvoPruneDeepTL algorithm for experimentation in this work. These parameters were taken from the original proposal [13]. However, the number of epochs used here is lower than that used by the original authors (600 epochs), both because of computational time limitations and to match the number used by our proposal. EvoPruneDeepTL is denoted as $E_S$ in the experiments.

4.2.2. DNN Architecture

A DNN architecture based on the network topology shown in Figure 3 was also adopted as a reference method for comparison. It has $d = 2048$ neurons in the input layer, $m = 2$ hidden layers (with $L_1 = L_2 = 512$), and $c$ neurons in the output layer. This model was trained with a version of the backpropagation algorithm based on the Adam [49] optimizer, using a batch size of 32 and a total of 100 epochs. This network architecture was chosen because it is one of the architectures used in the previous reference work [13].

4.3. Parameters of the Proposed Algorithm

Table 8 summarizes the parameters used by the eight variants of the proposed μGA-DNN in the experimentation stage. These variants are described in detail in Section 3.3. Note that the number of epochs is the same as the number used by EvoPruneDeepTL.

4.4. Resampling Method

A twice-repeated five-fold cross-validation method was employed to obtain a more accurate assessment of the performance of the proposed algorithms. This approach splits the data into five subsets, uses each subset once for validation while training on the remaining four, and repeats the whole process twice, resulting in a total of ten independent experiments. Cross-validation is a commonly used technique to evaluate the performance of machine learning models, as it provides more reliable estimates of model performance on unseen data.
Using this validation method reduces the influence of chance introduced by splitting the data, thus providing a more robust assessment of the performance of the proposed algorithms [50].
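The resampling scheme can be sketched with the Python standard library as follows. The function name `repeated_kfold`, the seed handling, and the absence of stratification are assumptions made for the example; the paper does not state whether the folds were stratified by class.

```python
import random


def repeated_kfold(n_samples, k=5, repeats=2, seed=0):
    """Yield (train_indices, val_indices) for each fold of each repetition."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)                       # new random partition per repetition
        folds = [order[i::k] for i in range(k)]  # k nearly equal-sized folds
        for i in range(k):
            val = folds[i]
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, val
```

With `k=5` and `repeats=2`, the generator produces the ten train/validation splits that correspond to the ten experiments per dataset.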

5. Results and Comparisons

The presentation of the results is divided into two parts:
  • Results of the variants of the μGA-DNN method and the EvoPruneDeepTL algorithm. In this first part, the experimental results of all the variants of the proposed method ($F_S$, $F_S^W$, $F_S^H$, $F_S^R$, $F$, $F^W$, $F^H$ and $F^R$) are compared with the experimental results of $E_S$.
  • Results of the best variant of each group of the proposed method. The variants are divided into two groups: algorithms that employ the selection rule ($F_S$, $F_S^W$, $F_S^H$, and $F_S^R$) and those that do not ($F$, $F^W$, $F^H$, and $F^R$). These results are compared to those obtained by the model that was trained from the reference DNN based on the architecture in Figure 3.

5.1. Results of μGA-DNN Variants and EvoPruneDeepTL

The results are presented and analyzed for several performance measures.

5.1.1. Classification Accuracy

Table 9 shows the results for the ACC indicator of the variants of the proposed method and the reference method $E_S$ for each of the datasets used. Additionally, it presents some statistics, namely the mean, standard deviation (STD), median, median absolute deviation (MAD), maximum, minimum, and the count of the highest values obtained by each algorithm. The results indicate that the variants of the proposed algorithm based on the selection rule ($S$) achieved a better performance than their counterparts that do not use the rule $S$. For example, the $F_S^R$ method achieved the highest average accuracy among the proposed techniques that use $S$ (ACC = 0.814), while its counterpart $F^R$ achieved a slightly lower value (ACC = 0.804), which is the highest value among the techniques that do not use $S$. Even though the reference method $E_S$ obtained the best value on more datasets, it achieved an average classification accuracy similar to that of the variant $F_S^R$ (ACC = 0.814).
Figure 7 shows the distribution of ACC results for each of the compared algorithms using boxplots. The mean value and the p-value of the corresponding Wilcoxon rank sum test ($\alpha = 0.05$) are printed at the top of each plot. In this case, all variants of the proposed algorithm obtained $p > 0.05$. Therefore, there is no statistically significant difference between any variant of μGA-DNN and the $E_S$ method in terms of ACC.
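For readers who wish to reproduce this kind of comparison, the following Python sketch implements a two-sided Wilcoxon rank sum test with the normal approximation, using only the standard library. It is an illustration under assumptions: the paper does not state which implementation or which variant of the test (exact or approximate) was used, and the function name `rank_sum_test` is hypothetical. Tied values receive average ranks, and no tie correction is applied to the variance.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation). Returns (z, p)."""
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        for t in range(i, j):
            ranks[t] = (i + j + 1) / 2.0  # average 1-based rank of the tied run
        i = j
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of x
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    return z, math.erfc(abs(z) / math.sqrt(2.0))
```

Here `x` and `y` would hold the per-dataset accuracies of two methods; a returned p-value above 0.05 corresponds to the "no statistically significant difference" conclusion. SciPy provides a ready-made routine for the same test, `scipy.stats.ranksums`.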

5.1.2. Feature Selected Ratio

This experiment reports the results of the FSR measure, which is described below.
Definition 5 (Feature Selected Ratio (FSR)).
The percentage of features selected by the evaluated methods. In the input layer, each feature is a neuron, so it also represents the percentage of input neurons.
Table 10 presents the FSR results for the variants of the proposed method and the reference method $E_S$ on each of the datasets used, along with the corresponding statistics.
The results show that the proposed variants that rank the features of each dataset according to the MRMR criterion, i.e., $F_S^R$ and $F^R$, achieved the lowest FSR. In particular, $F^R$ achieved the lowest ratio (FSR = 0.443). On the other hand, the $E_S$ method had a higher ratio than all the proposed variants that perform the feature-selection task, with FSR = 0.659.
Figure 8 shows the results for the FSR indicator. The top part shows the p-value of the Wilcoxon rank sum test ($\alpha = 0.05$) used to evaluate whether $E_S$ and the other variants of the proposed method differ significantly from $F^R$ (the variant that obtained the best results in terms of the FSR indicator). $E_S$ and most of the variants of the proposed method present a statistically significant difference with respect to $F^R$ ($p < 0.05$), with $F_S^R$ ($p = 0.46$) being the only one that does not. This is because both $F^R$ and $F_S^R$ use the same method to perform the FS task, i.e., an ordering based on the MRMR criterion. Furthermore, all the variants of μGA-DNN that conduct the FS task obtained better results than the reference method $E_S$ in terms of the FSR indicator.
These results indicate that the variants of the proposed method are more effective than $E_S$ in reducing the number of selected features.

5.1.3. Hidden Neurons Ratio

Table 11 presents the results of the HNR metric. The results indicate that the four variants of the proposed method that do not use the $S$ rule achieved a lower average HNR than their counterparts that do use the rule. The $F^R$ variant achieved the best performance in this metric (HNR = 0.097).
These lower values of HNR for the variants without $S$ arise because HNR is included in the objective functions employed by these methods, as shown in Equations (2) and (4). In these functions, equal weight is given to the classification performance in terms of ACC and to the architectural complexity of the hidden layers of the DNN in terms of HNR.
Figure 9 presents the results for the HNR metric. The top section shows the p-value of the Wilcoxon rank sum test ($\alpha = 0.05$) used to assess whether each method differs significantly from $F^R$ (the variant that performed best in terms of the HNR indicator).
The $E_S$ method keeps all neurons in the hidden layer, as it only performs the FS task. On the other hand, the variants of the proposed algorithm that employ the rule $S$ achieved a significant reduction in the number of hidden neurons, even though the objective functions used by these variants do not include the HNR indicator in their formulation. $F_S^W$ and $F_S^H$ achieved a reduction of more than 60% (HNR < 0.40), while $F_S$ and $F_S^R$ achieved a reduction of more than 70% (HNR = 0.298). The variants that do not use the rule $S$ achieved a reduction of more than 80%, with $F^R$ obtaining the best result (HNR = 0.097).
Finally, the $F^R$ variant presents a statistically significant difference compared to all other methods in the study, except for $F$ ($p = 0.31$). It should be noted that $F$ and $F^R$ optimize the objective function $f_2$ (Equation (2)) and use the μGA-1 and μGA-MRMR representations, respectively. These encoding methods use 35-bit and 46-bit binary vectors (Section 3.2.1 and Section 3.2.3). As a result, both $F$ and $F^R$ explore search spaces of comparable size. On the other hand, although $F^W$ and $F^H$ present a statistically significant difference with respect to $F^R$, they also achieved reduction ratios of over 80%. However, these methods use the μGA-FS representation, which employs a 2083-bit binary vector (Section 3.2.2), so the search space is much larger than that explored by $F^R$. As a result, the $F^W$ and $F^H$ variants are more susceptible to becoming trapped in local optima.

5.1.4. Model Complexity

Table 12 shows the results obtained in terms of the MC indicator for each variant and each dataset.
Definition 6 (Model Complexity (MC)).
The number of trainable parameters of the classification model. For a DNN classifier, the total number of synaptic weights in the network is taken into account. A lower value of MC implies a less complex model, which can help reduce training time.
Equation (5) describes how the model complexity is calculated for the variants of the proposed algorithm.
$$\mathrm{MC} = L_1(d + 1) + L_2(L_1 + 1) + c(L_2 + 1) \tag{5}$$
Again, $d$ is the number of neurons in the input layer, $L_1$ and $L_2$ are the number of neurons in the hidden layers, and $c$ is the number of neurons in the output layer. The model complexity of the ANN architecture obtained by EvoPruneDeepTL is calculated as follows:
$$\mathrm{MC} = L_1(d + 1) + c(L_1 + 1) \tag{6}$$
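Both complexity formulas translate directly into Python. In the sketch below, the function names are illustrative, the `+ 1` terms are interpreted as bias weights, and the value `c = 2` used in the usage note is a hypothetical number of classes rather than a value from the experiments.

```python
def mc_two_hidden(d, l1, l2, c):
    """Trainable parameters of a network with layers d -> l1 -> l2 -> c."""
    return l1 * (d + 1) + l2 * (l1 + 1) + c * (l2 + 1)


def mc_one_hidden(d, l1, c):
    """Trainable parameters of a network with layers d -> l1 -> c."""
    return l1 * (d + 1) + c * (l1 + 1)
```

For the full reference topology (2048 inputs and two hidden layers of 512 neurons) with two classes, `mc_two_hidden(2048, 512, 512, 2)` gives 1,312,770 parameters, while the single-hidden-layer network with all 2048 features, `mc_one_hidden(2048, 512, 2)`, gives 1,050,114. The first term dominates in both cases, which is why reducing the number of input features or first-layer neurons has the largest effect on MC.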
The results indicate that the variants of the proposed algorithm that do not use $S$ achieved a greater reduction in MC: $F$ (MC = $1.1 \times 10^5$), $F^W$ (MC = $8.5 \times 10^4$), $F^H$ (MC = $9.3 \times 10^4$), and $F^R$ (MC = $4.5 \times 10^4$), the latter having the lowest average MC and the lowest value in 11 out of 12 datasets.
Figure 10 summarizes the results described above, showing that the variants using $S$ obtained a higher MC value, with $E_S$ obtaining an even higher value than all variants of the proposed algorithm.

5.1.5. Number of Objective Function Evaluations

Table 13 shows the results for the number of evaluations performed by each variant. The mean results indicate that the variants that employ the μGA-FS representation performed fewer evaluations of their respective objective functions (No. Eval. < 249.0). Notably, $F_S^H$ used the lowest number of evaluations on nine out of twelve datasets and therefore obtained a lower mean value than its counterparts (No. Eval. = 244.2). Conversely, $F^R$ obtained the highest number of evaluations (No. Eval. = 252.0), a difference of only about eight evaluations.
Figure 11 illustrates the notable difference in the number of evaluations performed by the variants of the proposed method and the $E_S$ method. Here, the number of evaluations of $E_S$ was predefined (No. Eval. = 300), so this was the expected result.

5.1.6. Hamming Distance Results

Table 14 shows the results for the similarity of the selected feature vectors in terms of the Hamming distance. Results for $F_S$ and $F$ are excluded since the same feature vectors were used in all experiments. The results indicate that the variants using the μGA-FS representation generally obtained a Hamming distance of 0.500 for $F_S^W$, $F_S^H$ and $F^H$ and a Hamming distance of 0.499 for $F^W$. Variants using the μGA-MRMR representation obtained a smaller Hamming distance: 0.288 for $F^R$ and 0.282 for $F_S^R$. The algorithm $E_S$ obtained a Hamming distance of 0.391, lower than the variants with the μGA-FS representation but higher than the variants with the μGA-MRMR representation. Therefore, $F_S^R$ was the variant with the lowest Hamming distance among all the methods. This is because $F_S^R$ orders the features and then optimizes only the number of features to select, so many features are shared between runs, resulting in a lower Hamming distance.
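A normalized Hamming distance between feature masks can be computed as in the following Python sketch. The aggregation over runs (here, the mean over all pairs of masks) is an assumption, since this section does not spell out how the per-dataset value was obtained; the function names are illustrative.

```python
from itertools import combinations


def normalized_hamming(a, b):
    """Fraction of positions at which two equal-length binary masks differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_pairwise_hamming(masks):
    """Average normalized Hamming distance over all pairs of masks."""
    pairs = list(combinations(masks, 2))
    return sum(normalized_hamming(a, b) for a, b in pairs) / len(pairs)
```

As a point of reference, two independent masks whose bits are each 1 with probability 0.5 differ in half of their positions on average, so a value close to 0.500 is what unrelated selections would produce.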

5.1.7. Runtime Results

Table 15 shows the average runtime (in seconds) of ten experiments corresponding to each dataset for each variant of the proposed method and the reference method $E_S$. Figure 12 also shows the average runtime for each variant of the proposed algorithm and the algorithm $E_S$.
It should be noted that these results are informational only, as the experiments were performed on different hardware, preventing a reliable comparison of the efficiency of these algorithms.

5.2. Comparative Results with the Full Reference Model

This section presents the comparison between the full reference model (DNN) and the proposed algorithm variants $F_S^R$ and $F^R$, which performed best in terms of ACC (see Section 5.1.1). Table 16 shows the ACC results for the $F_S^R$ and $F^R$ variants and the DNN model for each dataset. The summary of results includes the mean, STD, median, MAD, maximum, minimum, and the count of highest values obtained by each algorithm. The results indicate that DNN obtained a better performance (ACC = 0.821) than the proposed variants $F_S^R$ (ACC = 0.814) and $F^R$ (ACC = 0.804). However, Figure 13 shows that there is no statistically significant difference between the proposed variants and DNN according to the p-value obtained with the Wilcoxon rank sum test ($\alpha = 0.05$).
The experiment confirms that both $F_S^R$ and $F^R$ can achieve performance remarkably similar to that of the full reference DNN model in terms of ACC, using a reduced number of features and neurons in the hidden layers (resulting in a lower value of MC).
It is important to mention that minimizing the MC indicator aims to reduce the computational effort of the training algorithm used to tune neuron weights associated with the FC layers. Consequently, the training and classification times of the network will be decreased, since the number of basic mathematical operations will be reduced by having a network architecture that is composed of a smaller number of input and hidden neurons.

6. Conclusions

This paper presents μGA-DNN, an evolutionary optimization method that uses a μGA to tune the architecture of a DNN and the learning rate of the model training algorithm.
The general framework of the proposed approach uses TL. This consists of using pre-tuned parameters of a CNN model trained on a source domain to automatically extract features from a different problem, such as image datasets from different domains. Then, the dataset from the target domain is used to tune the number of neurons in the input layer (FS task) and hidden layers of the DNN architecture, using a GA with a population with few individuals. Finally, the model is trained with the obtained architecture, and its performance is evaluated on a test dataset.
Four objective functions and three different representations of the solution in the binary domain are proposed, resulting in eight variants of the proposed algorithm.
In the first scenario (Section 5.1), the proposed method was compared with EvoPruneDeepTL ($E_S$), using four criteria: (1) classification accuracy (ACC), (2) feature selected ratio (FSR), (3) hidden neurons ratio (HNR) and (4) model complexity (MC). Additional analysis includes the number of evaluations of the objective function and the value of the Hamming distance of the feature vectors used by each method. Runtime measurements (in seconds) of the different variants of the proposed algorithm and the $E_S$ method are provided for informational purposes only due to the variety of hardware configurations used in the experimentation stage.
The results showed that the variants of the proposed method did not present a statistically significant difference with respect to $E_S$ in terms of the ACC indicator. Moreover, the results regarding the second criterion (FSR) showed that the variants of μGA-DNN that perform the FS task outperformed the reference method $E_S$ in terms of this indicator. Statistical significance was therefore assessed with respect to $F^R$ (the best variant of the proposed algorithm regarding this criterion), and the results showed that only $F_S^R$ did not present a statistically significant difference. Notably, both $F_S^R$ and $F^R$ use the same method to perform the FS task (an ordering based on the MRMR criterion).
In terms of the HNR indicator, the variant $F^R$ performed best. Furthermore, the results of the statistical test indicated that only the $F$ variant did not obtain a statistically significant difference with respect to $F^R$. It is important to note that $F$ and $F^R$ explore search spaces of comparable size, since they use binary representations with a similar number of bits and the same objective function.
On the other hand, the results in terms of the MC indicator showed that the variants that do not use the selection rule ($S$) obtained the best performance. This is because these methods explicitly reduce the number of neurons in the hidden layers by including the HNR indicator in their respective objective functions. Thus, a considerable reduction in the model architecture was achieved.
Furthermore, a comparison was made regarding the number of evaluations of the objective function. Even though the proposed method used fewer evaluations than the reference method, the μGA-DNN variants achieved competitive results, even surpassing $E_S$ on some occasions in terms of the performance measures mentioned above.
Additionally, the Hamming distance between the binary vectors indicating the features selected by the methods that perform the FS task was evaluated. The results showed that the variants using the μGA-MRMR representation, $F^R$ and $F_S^R$, obtained the lowest distances, so they are the most robust in terms of the repeatability of the selected features.
In the second scenario (Section 5.2), a comparison was made in terms of the ACC indicator between the best variants of the proposed algorithm ($F^R$ and $F_S^R$) and the reference DNN model. The results showed that the compared methods obtained similar values and that there is no statistically significant difference between the reference DNN model and these variants. Overall, considering all the experiments, the recommended variants are $F^R$ and $F_S^R$, in that order.
Thus, the experiments in this work showed that no variant of μGA-DNN presented a statistically significant difference with respect to $E_S$ in terms of the ACC indicator, and that the best variants did not differ significantly from the reference DNN model. Furthermore, the proposed method is computationally efficient, since all variants managed to reduce the number of neurons in the FC layers. This reduces the complexity of the network architecture (reflected in a lower MC indicator), which implies a lower number of operations within the DNN model.

Limitations and Future Work

The proposed algorithm and reference methods were evaluated on datasets of classification problems with variables in the domain of real numbers. Generalization to other types of variables (e.g., categorical) or other types of problems (e.g., regression) is not straightforward. An extension to other domains is initially possible; however, more experimentation is needed.
Therefore, it is essential to consider the scope of the study when applying the proposed algorithm to different contexts and problems.
Additionally, optimizing FC layers only was a design decision in this work because the optimization of convolutional layers entails a much higher computational cost. While the results obtained are promising, the improvements in terms of accuracy and complexity when optimizing convolutional layers are a matter of further studies. Future research should examine whether possible improvements on some metrics justify the added computational cost.

Author Contributions

Conceptualization, G.T.; methodology, R.L. and G.T.; software, D.T.-A.; validation, R.L. and D.T.-A.; formal analysis, D.T.-A.; investigation, R.L. and D.T.-A.; resources, R.L. and G.T.; data curation, D.T.-A.; writing—original draft preparation, R.L.; writing—review and editing, R.L.; visualization, D.T.-A.; supervision, R.L. and G.T.; project administration, G.T.; funding acquisition, G.T. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by The Catholic University of America. The second author acknowledges support from Conahcyt to pursue graduate studies at the Tamaulipas Campus of Cinvestav.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ANN: Artificial neural network
CNN: Convolutional neural network
FC: Fully connected
FS: Feature selection
GA: Genetic algorithm
TL: Transfer learning

References

  1. Samuel, A.L. Machine learning. Technol. Rev. 1959, 62, 42–45.
  2. Bengio, Y. Deep Learning; Adaptive Computation and Machine Learning Series; MIT Press: London, UK, 2016.
  3. Pouyanfar, S.; Sadiq, S.; Yan, Y.; Tian, H.; Tao, Y.; Reyes, M.P.; Shyu, M.L.; Chen, S.C.; Iyengar, S.S. A survey on deep learning: Algorithms, techniques, and applications. ACM Comput. Surv. 2019, 51, 1–36.
  4. Pathak, A.R.; Pandey, M.; Rautaray, S. Application of deep learning for object detection. Procedia Comput. Sci. 2018, 132, 1706–1717.
  5. Liu, X.; Deng, Z.; Yang, Y. Recent progress in semantic image segmentation. Artif. Intell. Rev. 2019, 52, 1089–1106.
  6. Yadav, S.S.; Jadhav, S.M. Deep convolutional neural network based medical image classification for disease diagnosis. J. Big Data 2019, 6, 113.
  7. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010.
  8. Wen, Y.W.; Peng, S.H.; Ting, C.K. Two-stage evolutionary neural architecture search for transfer learning. IEEE Trans. Evol. Comput. 2021, 25, 928–940.
  9. Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A Comprehensive Survey on Transfer Learning. Proc. IEEE 2021, 109, 43–76.
  10. Aggarwal, C.C. Neural Networks and Deep Learning; Springer: Cham, Switzerland, 2018.
  11. Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y.; Alsaadi, F.E. A survey of deep neural network architectures and their applications. Neurocomputing 2017, 234, 11–26.
  12. Barraza, J.F.; Droguett, E.L.; Martins, M.R. Towards Interpretable Deep Learning: A Feature Selection Framework for Prognostics and Health Management Using Deep Neural Networks. Sensors 2021, 21, 5888.
  13. Poyatos, J.; Molina, D.; Martinez, A.D.; Del Ser, J.; Herrera, F. EvoPruneDeepTL: An evolutionary pruning model for transfer learning based deep neural networks. Neural Netw. 2022, 158, 59–82.
  14. Goldberg, D.E. Genetic Algorithms in Search, Optimization, and Machine Learning; Addison Wesley: Boston, MA, USA, 1989.
  15. Ledesma, S.; Cerda, G.; Aviña, G.; Hernández, D.; Torres, M. Feature Selection Using Artificial Neural Networks. In MICAI 2008: Advances in Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2008; pp. 351–359.
  16. Krishnakumar, K. Micro-Genetic Algorithms For Stationary And Non-Stationary Function Optimization. In Intelligent Control and Adaptive Systems; Rodriguez, G., Ed.; SPIE: Bellingham, WA, USA, 1990.
  17. Goldberg, D.E. Sizing populations for serial and parallel genetic algorithms. In Proceedings of the Third International Conference on Genetic Algorithms, San Francisco, CA, USA, 1 June 1989; pp. 70–79.
  18. Dubey, A.; Saxena, A. An evolutionary feature selection technique using polynomial neural network. Int. J. Comput. Sci. Issues 2011, 8, 494.
  19. Mohammed, T.A.; Alhayali, S.; Bayat, O.; Uçan, O.N. Feature Reduction Based on Hybrid Efficient Weighted Gene Genetic Algorithms with Artificial Neural Network for Machine Learning Problems in the Big Data. Sci. Program. 2018, 2018, 2691759.
  20. Üstün, O.; Bekiroğlu, E.; Önder, M. Design of highly effective multilayer feedforward neural network by using genetic algorithm. Expert Syst. 2020, 37, e12532.
  21. Luo, X.; Oyedele, L.O.; Ajayi, A.O.; Akinade, O.O.; Delgado, J.M.D.; Owolabi, H.A.; Ahmed, A. Genetic algorithm-determined deep feedforward neural network architecture for predicting electricity consumption in real buildings. Energy AI 2020, 2, 100015.
  22. Arroyo, J.C.T.; Delima, A.J.P. An Optimized Neural Network Using Genetic Algorithm for Cardiovascular Disease Prediction. J. Adv. Inf. Technol. 2022, 13, 95–99.
  23. Souza, F.; Matias, T.; Araújo, R. Co-evolutionary genetic Multilayer Perceptron for feature selection and model design. In Proceedings of the International Conference on Emerging Technologies and Factory Automation (ETFA2011), Toulouse, France, 5–9 September 2011; pp. 1–7.
  24. Pham, T.A.; Tran, V.Q.; Vu, H.L.T.; Ly, H.B. Design deep neural network architecture using a genetic algorithm for estimation of pile bearing capacity. PLoS ONE 2020, 15, e0243030.
  25. Baldominos, A.; Saez, Y.; Isasi, P. Hybridizing Evolutionary Computation and Deep Neural Networks: An Approach to Handwriting Recognition Using Committees and Transfer Learning. Complexity 2019, 2019, 2952304.
  26. Tian, H.; Chen, S.C.; Shyu, M.L. Genetic Algorithm Based Deep Learning Model Selection for Visual Data Classification. In Proceedings of the 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI), Los Angeles, CA, USA, 30 July–1 August 2019.
  27. de Lima Mendes, R.; da Silva Alves, A.H.; de Souza Gomes, M.; Bertarini, P.L.L.; do Amaral, L.R. Many Layer Transfer Learning Genetic Algorithm (MLTLGA): A New Evolutionary Transfer Learning Approach Applied To Pneumonia Classification. In Proceedings of the 2021 IEEE Congress on Evolutionary Computation (CEC), Krakow, Poland, 28 June–1 July 2021.
  28. Li, C.; Jiang, J.; Zhao, Y.; Li, R.; Wang, E.; Zhang, X.; Zhao, K. Genetic Algorithm based hyper-parameters optimization for transfer convolutional neural network. In Proceedings of the International Conference on Advanced Algorithms and Neural Networks (AANN 2022), Zhuhai, China, 25–27 February 2022.
  29. Bibi, R.; Mehmood, Z.; Munshi, A.; Yousaf, R.M.; Ahmed, S.S. Deep features optimization based on a transfer learning, genetic algorithm, and extreme learning machine for robust content-based image retrieval. PLoS ONE 2022, 17, e0274764.
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
  31. Casillas, E.S.M.; Osuna-Enciso, V. Architecture Optimization of Convolutional Neural Networks by Micro Genetic Algorithms. In Metaheuristics in Machine Learning: Theory and Applications; Springer International Publishing: Cham, Switzerland, 2021; pp. 149–167.
  32. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
  33. Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to Algorithms, 2nd ed.; MIT Press: London, UK, 2001.
  34. Deb, K. An efficient constraint handling method for genetic algorithms. Comput. Methods Appl. Mech. Eng. 2000, 186, 311–338.
  35. Ding, C.; Peng, H. Minimum Redundancy Feature Selection from Microarray Gene Expression Data. J. Bioinform. Comput. Biol. 2005, 3, 185–205.
  36. Deb, K. Optimization for Engineering Design: Algorithms and Examples; Prentice-Hall of India Private Limited: Delhi, India, 2012.
  37. github user: Yiweichen04. Cataract Dataset. 2016. Available online: https://github.com/yiweichen04/retina_dataset (accessed on 17 August 2023).
  38. kaggle user: Nitesh Yadav. Chessman Image Dataset. 2016. Available online: https://www.kaggle.com/datasets/niteshfre/chessman-image-dataset (accessed on 17 August 2023).
  39. kaggle user: Pranav Raikote. COVID-19 Image Dataset. 2020. Available online: https://www.kaggle.com/datasets/pranavraikokte/covid19-image-dataset (accessed on 17 August 2023).
  40. Team, T.T. Flowers. 2019. Available online: http://download.tensorflow.org/example_images/flower_photos.tgz (accessed on 17 August 2023).
  41. Rauf, H.T.; Saleem, B.A.; Lali, M.I.U.; Khan, M.A.; Sharif, M.; Bukhari, S.A.C. A Citrus Fruits and Leaves Dataset for Detection and Classification of Citrus Diseases through Machine Learning. 2019. Available online: https://data.mendeley.com/datasets/3f83gxmv57/2 (accessed on 17 August 2023).
  42. kaggle user: Muhammad Ahmad. MIT Indoor Scenes. 2019. Available online: https://www.kaggle.com/datasets/itsahmad/indoor-scenes-cvpr-2019 (accessed on 17 August 2023).
  43. Museum, V.R. Art Images: Drawing/Painting/Sculptures/Engravings. 2018. Available online: https://www.kaggle.com/datasets/thedownhill/art-images-drawings-painting-sculpture-engraving (accessed on 17 August 2023).
  44. Singh, D.; Jain, N.; Jain, P.; Kayal, P.; Kumawat, S.; Batra, N. PlantDoc: A Dataset for Visual Plant Disease Detection. 2020. Available online: https://github.com/pratikkayal/PlantDoc-Dataset (accessed on 17 August 2023).
  45. Moroney, L. Rock, Paper, Scissors Dataset. 2019. Available online: https://www.tensorflow.org/datasets/catalog/rock_paper_scissors?hl=es-419 (accessed on 17 August 2023).
  46. Collaboration, I.S.I. Skin Cancer: Malignant vs. Benign. 2019. Available online: https://www.kaggle.com/datasets/fanconic/skin-cancer-malignant-vs-benign (accessed on 17 August 2023).
  47. Gómez-Ríos, A.; Tabik, S.; Luengo, J.; Shihavuddin, A.; Herrera, F. Coral Species Identification with Texture or Structure Images Using a Two-Level Classifier Based on Convolutional Neural Networks. 2019. Available online: https://sci2s.ugr.es/CNN-coral-image-classification (accessed on 17 August 2023).
  48. Oluwafemi, A.G.; Zenghui, W. Multi-Class Weather Classification from Still Image Using Said Ensemble Method. 2019. Available online: https://www.kaggle.com/datasets/somesh24/multiclass-images-for-weather-classification (accessed on 17 August 2023).
  49. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014.
  50. Zhou, Z.H. Ensemble Methods: Foundations and Algorithms; CRC Press: Boca Raton, FL, USA, 2012.
Figure 1. General scheme of the proposed method. Top: CNN model trained on a source domain. Center: Transfer learning of the pre-trained model parameters and tuning of the model weights of a DNN (FC layers) using the μGA-DNN algorithm. Bottom: Schematic illustrating the operation of the proposed method.
Figure 2. Example of an FC layer architecture: The input layer is related to the d features automatically extracted by the convolutional layers of the CNN model. In this architecture, there are m hidden layers, where L_1, L_2, …, L_m indicate the number of neurons in each. Finally, the output layer provides a response (prediction) z_i, with i = 1, …, c, for each of the c classes of the input dataset.
Figure 3. Example of an FC layer architecture. The input layer is related to the features of the dataset. Thus, this scheme assumes that the problem has a dimensionality of d = 2048. Additionally, it has m = 2 hidden layers, each with 512 neurons, and the output layer has c neurons, where c is the number of classes of the problem.
Figure 4. Example of a chromosome using the binary representation μGA-1, which is composed of three binary blocks representing the first hidden layer (L_1), the second hidden layer (L_2), and the learning rate (lr).
Figure 5. Example of a chromosome that uses the binary representation μGA-FS, which is composed of 2051 binary blocks. The first two blocks consist of 9 bits each, the third of 17 bits, and the last 2048 blocks each consist of 1 bit.
Figure 6. Example of a chromosome using the binary representation μGA-MRMR, which is composed of four binary blocks.
Figure 7. Results of the ACC indicator obtained by each method are shown using boxplots. The best values are indicated in bold. The mean of ACC and the p-value of the Wilcoxon rank sum test are shown at the top of each plot.
Figure 8. Results of the mean FSR obtained by each method. The top part shows the p-value of the Wilcoxon rank sum test obtained by comparing each method with FR (the best variant of the proposed method in terms of this indicator). In bold, p < 0.05.
Figure 9. Results of the mean HNR obtained by each method. The top part shows the p-value of the Wilcoxon rank sum test obtained by comparing each method with FR (the best variant of the proposed method in terms of this indicator). In bold, p < 0.05.
Figure 10. Results of the average MC obtained by each method. The best values are indicated in bold.
Figure 11. Results of the average number of evaluations of the objective function obtained by each method. The best values are indicated in bold.
Figure 12. Results of the average runtime (in seconds) of each method.
Figure 13. Comparison of classification accuracy for the proposed models and the complete reference model.
Table 1. Comparison of methods. B. S. stands for batch size, Gens. means the maximum number of generations, Pop. means the number of individuals in population, FS indicates if the method performs the FS task, and HL indicates if the method reduces the hidden layer size. NS means not specified in the original source.
MethodANN ModelGA TypeFSHL
TL ModelB. S.EpochsOptimizerNameGens.Pop.
Ledesma et al. [15]NS500NSGA8100
Saxena et al. [18]NSNSNSGA2560
WGGA [19]NSNSNSGA8010
GNN [20]NS1000SGDGA100028
SGD, ADAM,
GA-DFNN [21]NS12,000NADAM,GA30, 4020
ADAMAX
RMSPROP,
ADAM, SGD,
GA-ANN [22]102425ADADELTA,GA2520
ADAMAX,
NADAM.
CEV-MLP [23]NSNSNSGA12020, 30, 200
Pham et al. [24]NSNSQuasi-Newton,
SGD, ADAM
GA20025
Baldominos et al. [25]NS25–2005, 30SGD, ADAMGA, GE10050
Tian and Shyu [26]Inception V3
ResNet50
MobileNet
DenseNet201
NSNSNSGENS10
MLTGA [27]Inception V3165, 50SGDGA520
EvoNAS-TL [8]VGG-162563SGDKGEA30, 500 130, 200 2
Li et al. [28]MobileNetV2NSNSNSGA1450
Bibi et al. [29]VGG-19NSNSSGDGANSNS
EvoPruneDeepTL [13]ResNet5032600SGDGA10303
μ GA-DNN
(Our proposed approach)
ResNet5032100ADAM μ GA504
¹ 30 for global search, 500 for local search.
² 30 for global search, 200 for local search.
³ EvoPruneDeepTL provides two variants, one for FS and one for HL; the FS variant performed better in the original source and is the one used here for comparison.
Table 2. Bounds of the search space for each parameter, its desired precision, and the resulting number of bits (μGA-1 representation).
| Binary Block | x^(L) | x^(U) | pr | n_b |
|---|---|---|---|---|
| Hidden layer 1 (L_1) | 1 | 512 | 0 | 9 |
| Hidden layer 2 (L_2) | 1 | 512 | 0 | 9 |
| Learning rate (lr) | 1 × 10^-6 | 1 × 10^-1 | 6 | 17 |
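The bit counts in the table above can be reproduced from the bounds and the precision. The sketch below is a minimal illustration under two assumptions that are not stated in this part of the article: that n_b is the smallest number of bits able to index every value between the bounds at the requested number of decimal places, and that a block is decoded by a linear mapping onto its interval. The function names `bits_needed` and `decode_block` are hypothetical.

```python
def bits_needed(lower, upper, precision):
    """Smallest n_b whose 2**n_b codes cover [lower, upper] at `precision` decimals."""
    n_values = round((upper - lower) * 10 ** precision) + 1
    return max(1, (n_values - 1).bit_length())


def decode_block(bits, lower, upper):
    """Map a bit string linearly onto [lower, upper] (assumed decoding rule)."""
    n_b = len(bits)
    return lower + int(bits, 2) * (upper - lower) / (2 ** n_b - 1)


# Rows of the table: (name, lower bound, upper bound, precision).
rows = [("L1", 1, 512, 0), ("L2", 1, 512, 0), ("lr", 1e-6, 1e-1, 6)]
for name, lo, hi, pr in rows:
    print(name, bits_needed(lo, hi, pr))

# Decoding the all-zeros and all-ones blocks recovers the two bounds.
print(decode_block("0" * 9, 1, 512), decode_block("1" * 9, 1, 512))
```

Under these assumptions the rule also gives 1 bit for a 0/1 feature flag and 11 bits for a count between 1 and 2048, which agrees with the other representations described in the article.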
Table 3. Bounds of the search space for each parameter, its desired precision, and the resulting number of bits (μGA-FS representation).
| Binary Block | x^(L) | x^(U) | pr | n_b |
|---|---|---|---|---|
| Hidden layer 1 (L_1) | 1 | 512 | 0 | 9 |
| Hidden layer 2 (L_2) | 1 | 512 | 0 | 9 |
| Learning rate (lr) | 1 × 10^-6 | 1 × 10^-1 | 6 | 17 |
| S_1 | 0 | 1 | 0 | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| S_2048 | 0 | 1 | 0 | 1 |
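As an illustration of how such a chromosome could be handled in code, the sketch below slices a μGA-FS bit string into its three parameter blocks and a 2048-entry feature mask, following the block sizes given in the table and in Figure 5. The block order (L_1, L_2, lr, then one bit per feature) is taken from the Figure 5 caption; the names `HEAD_BLOCKS` and `split_chromosome` and the random example chromosome are hypothetical.

```python
import random

# Block layout: two 9-bit layer sizes, a 17-bit learning rate, 2048 feature flags.
HEAD_BLOCKS = [("L1", 9), ("L2", 9), ("lr", 17)]
N_FEATURES = 2048
CHROMOSOME_LENGTH = sum(n for _, n in HEAD_BLOCKS) + N_FEATURES


def split_chromosome(bits):
    """Return (parameter blocks as bit strings, boolean feature mask)."""
    if len(bits) != CHROMOSOME_LENGTH:
        raise ValueError("unexpected chromosome length")
    blocks, pos = {}, 0
    for name, n_bits in HEAD_BLOCKS:
        blocks[name] = bits[pos:pos + n_bits]
        pos += n_bits
    mask = [b == "1" for b in bits[pos:]]
    return blocks, mask


rng = random.Random(0)  # fixed seed so the example is repeatable
chromosome = "".join(rng.choice("01") for _ in range(CHROMOSOME_LENGTH))
blocks, mask = split_chromosome(chromosome)
selected = [i for i, keep in enumerate(mask) if keep]
print(CHROMOSOME_LENGTH, len(selected))
```

The indices in `selected` would then pick the columns of the 2048-dimensional feature matrix that are passed to the FC layers.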
Table 4. Bounds of the search space for each parameter, its desired precision, and the resulting number of bits (μGA-MRMR representation).
| Binary Block | x^(L) | x^(U) | pr | n_b |
|---|---|---|---|---|
| Hidden layer 1 (L_1) | 1 | 512 | 0 | 9 |
| Hidden layer 2 (L_2) | 1 | 512 | 0 | 9 |
| Learning rate (lr) | 1 × 10^-6 | 1 × 10^-1 | 6 | 17 |
| Number of selected features (n_s) | 1 | 2048 | 0 | 11 |
Table 5. Relationship between the three proposed representations and the four designed objective functions (f_1, f_2, f_3, f_4). Each combination of a representation with an objective function forms one of the eight variants of the proposed algorithm.
| Representation | Variants |
|---|---|
| μGA-1 | FS, F |
| μGA-FS | FSW, FW, FSH, FH |
| μGA-MRMR | FSR, FR |
Table 6. Description of the adopted datasets. n is the number of instances, d is the number of predictor variables, and c indicates the number of classes.
| Name | n | d | c | Source |
|---|---|---|---|---|
| Cataract | 601 | 2048 | 4 | [37] |
| Chessman | 556 | 2048 | 6 | [38] |
| COVID-19 | 317 | 2048 | 3 | [39] |
| Flowers | 3670 | 2048 | 5 | [40] |
| Leaves | 596 | 2048 | 4 | [41] |
| MIT-IS | 15,620 | 2048 | 67 | [42] |
| Painting | 8577 | 2048 | 5 | [43] |
| Plants | 2576 | 2048 | 27 | [44] |
| RPS | 2892 | 2048 | 3 | [45] |
| Skincancer | 3297 | 2048 | 2 | [46] |
| SRSMAS | 409 | 2048 | 14 | [47] |
| Weather | 1125 | 2048 | 4 | [48] |
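Table 7. Parameters of EvoPruneDeepTL.
| Group | Parameter | Value |
|---|---|---|
| Steady-state GA | Population size | 30 |
| Steady-state GA | Number of evaluations | 300 |
| Steady-state GA | Crossover probability (uniform) | 0.5 |
| Steady-state GA | Mutation probability | 0.07 |
| SGD optimizer | Learning rate (η) | 0.001 |
| SGD optimizer | Nesterov momentum | 0.9 |
| SGD optimizer | Batch size | 32 |
| SGD optimizer | Number of epochs | 100 |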
Table 8. Parameters of the proposed algorithm.
| Group | Parameter | Value |
|---|---|---|
| μGA | Population size (n_p) | 4 |
| μGA | Number of generations (g_max) | 50 |
| μGA | Crossover probability (p_c) | 0.9 |
| μGA | Convergence threshold | 0.05 |
| Adam optimizer | Batch size | 32 |
| Adam optimizer | Number of epochs | 100 |
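To show how the μGA parameters of Table 8 fit together, here is a toy micro-GA in the general style of Krishnakumar [16]: a very small population, elitism, no mutation, and a restart around the elite once the population has nominally converged. This is a sketch, not the authors' μGA-DNN. The bit-counting fitness (passed as `sum`), the binary tournament selection, the uniform crossover, and the reading of the convergence threshold as a maximum fraction of bits differing from the elite are all assumptions made for illustration; the function names are hypothetical.

```python
import random


def differing_fraction(a, b):
    """Fraction of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def micro_ga(fitness, n_bits, n_pop=4, g_max=50, p_c=0.9, threshold=0.05, seed=0):
    rng = random.Random(seed)

    def random_individual():
        return [rng.randint(0, 1) for _ in range(n_bits)]

    def tournament(pop):
        return max(rng.sample(pop, 2), key=fitness)

    pop = [random_individual() for _ in range(n_pop)]
    best = max(pop, key=fitness)
    history = []
    for _ in range(g_max):
        # Assumed convergence test: every individual is nearly identical to the elite.
        if all(differing_fraction(ind, best) <= threshold for ind in pop):
            pop = [best[:]] + [random_individual() for _ in range(n_pop - 1)]
        children = [best[:]]                      # elitism: the elite always survives
        while len(children) < n_pop:
            p1, p2 = tournament(pop), tournament(pop)
            if rng.random() < p_c:                # uniform crossover
                child = [x if rng.random() < 0.5 else y for x, y in zip(p1, p2)]
            else:
                child = p1[:]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
        history.append(fitness(best))
    return best, history


best, history = micro_ga(fitness=sum, n_bits=30)
print(history[0], history[-1])
```

In the setting of the article, evaluating the fitness would mean training the FC layers decoded from the chromosome, so each call is expensive; a population of four keeps the number of such evaluations per generation small.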
Table 9. Experimental results of ACC obtained on each dataset. Seven statistics summarizing the results for each method are shown below. The best values are indicated in bold.
| Name | FS | FSW | FSH | FSR | F | FW | FH | FR | ES |
|---|---|---|---|---|---|---|---|---|---|
| Cataract | 0.616 | 0.576 | 0.570 | 0.600 | 0.611 | 0.542 | 0.567 | 0.586 | 0.607 |
| Chessman | 0.798 | 0.775 | 0.769 | 0.800 | 0.778 | 0.764 | 0.741 | 0.801 | 0.756 |
| COVID-19 | 0.948 | 0.972 | 0.868 | 0.970 | 0.976 | 0.967 | 0.951 | 0.975 | 0.973 |
| Flowers | 0.881 | 0.866 | 0.862 | 0.878 | 0.874 | 0.793 | 0.848 | 0.869 | 0.879 |
| Leaves | 0.899 | 0.891 | 0.865 | 0.894 | 0.888 | 0.894 | 0.857 | 0.886 | 0.901 |
| MIT Indoor Scenes | 0.697 | 0.679 | 0.666 | 0.694 | 0.638 | 0.573 | 0.556 | 0.673 | 0.735 |
| Painting | 0.937 | 0.933 | 0.931 | 0.938 | 0.934 | 0.918 | 0.922 | 0.930 | 0.935 |
| Plants | 0.349 | 0.357 | 0.323 | 0.366 | 0.320 | 0.226 | 0.281 | 0.327 | 0.376 |
| RPS | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.999 | 1.000 | 1.000 | 1.000 |
| Skincancer | 0.819 | 0.861 | 0.854 | 0.859 | 0.855 | 0.810 | 0.847 | 0.859 | 0.869 |
| SRSMAS | 0.803 | 0.801 | 0.800 | 0.804 | 0.785 | 0.751 | 0.780 | 0.782 | 0.777 |
| Weather | 0.964 | 0.960 | 0.960 | 0.964 | 0.966 | 0.952 | 0.954 | 0.962 | 0.960 |
| Statistic | | | | | | | | | |
| Mean | 0.809 | 0.806 | 0.789 | 0.814 | 0.802 | 0.766 | 0.775 | 0.804 | 0.814 |
| STD | 0.176 | 0.180 | 0.182 | 0.176 | 0.188 | 0.214 | 0.202 | 0.186 | 0.172 |
| Median | 0.850 | 0.864 | 0.858 | 0.869 | 0.865 | 0.802 | 0.848 | 0.864 | 0.874 |
| MAD | 0.093 | 0.092 | 0.081 | 0.082 | 0.094 | 0.133 | 0.105 | 0.090 | 0.098 |
| Maximum | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.999 | 1.000 | 1.000 | 1.000 |
| Minimum | 0.349 | 0.357 | 0.323 | 0.366 | 0.320 | 0.226 | 0.281 | 0.327 | 0.376 |
| Count | 3 | 1 | 1 | 3 | 3 | 0 | 1 | 2 | 5 |
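The summary rows of this table can be recomputed from the per-dataset values. The sketch below does so for the first method column of Table 9 using only the Python standard library. Two points are inferences from the numbers rather than statements in the article: the tabulated STD agrees with the population standard deviation (dividing by n, not n − 1), and MAD agrees with the median absolute deviation from the median. The helper name `mad` is hypothetical.

```python
from statistics import mean, median, pstdev

# ACC values of the first method column of Table 9, one per dataset.
acc = [0.616, 0.798, 0.948, 0.881, 0.899, 0.697,
       0.937, 0.349, 1.000, 0.819, 0.803, 0.964]


def mad(values):
    """Median absolute deviation from the median."""
    centre = median(values)
    return median([abs(v - centre) for v in values])


summary = {
    "Mean": mean(acc),
    "STD": pstdev(acc),      # population standard deviation (assumed convention)
    "Median": median(acc),
    "MAD": mad(acc),
    "Maximum": max(acc),
    "Minimum": min(acc),
}
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Rounded to three decimals, the mean, standard deviation, and median reproduce the tabulated 0.809, 0.176, and 0.850; the MAD comes out at about 0.0925, which is consistent with the tabulated 0.093.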
Table 10. Experimental results of FSR obtained on each dataset. Seven statistics summarizing the results for each method are shown below. The best values are indicated in bold.
| Name | FS | FSW | FSH | FSR | F | FW | FH | FR | ES |
|---|---|---|---|---|---|---|---|---|---|
| Cataract | 1.000 | 0.508 | 0.504 | 0.522 | 1.000 | 0.501 | 0.495 | 0.468 | 0.536 |
| Chessman | 1.000 | 0.492 | 0.501 | 0.223 | 1.000 | 0.502 | 0.502 | 0.333 | 0.667 |
| COVID-19 | 1.000 | 0.503 | 0.498 | 0.333 | 1.000 | 0.498 | 0.497 | 0.233 | 0.496 |
| Flowers | 1.000 | 0.506 | 0.494 | 0.669 | 1.000 | 0.499 | 0.497 | 0.568 | 0.746 |
| Leaves | 1.000 | 0.501 | 0.507 | 0.349 | 1.000 | 0.504 | 0.498 | 0.357 | 0.614 |
| MIT Indoor Scenes | 1.000 | 0.497 | 0.501 | 0.629 | 1.000 | 0.501 | 0.504 | 0.488 | 0.935 |
| Painting | 1.000 | 0.499 | 0.503 | 0.693 | 1.000 | 0.497 | 0.505 | 0.586 | 0.728 |
| Plants | 1.000 | 0.502 | 0.499 | 0.506 | 1.000 | 0.501 | 0.500 | 0.555 | 0.856 |
| RPS | 1.000 | 0.500 | 0.497 | 0.298 | 1.000 | 0.498 | 0.501 | 0.474 | 0.240 |
| Skincancer | 1.000 | 0.499 | 0.501 | 0.530 | 1.000 | 0.495 | 0.500 | 0.500 | 0.667 |
| SRSMAS | 1.000 | 0.501 | 0.501 | 0.346 | 1.000 | 0.504 | 0.499 | 0.332 | 0.762 |
| Weather | 1.000 | 0.501 | 0.494 | 0.576 | 1.000 | 0.503 | 0.499 | 0.420 | 0.656 |
| Statistic | | | | | | | | | |
| Mean | 1.000 | 0.501 | 0.500 | 0.473 | 1.000 | 0.500 | 0.500 | 0.443 | 0.659 |
| STD | 0.000 | 0.004 | 0.004 | 0.151 | 0.000 | 0.003 | 0.003 | 0.105 | 0.173 |
| Median | 1.000 | 0.501 | 0.501 | 0.514 | 1.000 | 0.501 | 0.500 | 0.471 | 0.667 |
| MAD | 0.000 | 0.002 | 0.003 | 0.160 | 0.000 | 0.003 | 0.002 | 0.091 | 0.087 |
| Maximum | 1.000 | 0.508 | 0.507 | 0.693 | 1.000 | 0.504 | 0.505 | 0.586 | 0.935 |
| Minimum | 1.000 | 0.492 | 0.494 | 0.223 | 1.000 | 0.495 | 0.495 | 0.233 | 0.240 |
| Count | 0 | 0 | 2 | 2 | 0 | 2 | 0 | 5 | 1 |
Table 11. Experimental results of HNR obtained on each dataset. Seven statistics summarizing the results for each method are shown below. The best values are indicated in bold.
| Name | FS | FSW | FSH | FSR | F | FW | FH | FR | ES |
|---|---|---|---|---|---|---|---|---|---|
| Cataract | 0.352 | 0.367 | 0.447 | 0.384 | 0.096 | 0.117 | 0.132 | 0.093 | 1.000 |
| Chessman | 0.333 | 0.454 | 0.344 | 0.410 | 0.098 | 0.157 | 0.202 | 0.099 | 1.000 |
| COVID-19 | 0.221 | 0.338 | 0.183 | 0.215 | 0.092 | 0.141 | 0.143 | 0.080 | 1.000 |
| Flowers | 0.326 | 0.382 | 0.344 | 0.273 | 0.100 | 0.127 | 0.116 | 0.098 | 1.000 |
| Leaves | 0.290 | 0.350 | 0.239 | 0.340 | 0.102 | 0.154 | 0.149 | 0.091 | 1.000 |
| MIT Indoor Scenes | 0.491 | 0.593 | 0.465 | 0.454 | 0.129 | 0.247 | 0.195 | 0.117 | 1.000 |
| Painting | 0.180 | 0.309 | 0.318 | 0.199 | 0.105 | 0.154 | 0.153 | 0.093 | 1.000 |
| Plants | 0.422 | 0.634 | 0.453 | 0.461 | 0.092 | 0.158 | 0.189 | 0.114 | 1.000 |
| RPS | 0.122 | 0.148 | 0.134 | 0.113 | 0.099 | 0.172 | 0.141 | 0.093 | 1.000 |
| Skincancer | 0.233 | 0.360 | 0.322 | 0.217 | 0.086 | 0.139 | 0.116 | 0.081 | 1.000 |
| SRSMAS | 0.454 | 0.499 | 0.492 | 0.345 | 0.130 | 0.161 | 0.216 | 0.111 | 1.000 |
| Weather | 0.151 | 0.284 | 0.192 | 0.170 | 0.100 | 0.147 | 0.116 | 0.099 | 1.000 |
| Statistic | | | | | | | | | |
| Mean | 0.298 | 0.393 | 0.328 | 0.298 | 0.102 | 0.156 | 0.156 | 0.097 | 1.000 |
| STD | 0.115 | 0.129 | 0.116 | 0.112 | 0.013 | 0.031 | 0.034 | 0.011 | 0.000 |
| Median | 0.308 | 0.364 | 0.333 | 0.307 | 0.100 | 0.154 | 0.146 | 0.096 | 1.000 |
| MAD | 0.101 | 0.067 | 0.117 | 0.098 | 0.004 | 0.010 | 0.030 | 0.004 | 0.000 |
| Maximum | 0.491 | 0.634 | 0.492 | 0.461 | 0.130 | 0.247 | 0.216 | 0.117 | 1.000 |
| Minimum | 0.122 | 0.148 | 0.134 | 0.113 | 0.086 | 0.117 | 0.116 | 0.080 | 1.000 |
| Count | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 10 | 0 |
Table 12. Experimental results of MC obtained on each dataset. Seven statistics summarizing the results for each method are shown below. The best values are indicated in bold.
| Name | FS | FSW | FSH | FSR | F | FW | FH | FR | ES |
|---|---|---|---|---|---|---|---|---|---|
| Cataract | 5.5 × 10^5 | 2.2 × 10^5 | 3.4 × 10^5 | 2.8 × 10^5 | 1.0 × 10^5 | 6.3 × 10^4 | 7.5 × 10^4 | 5.2 × 10^4 | 5.6 × 10^5 |
| Chessman | 3.6 × 10^5 | 2.9 × 10^5 | 2.1 × 10^5 | 1.5 × 10^5 | 9.9 × 10^4 | 7.4 × 10^4 | 1.4 × 10^5 | 2.9 × 10^4 | 7.0 × 10^5 |
| COVID-19 | 1.8 × 10^5 | 2.5 × 10^5 | 1.0 × 10^5 | 8.4 × 10^4 | 9.5 × 10^4 | 7.8 × 10^4 | 8.9 × 10^4 | 2.1 × 10^4 | 5.2 × 10^5 |
| Flowers | 4.6 × 10^5 | 2.5 × 10^5 | 2.4 × 10^5 | 2.1 × 10^5 | 1.1 × 10^5 | 7.0 × 10^4 | 6.6 × 10^4 | 4.6 × 10^4 | 7.8 × 10^5 |
| Leaves | 3.9 × 10^5 | 2.5 × 10^5 | 1.5 × 10^5 | 1.7 × 10^5 | 1.1 × 10^5 | 1.0 × 10^5 | 7.8 × 10^4 | 3.1 × 10^4 | 6.5 × 10^5 |
| MIT Indoor Scenes | 5.6 × 10^5 | 4.1 × 10^5 | 3.6 × 10^5 | 4.3 × 10^5 | 1.5 × 10^5 | 1.4 × 10^5 | 1.3 × 10^5 | 7.0 × 10^4 | 1.0 × 10^6 |
| Painting | 2.1 × 10^5 | 1.9 × 10^5 | 2.0 × 10^5 | 1.6 × 10^5 | 1.2 × 10^5 | 6.9 × 10^4 | 8.4 × 10^4 | 4.4 × 10^4 | 7.7 × 10^5 |
| Plants | 5.8 × 10^5 | 5.1 × 10^5 | 3.0 × 10^5 | 3.4 × 10^5 | 1.1 × 10^5 | 8.4 × 10^4 | 1.5 × 10^5 | 7.0 × 10^4 | 9.1 × 10^5 |
| RPS | 1.6 × 10^5 | 9.8 × 10^4 | 7.7 × 10^4 | 4.7 × 10^4 | 1.0 × 10^5 | 9.6 × 10^4 | 6.3 × 10^4 | 6.0 × 10^4 | 2.5 × 10^5 |
| Skincancer | 2.6 × 10^5 | 2.5 × 10^5 | 2.3 × 10^5 | 1.8 × 10^5 | 6.9 × 10^4 | 7.6 × 10^4 | 6.3 × 10^4 | 3.9 × 10^4 | 7.0 × 10^5 |
| SRSMAS | 5.7 × 10^5 | 3.5 × 10^5 | 3.2 × 10^5 | 1.7 × 10^5 | 1.7 × 10^5 | 9.1 × 10^4 | 1.2 × 10^5 | 3.7 × 10^4 | 8.1 × 10^5 |
| Weather | 1.9 × 10^5 | 2.1 × 10^5 | 1.3 × 10^5 | 1.2 × 10^5 | 1.2 × 10^5 | 7.3 × 10^4 | 6.7 × 10^4 | 3.5 × 10^4 | 6.9 × 10^5 |
| Statistic | | | | | | | | | |
| Mean | 3.7 × 10^5 | 2.7 × 10^5 | 2.2 × 10^5 | 2.0 × 10^5 | 1.1 × 10^5 | 8.5 × 10^4 | 9.3 × 10^4 | 4.5 × 10^4 | 7.0 × 10^5 |
| STD | 1.6 × 10^5 | 1.0 × 10^5 | 9.1 × 10^4 | 1.0 × 10^5 | 2.4 × 10^4 | 2.0 × 10^4 | 3.0 × 10^4 | 1.5 × 10^4 | 1.9 × 10^5 |
| Median | 3.8 × 10^5 | 2.5 × 10^5 | 2.2 × 10^5 | 1.7 × 10^5 | 1.1 × 10^5 | 7.7 × 10^4 | 8.1 × 10^4 | 4.1 × 10^4 | 7.0 × 10^5 |
| MAD | 1.8 × 10^5 | 4.0 × 10^4 | 8.6 × 10^4 | 4.4 × 10^4 | 1.1 × 10^4 | 7.9 × 10^3 | 1.7 × 10^4 | 1.1 × 10^4 | 9.4 × 10^4 |
| Maximum | 5.8 × 10^5 | 5.1 × 10^5 | 3.6 × 10^5 | 4.3 × 10^5 | 1.7 × 10^5 | 1.4 × 10^5 | 1.5 × 10^5 | 7.0 × 10^4 | 1.0 × 10^6 |
| Minimum | 1.6 × 10^5 | 9.8 × 10^4 | 7.7 × 10^4 | 4.7 × 10^4 | 6.9 × 10^4 | 6.3 × 10^4 | 6.3 × 10^4 | 2.1 × 10^4 | 2.5 × 10^5 |
| Count | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 11 | 0 |
Table 13. Experimental results of the number of objective function evaluations obtained on each dataset. Seven statistics summarizing the results for each method are shown below. The best values are indicated in bold.
| Name | FS | FSW | FSH | FSR | F | FW | FH | FR | ES |
|---|---|---|---|---|---|---|---|---|---|
| Cataract | 249.2 | 250.8 | 247.6 | 249.2 | 251.2 | 249.6 | 247.2 | 253.2 | 300.0 |
| Chessman | 251.6 | 247.2 | 252.4 | 253.6 | 254.0 | 248.4 | 250.4 | 253.2 | 300.0 |
| COVID-19 | 249.2 | 248.0 | 243.6 | 246.8 | 250.8 | 250.0 | 248.8 | 252.0 | 300.0 |
| Flowers | 249.2 | 248.8 | 241.2 | 253.6 | 252.8 | 246.8 | 250.8 | 250.4 | 300.0 |
| Leaves | 245.2 | 252.0 | 242.8 | 248.0 | 250.0 | 250.4 | 246.8 | 250.4 | 300.0 |
| MIT Indoor Scenes | 251.2 | 250.0 | 244.4 | 252.0 | 249.6 | 246.4 | 250.0 | 255.2 | 300.0 |
| Painting | 252.0 | 241.2 | 236.8 | 252.4 | 251.6 | 250.8 | 246.0 | 248.0 | 300.0 |
| Plants | 250.4 | 248.0 | 245.6 | 248.8 | 253.6 | 248.0 | 247.6 | 254.0 | 300.0 |
| RPS | 246.8 | 240.4 | 239.2 | 243.6 | 248.4 | 241.2 | 247.2 | 248.0 | 300.0 |
| Skincancer | 250.8 | 246.8 | 244.0 | 250.4 | 252.4 | 249.6 | 248.0 | 248.8 | 300.0 |
| SRSMAS | 249.6 | 255.2 | 254.8 | 249.2 | 253.2 | 253.2 | 247.6 | 258.0 | 300.0 |
| Weather | 250.4 | 250.0 | 237.6 | 250.0 | 253.6 | 244.8 | 248.0 | 252.4 | 300.0 |
| Statistic | | | | | | | | | |
| Mean | 249.6 | 248.2 | 244.2 | 249.8 | 251.8 | 248.3 | 248.2 | 252.0 | 300.0 |
| STD | 1.887 | 3.978 | 5.227 | 2.788 | 1.724 | 3.021 | 1.438 | 2.896 | 0.000 |
| Median | 250.0 | 248.4 | 243.8 | 249.6 | 252.0 | 249.0 | 247.8 | 252.2 | 300.0 |
| MAD | 0.800 | 1.600 | 3.200 | 2.000 | 1.400 | 1.600 | 0.800 | 1.800 | 0.000 |
| Maximum | 252.0 | 255.2 | 254.8 | 253.6 | 254.0 | 253.2 | 250.8 | 258.0 | 300.0 |
| Minimum | 245.2 | 240.4 | 236.8 | 243.6 | 248.4 | 241.2 | 246.0 | 248.0 | 300.0 |
| Count | 0 | 1 | 9 | 0 | 0 | 0 | 2 | 0 | 0 |
Table 14. Results of the percentage of similarity between the binary vectors of selected features in each independent experiment in terms of the Hamming distance obtained in each dataset. Seven statistics summarizing the results for each method are shown below. The best values are indicated in bold.
| Name | FS | FSW | FSH | FSR | F | FW | FH | FR | ES |
|---|---|---|---|---|---|---|---|---|---|
| Cataract | 0.000 | 0.498 | 0.499 | 0.333 | 0.000 | 0.499 | 0.497 | 0.369 | 0.500 |
| Chessman | 0.000 | 0.501 | 0.504 | 0.164 | 0.000 | 0.498 | 0.501 | 0.257 | 0.445 |
| COVID-19 | 0.000 | 0.500 | 0.503 | 0.283 | 0.000 | 0.501 | 0.501 | 0.158 | 0.501 |
| Flowers | 0.000 | 0.499 | 0.500 | 0.276 | 0.000 | 0.498 | 0.497 | 0.219 | 0.379 |
| Leaves | 0.000 | 0.500 | 0.502 | 0.313 | 0.000 | 0.501 | 0.500 | 0.301 | 0.478 |
| MIT Indoor Scenes | 0.000 | 0.500 | 0.498 | 0.261 | 0.000 | 0.502 | 0.502 | 0.230 | 0.122 |
| Painting | 0.000 | 0.499 | 0.500 | 0.246 | 0.000 | 0.501 | 0.500 | 0.298 | 0.396 |
| Plants | 0.000 | 0.503 | 0.501 | 0.313 | 0.000 | 0.499 | 0.501 | 0.268 | 0.247 |
| RPS | 0.000 | 0.498 | 0.500 | 0.304 | 0.000 | 0.499 | 0.500 | 0.421 | 0.365 |
| Skincancer | 0.000 | 0.498 | 0.501 | 0.359 | 0.000 | 0.497 | 0.501 | 0.337 | 0.442 |
| SRSMAS | 0.000 | 0.499 | 0.497 | 0.190 | 0.000 | 0.500 | 0.501 | 0.231 | 0.365 |
| Weather | 0.000 | 0.500 | 0.500 | 0.343 | 0.000 | 0.499 | 0.499 | 0.362 | 0.454 |
| Statistic | | | | | | | | | |
| Mean | 0.000 | 0.500 | 0.500 | 0.282 | 0.000 | 0.499 | 0.500 | 0.288 | 0.391 |
| STD | 0.000 | 0.001 | 0.002 | 0.057 | 0.000 | 0.001 | 0.001 | 0.072 | 0.106 |
| Median | 0.000 | 0.499 | 0.500 | 0.294 | 0.000 | 0.499 | 0.500 | 0.283 | 0.419 |
| MAD | 0.000 | 0.001 | 0.001 | 0.036 | 0.000 | 0.001 | 0.001 | 0.054 | 0.054 |
| Maximum | 0.000 | 0.503 | 0.504 | 0.359 | 0.000 | 0.502 | 0.502 | 0.421 | 0.501 |
| Minimum | 0.000 | 0.498 | 0.497 | 0.164 | 0.000 | 0.497 | 0.497 | 0.158 | 0.122 |
| Count | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 4 | 2 |
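A minimal sketch of how a Hamming-based comparison between feature masks from independent runs can be computed is given below. It assumes the tabulated value is a normalized Hamming distance averaged over all pairs of runs, so that identical masks give 0 and unrelated random masks give about 0.5; the exact aggregation used in the article is not restated here, and the function names and toy masks are hypothetical.

```python
from itertools import combinations


def normalized_hamming(a, b):
    """Fraction of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_pairwise_hamming(masks):
    """Average normalized Hamming distance over all pairs of masks."""
    pairs = list(combinations(masks, 2))
    return sum(normalized_hamming(a, b) for a, b in pairs) / len(pairs)


# Toy masks standing in for the selected-feature vectors of three runs.
runs = [
    [1, 1, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 0, 0, 0],
]
print(mean_pairwise_hamming(runs))

# Runs that keep every feature are identical, so their distance is zero.
print(mean_pairwise_hamming([[1] * 8, [1] * 8, [1] * 8]))
```

This reading is consistent with the columns of the table that use no feature selection, where every run keeps all features and the reported value is 0.000.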
Table 15. Runtime results (in seconds) for each independent experiment obtained on each dataset. Seven statistics summarizing the results for each method are shown.
| Name | FS | FSW | FSH | FSR | F | FW | FH | FR | ES |
|---|---|---|---|---|---|---|---|---|---|
| Cataract | 3979.2 | 3848.3 | 4147.0 | 8806.3 | 5995.5 | 4211.9 | 8245.6 | 3216.7 | 5143.2 |
| Chessman | 4222.0 | 3843.2 | 4491.5 | 9413.0 | 6734.1 | 5138.0 | 7530.5 | 3443.4 | 6653.2 |
| COVID-19 | 2642.1 | 3186.6 | 3186.6 | 3810.2 | 7141.2 | 3122.8 | 6294.3 | 2538.0 | 5521.4 |
| Flowers | 14,724.3 | 12,703.0 | 23,039.7 | 19,934.2 | 22,521.7 | 14,444.0 | 25,522.2 | 15,989.5 | 23,206.4 |
| Leaves | 4202.4 | 4107.8 | 4111.9 | 5991.3 | 6670.5 | 3240.6 | 7113.7 | 7219.4 | 7278.7 |
| MIT Indoor Scenes | 46,470.8 | 40,690.1 | 39,701.5 | 51,172.4 | 46,451.4 | 46,016.7 | 43,993.5 | 54,461.5 | 59,243.7 |
| Painting | 34,243.5 | 34,206.0 | 32,419.4 | 36,665.5 | 34,821.8 | 33,535.0 | 49,336.5 | 42,956.4 | 49,177.9 |
| Plants | 12,191.3 | 10,965.9 | 14,134.0 | 12,613.0 | 16,206.6 | 9014.7 | 18,021.1 | 11,433.4 | 16,839.1 |
| RPS | 11,306.1 | 9309.9 | 15,451.0 | 19,411.4 | 16,854.8 | 11,502.3 | 21,079.9 | 19,926.0 | 17,154.9 |
| Skincancer | 14,809.9 | 12,778.0 | 20,445.7 | 14,653.2 | 21,197.2 | 14,366.5 | 22,900.5 | 23,124.5 | 19,775.6 |
| SRSMAS | 3035.0 | 3376.3 | 3988.6 | 3536.4 | 5545.1 | 3679.4 | 6441.5 | 3114.4 | 4588.2 |
| Weather | 5463.2 | 6263.2 | 5925.4 | 6276.9 | 13,010.8 | 6269.7 | 10,675.6 | 10,159.1 | 7741.5 |
| Statistic | | | | | | | | | |
| Mean | 13,107.5 | 12,106.5 | 14,253.5 | 16,023.7 | 16,929.2 | 12,878.5 | 18,929.6 | 16,465.2 | 18,527.0 |
| STD | 13,167.0 | 11,922.4 | 11,874.6 | 13,839.4 | 12,285.5 | 12,902.5 | 14,097.6 | 15,994.6 | 17,204.4 |
| Median | 8384.6 | 7786.5 | 10,029.7 | 11,013.0 | 14,608.7 | 7642.2 | 14,348.3 | 10,796.2 | 12,290.3 |
| MAD | 4877.5 | 4176.8 | 5979.5 | 6112.2 | 7893.8 | 4182.2 | 7570.7 | 7630.7 | 6958.0 |
| Maximum | 46,470.8 | 40,690.1 | 39,701.5 | 51,172.4 | 46,451.4 | 46,016.7 | 49,336.5 | 54,461.5 | 59,243.7 |
| Minimum | 2642.1 | 3186.6 | 3186.6 | 3536.4 | 5545.1 | 3122.8 | 6294.3 | 2538.0 | 4588.2 |
Table 16. Comparison of the experimental results of ACC obtained on each dataset, between the models obtained by the methods FSR and FR, and the model trained from the reference DNN architecture. Seven statistics summarizing the results for each method are shown below. The best values are indicated in bold.
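| Name | FSR | FR | DNN |
| --- | --- | --- | --- |
| Cataract | 0.600 | 0.586 | 0.631 |
| Chessman | 0.800 | 0.801 | 0.791 |
| COVID-19 | 0.970 | 0.975 | 0.981 |
| Flowers | 0.878 | 0.869 | 0.881 |
| Leaves | 0.894 | 0.886 | 0.901 |
| MIT Indoor Scenes | 0.694 | 0.673 | 0.700 |
| Painting | 0.938 | 0.930 | 0.941 |
| Plants | 0.366 | 0.327 | 0.374 |
| RPS | 1.000 | 1.000 | 1.000 |
| Skincancer | 0.859 | 0.859 | 0.868 |
| SRSMAS | 0.804 | 0.782 | 0.815 |
| Weather | 0.964 | 0.962 | 0.966 |
| Statistic | | | |
| Mean | 0.814 | 0.804 | 0.821 |
| STD | 0.176 | 0.186 | 0.172 |
| Median | 0.869 | 0.864 | 0.875 |
| MAD | 0.082 | 0.090 | 0.087 |
| Maximum | 1.000 | 1.000 | 1.000 |
| Minimum | 0.366 | 0.327 | 0.374 |
| Count | 1 | 2 | 11 |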
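The seven summary statistics of Table 16 can be recomputed from its per-dataset ACC values. The sketch below is illustrative only and is not the authors' code: the names `ACC`, `mad`, `summarize`, and `best_counts` are hypothetical, the values are transcribed from the table, and three readings are assumptions: STD is taken as the population standard deviation, MAD as the median absolute deviation from the median, and Count as the number of datasets on which a method attains the highest ACC, with ties credited to every tied method. Under these assumptions the FSR column and the Count row agree with the reported values to within rounding.

```python
from statistics import mean, median, pstdev

# ACC values transcribed from Table 16, in dataset order (Cataract ... Weather).
ACC = {
    "FSR": [0.600, 0.800, 0.970, 0.878, 0.894, 0.694,
            0.938, 0.366, 1.000, 0.859, 0.804, 0.964],
    "FR":  [0.586, 0.801, 0.975, 0.869, 0.886, 0.673,
            0.930, 0.327, 1.000, 0.859, 0.782, 0.962],
    "DNN": [0.631, 0.791, 0.981, 0.881, 0.901, 0.700,
            0.941, 0.374, 1.000, 0.868, 0.815, 0.966],
}


def mad(values):
    """Median absolute deviation from the median (assumed meaning of MAD)."""
    m = median(values)
    return median([abs(v - m) for v in values])


def summarize(values):
    """Six per-column statistics; Count is computed across columns separately."""
    return {
        "Mean": mean(values),
        "STD": pstdev(values),  # population form (divide by n): an assumption
        "Median": median(values),
        "MAD": mad(values),
        "Maximum": max(values),
        "Minimum": min(values),
    }


def best_counts(table):
    """Datasets on which each method reaches the highest ACC; ties count for all."""
    counts = {name: 0 for name in table}
    for row in zip(*table.values()):
        top = max(row)
        for name, value in zip(table, row):
            if value == top:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    counts = best_counts(ACC)
    for name, values in ACC.items():
        stats = {k: round(v, 3) for k, v in summarize(values).items()}
        stats["Count"] = counts[name]
        print(name, stats)
```

Because the table reports values rounded to three decimals, statistics recomputed from them can differ from the published ones in the last digit.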
