1. Introduction
Advances in computing and networking have markedly improved people's day-to-day lives and activities. Computing technologies such as the cloud, the Internet of Things (IoT), and big data have also grown rapidly and have become indispensable [1,2]. Data uploaded to these technologies over a network connection passes through several stages, including creation, transfer, storage, and deletion [3]. The information handled at each of these stages is important, particularly in financial, medical, government, or military applications [4]. As the number of Internet users increases, so does the circulation of sensitive information, which invites hackers and attackers to threaten the data with sophisticated software tools [5]. Today's Internet is also widely misused through attacks, network failures, abuse, and malicious behavior. Prevalent recent attacks include worms, denial of service (DoS), and port scanning attacks [6,7]. These attacks and malicious behaviors waste resources, threaten users' sensitive information, and degrade the performance of end hosts and network equipment [8].
System vulnerabilities are the root cause of intrusions, since they allow intruders to destroy, steal, or alter information and thereby damage the systems attached to the network [9]. Because many attacks threaten the normal functioning of networks and systems, effective security and data privacy mechanisms are urgently needed [10]. Dynamic management of network traffic is crucial for responding to security threats as the Internet develops [11]. Important criteria to consider when developing any security measure include monitoring network performance, accurately understanding network dynamics and operations, meeting most user needs, and ensuring quality of service (QoS) [12,13]. When securing networked systems, security engineers need to focus on confidentiality, integrity, and availability (the CIA triad) to ensure stable performance [14].
Intrusion detection systems (IDSs) are a central focus of the research community when dealing with security threats. These systems have recently provided useful solutions to the challenges posed by malicious behaviors [15]. An IDS can be software, hardware, or a combination of both, and it detects suspicious activity either on a host (host-based IDS) or on the network (network-based IDS). IDSs are categorized as specification-based, signature-based, or anomaly-based [16]. The first category monitors deviations in incoming network packets using a set of rules, the second matches the traffic flow against stored signatures of known attacks, and the third identifies threats as departures from normal network behavior [17,18]. With the rapid growth in attacks, the main drawback of signature-based IDSs is that they need continuous database updates, while specification-based IDSs require expert knowledge. The third type is considered more advanced than the other two and mainly relies on classification algorithms. A review of existing mechanisms shows that much work has been done on intrusion detection, and artificial intelligence and machine learning algorithms are preferred for accurate attack classification [19,20]. Several deep learning and machine learning methodologies have performed well in distinguishing normal from attack instances, and most aim to identify the different types of attacks threatening the network.
In [21], an intrusion detection method based on an adaptive weighted kernel support vector machine is presented. This model uses a Pareto-based ensemble technique to obtain optimal features, and a set of extraction criteria identifies typical and unusual patterns of network traffic. An enhanced pelican optimization-based SVM [22] and other machine learning models [23] have also been employed. Although traditional machine learning has produced promising results in identifying abnormal traffic, it relies heavily on manual extraction of traffic features. Greater human intervention decreases its robustness and its ability to generalize. In addition, classical machine learning models have limited learning and classification ability owing to their shallow structure. Compared with conventional machine learning, deep learning is proficient at handling complex encrypted traffic and at automatically learning feature representations from complex data without human interaction. Despite these advantages, real-time detection is challenging with deep learning because of its more complex internal structure and its comparatively high training and prediction times. These models also suffer from issues such as overfitting and limited classification ability. Moreover, the evolution of cyber threats makes it difficult to keep labeled datasets current and representative of actual attack conditions. A model can be trained in a supervised manner only after the data have been labeled, and its learning ability improves with the quality of the labels. However, the available labeled data fall far short of application requirements, and labeling is expensive. Therefore, this paper focuses on a novel semi-supervised deep reinforcement learning (RL) model that classifies instances as normal or attack. The main contributions of the paper are listed below:
A novel semi-supervised enhanced deep RL (EDRL) framework is designed to improve network security.
A deep autoencoder (AE) and an improved flamingo search algorithm (IFSA) are incorporated into deep RL to improve classification performance and minimize the manual labeling cost.
An IFSA-based optimization process is adopted in EDRL for optimal policy selection, making the framework flexible enough to be applied to a range of complex problems.
A support vector machine (SVM) classifier is employed to classify attack and normal instances based on the traffic features.
Extensive evaluations are carried out on the UNSW-NB15 dataset under various scenarios to demonstrate the performance of the proposed method relative to existing works.
The paper is structured as follows: a literature review of the existing intrusion detection and response schemes is presented in Section 2. The proposed methodology with algorithms is described in Section 3, the comparative analysis and results of the proposed framework are presented in Section 4, and Section 5 concludes the paper and outlines future work.
2. Related Work
Some of the recent existing methodologies related to intrusion detection and response are reviewed below:
Intrusion detection in networks is a vast research area, and many works address it with different techniques. Deep learning-based models are particularly common because they are effective classifiers and often yield better accuracy than other techniques. With this motivation, Dutta et al. [24] introduced a deep learning-based hybrid model. The method hybridized the classical autoencoder (CAE) with a deep neural network (DNN) to perform intrusion detection in two phases: feature engineering with the CAE model, followed by classification with the DNN. The method was validated on the UNSW-NB15 dataset, and the outcomes showed that it was effective in terms of several metrics.
Kamarudin et al. [25] presented a framework that classifies web attacks based on attack features using an ensemble classification system. First, a filter and wrapper selection procedure discarded irrelevant and redundant features. Then, the LogitBoost algorithm was used with random forest (RF) as the weak classifier. Three datasets, the improved KDD Cup 1999, NSL-KDD, and UNSW-NB15, were used to evaluate the framework, and the results demonstrated its efficacy.
Another method for detecting malicious activity in the network was introduced by Kumar et al. [26], where the UNSW-NB15 dataset was used for in-depth evaluation. It served as an offline dataset for developing an integrated classification model. The authors also generated a real-time dataset at the NIT Patna CSE lab (RTNITP18), which served as a working example of the proposed method and was used as a testing dataset. Evaluation on both datasets showed that the method achieved higher accuracy and a lower false alarm rate than other models. The method also acted as a watchdog that identifies several types of threats in the network.
Distributed denial of service (DDoS) attacks put pressure on information systems and computer networks, and it is highly challenging to detect them before mitigation measures can be taken. Shieh et al. [27] therefore presented a method aimed at comprehensive detection of DDoS threats. The proposed integrated model combined a Gaussian mixture model (GMM), bi-directional long short-term memory (Bi-LSTM), and incremental learning for attack detection. The GMM was employed to capture unknown traffic data, which was then labeled by traffic engineers and fed back to the model as training data. Two main datasets, CIC-IDS2017 and CIC-DDoS2019, were utilized for evaluations, and the method provided promising detection results.
Traditional machine learning-based algorithms cannot fully process massive high-dimensional data, and labeling has to be performed manually, which increases cost. To deal with this problem, Dong et al. [28] introduced an intrusion detection framework called the semi-supervised double deep Q-network (SSDDQN). The framework was based on a deep reinforcement learning algorithm and first used an autoencoder model to reconstruct the traffic features. A DNN model was then used as the classifier, where the target network first applied a k-means clustering algorithm and then a DNN model for prediction. The experiments were carried out on the AWID and NSL-KDD datasets, and the results showed that the model was efficient and effective.
Safa Mohamed and Ridha Ejbali [29] suggested a deep SARSA-based reinforcement learning method for anomaly-based network IDSs. The primary goal of the deep SARSA model was to improve the detection accuracy for modern and intricate network attacks. Two well-known benchmarks, NSL-KDD and UNSW-NB15, were used to validate the performance of the approach. According to the experimental data, this method performed better than other models in terms of accuracy, precision, recall, and F1 score. However, even with improved performance, the error rate of this model was higher.
Shaimaa Ahmed Elsaid et al. [30] introduced hybrid IDS models based on Grey Wolf Optimization (GWO) and deep learning. The first system combined a Gated Recurrent Unit (GRU) with GWO (GRU-GWO), while the second combined Long Short-Term Memory (LSTM) with GWO (LSTM-GWO). These systems seek to improve feature selection by decreasing dimensionality and increasing detection accuracy. The experimental results showed notable gains in intrusion detection accuracy and computing efficiency, highlighting the effectiveness of the DL-GWO technique in improving network security. However, the model could suffer from overfitting and gradient explosion issues.
Further, Mohammed Jouhari et al. [31] suggested a lightweight Convolutional Neural Network (CNN) with bidirectional Long Short-Term Memory (BiLSTM) to provide an efficient IDS model. To reduce the complexity of the model, a feature selection technique was used to retain only a few features. This made the method suitable for IoT devices with limited resources, since it stayed within their computing power. Developing a model that meets the requirements of IoT devices while achieving increased precision proved difficult, however, and the accuracy obtained by this model was very low.
3. Proposed Methodology
Network intrusion detection has been a popular research topic for several decades. Users from different organizations and medical institutions upload their data over the network to servers for storage, and attackers try to tamper with or misuse the data in transit. To counter this, several IDSs have been introduced. This work proposes an intelligent attack classification model based on EDRL, which enhances conventional deep RL methods. The architecture of the proposed EDRL model is presented in Figure 1.
In the proposed model, pre-processing is first performed using null value removal and standard scaler normalization. Deep reinforcement learning (RL) is then enhanced by adopting a deep AE to reconstruct the traffic features, IFSA for optimal policy selection, and an SVM for classification.
3.1. Pre-Processing
Data pre-processing is indispensable for numerous data mining tasks and for IDSs. In the proposed method, pre-processing is performed using null value removal and standard scaler normalization. Handling null values is an important part of pre-processing because, when training and evaluating models, null values can severely impede learning algorithms and often lead to inaccurate or biased outcomes. Null value dropping, a popular strategy for this problem, eliminates the rows or columns of the dataset that contain null values. Normalization is another fundamental pre-processing step, specifically for IDSs that rely on statistical features extracted from the available data. In the proposed method, the standard scaler normalizes the data by removing the mean and scaling to unit variance. The standard scaler is expressed as follows:
$$z = \frac{x - \mu}{\sigma}$$
where $x$ is the original feature value, $z$ is its normalized value, $\mu$ denotes the mean, and $\sigma$ signifies the standard deviation.
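As an illustration, the two pre-processing steps can be sketched in plain Python. This is a minimal sketch rather than the implementation used in this work: the function names `drop_null_rows` and `standard_scale` and the toy records are hypothetical, nulls are dropped row-wise, and the population form of the standard deviation is assumed.

```python
import math

def drop_null_rows(rows):
    """Keep only the rows that contain no missing value (None or NaN)."""
    def is_null(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return [row for row in rows if not any(is_null(v) for v in row)]

def standard_scale(rows):
    """Column-wise z-score: (x - mean) / standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    # A constant column has zero deviation; dividing by 1 instead is an assumption.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Toy records with two numeric features; the second record has a null value.
raw = [[1.0, 200.0], [2.0, None], [3.0, 400.0], [5.0, 600.0]]
clean = drop_null_rows(raw)     # the record containing None is removed
scaled = standard_scale(clean)  # each column now has zero mean and unit variance
```

Dropping rows rather than columns keeps every feature available to the classifier, at the cost of discarding incomplete records.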
3.2. Attack Classification
In this section, an enhanced Deep Reinforcement Learning (DRL) based model is developed for attack classification. The enhancement of DRL is carried out based on the Deep Auto Encoder (DAE) and improved flamingo search algorithm (IFSA). This incorporation is driven by the need for effective feature extraction and robust optimization. Here, the following two updates are made in the reinforcement learning (RL) model to improve the classification ability.
The first update to be made in DRL is the employment of a Deep Autoencoder (DAE) to approximate the Q-function effectively instead of using a Q-table, and a further SVM is used for Q-value prediction.
The second update to be made in DRL is the integration of an improved flamingo search algorithm (IFSA) for optimal policy selection instead of employing the stochastic gradient descent method.
Moreover, deep reinforcement learning follows a semi-supervised model that uses a small amount of labeled data and a large amount of unlabeled data to train a model.
Basically, the RL model uses a Q-table to map the relationships between states and actions. However, a Q-table covering a large number of states and actions leads to the curse of dimensionality. Therefore, instead of a Q-table, the DRL model uses a Deep Auto Encoder (DAE) to approximate the Q-function, which maps the relationships between states and actions in the same way as the Q-table. In RL training, the agent selects and executes actions according to the Markov Decision Process (MDP) with the greedy policy, in which an action is selected at random with a given probability (the usual ε-greedy scheme). Instead of random selection, the proposed method uses IFSA for optimal policy selection to enhance the ability of the MDP.
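For reference, the conventional random action selection that IFSA replaces can be sketched as follows. The sketch assumes the standard ε-greedy rule; the function name `epsilon_greedy`, the parameter `epsilon`, and the example Q-values are illustrative and are not taken from this work.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the best-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical Q-values for action 0 ("normal") and action 1 ("attack").
q = [0.2, 0.9]
greedy_choice = epsilon_greedy(q, epsilon=0.0)  # epsilon = 0 never explores
```

With `epsilon=0.0` the rule is purely greedy and always returns the index of the largest Q-value, while larger values of `epsilon` trade exploitation for exploration.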
The EDRL model uses both labeled and unlabeled data in the semi-supervised learning scheme of DRL to enhance the classification of attack and normal classes. The initial Q-function is trained on labeled data, which include instances of recognized attacks. This allows the model to pick up specific patterns and behaviors associated with these attacks, and the labeled dataset establishes a baseline understanding of the normal and abnormal states of the network. The unlabeled data, which can contain a mix of regular traffic and unidentified attack patterns, are processed by a deep AE. Because the deep AE reconstructs the traffic features, the model can capture the underlying structure of the data without explicit labels. The Q-function is then optimized using the IFSA, which makes it easier to select the best course of action. By combining the learned feature representations of the deep AE with the IFSA, the model can assign pseudo-labels to the unlabeled data. As the model improves its understanding of the labeled instances and infers possible attack patterns from the unlabeled data, the EDRL can exploit the broader distribution of network traffic. Combining labeled and unlabeled data thus reduces the dependence on large labeled datasets and improves the model's ability to generalize and adapt to new, unseen, or sophisticated attack types in real-world network environments. This combination not only improves the detection of normal and attack classes but also supports efficiency and scalability in real-time applications.
3.2.1. Reinforcement Learning
RL differs from other ML algorithms in that it is a decision-making process that seeks the best return, or reward, by continually learning from the interaction between the agent and the environment. This interaction can be described by an MDP, which is characterized by a quintuple. In this quintuple, $T$ designates the state set of the environment, $B$ the action set that the agent is capable of performing, and a third element the set of state-action pairs that can be executed (a subset of $T \times B$). A transfer function, represented by $U$, gives the probability of transitioning to state $t'$ after executing action $b$ in state $t$. Finally, a reward function, designated by $S$, gives the expected reward produced when action $b$ is executed in state $t$ and the environment then transitions to state $t'$.
The interaction between the environment and the agent has two parts. In the first part, the agent perceives the current state $t_u$ of the external environment at time $u$ and then chooses a suitable action $b_u$ to execute according to the strategy $\pi$. In the second part, the environment responds to the agent by updating the state to $t_{u+1}$, produces a return or reward $s_{u+1}$, and then moves on to the subsequent MDP step. From time $u$ onward, the agent interacts with the environment to gather as many rewards as possible and continually learns to increase the expected return. This return is defined as follows:
$$G_u = s_{u+1} + \gamma s_{u+2} + \gamma^2 s_{u+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k s_{u+k+1}$$
where $\gamma$ is a discount factor and $\gamma \in [0, 1]$. The discount factor determines the contribution of future rewards to the expected return. The agent lacks foresight if $\gamma = 0$, since it only considers the immediate reward. On the other hand, the agent becomes more foresighted as $\gamma \to 1$, since it gives more and more weight to future rewards. The rewards received after time $u$ form a random sequence, characterized as $s_{u+1}, s_{u+2}, s_{u+3}, \ldots$.
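The effect of the discount factor on the return can be checked with a small numerical sketch. The function name `discounted_return`, the variable name `gamma` for the discount factor, and the reward sequence are illustrative assumptions, and the sequence is truncated to a finite length.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Rewards received after time u (1 = correct classification, 0 = incorrect).
rewards = [1, 0, 1]
myopic = discounted_return(rewards, 0.0)      # only the immediate reward counts
balanced = discounted_return(rewards, 0.5)    # 1 + 0.5*0 + 0.25*1
farsighted = discounted_return(rewards, 1.0)  # all rewards count equally
```

A discount factor of 0 keeps only the first reward, whereas a discount factor of 1 weights every future reward equally, which matches the short-sighted and far-sighted behaviors described above.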
RL connects the optimal criterion and the optimal policy through a value function, which is divided into a state-value function and a state-action-value function. The state-value function is the expected return attained by an agent that follows policy $\pi$ from state $t$. Its mathematical expression is as follows:
$$V^{\pi}(t) = \mathbb{E}_{\pi}\left[ G_u \mid t_u = t \right]$$
where $G_u$ denotes the discounted return from time $u$ onward, $t_u$ is the state at time $u$, and $\mathbb{E}_{\pi}$ indicates the expected value under strategy $\pi$.
Among all possible state-value functions, the one obtained under the optimal strategy is the optimal state-value function. The optimal state-value function and the optimal strategy are given by:
$$V^{*}(t) = \max_{\pi} V^{\pi}(t), \qquad \pi^{*} = \arg\max_{\pi} V^{\pi}(t)$$
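In the same way, the state-action-value function is the expected return of the agent executing action $b$ in state $t$ and following strategy $\pi$ thereafter. This function is represented as follows:
$$Q^{\pi}(t, b) = \mathbb{E}_{\pi}\left[ G_u \mid t_u = t,\ b_u = b \right]$$
where $G_u$ denotes the discounted return from time $u$ onward, and $t_u$ and $b_u$ are the state and action at time $u$. The optimal Q-value function, obtained under the optimal strategy, is given as follows:
$$Q^{*}(t, b) = \max_{\pi} Q^{\pi}(t, b)$$
The Bellman optimality equation expresses the optimal Q-value function recursively:
$$Q^{*}(t, b) = \mathbb{E}\left[ S + \gamma \max_{b'} Q^{*}(t', b') \mid t, b \right]$$
where $S$ represents the immediate expected reward attained by the agent after transitioning to state $t'$, $\gamma$ is the discount factor, and $\max_{b'} Q^{*}(t', b')$ is the largest Q-value available to the agent in the state $t'$ reached after executing action $b$ in state $t$.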
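The Bellman optimality equation is usually applied in a sampled form, in which a single observed transition moves the current Q-value toward the target "reward plus discounted best next Q-value". The sketch below shows this tabular update under stated assumptions: the learning rate `alpha`, the state names, and the Q-values are hypothetical and do not come from this work, which replaces the table with a function approximator.

```python
def bellman_backup(q, state, action, reward, next_state, gamma, alpha=1.0):
    """Move q[state][action] toward reward + gamma * max over next-state Q-values."""
    target = reward + gamma * max(q[next_state].values())
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]

# Toy Q-table over two states and the two actions "normal" and "attack".
q_table = {
    "t0": {"normal": 0.0, "attack": 0.0},
    "t1": {"normal": 0.5, "attack": 2.0},
}
# Executing "attack" in t0 yields reward 1 and leads to t1.
updated = bellman_backup(q_table, "t0", "attack", reward=1.0,
                         next_state="t1", gamma=0.9)  # target = 1 + 0.9 * 2.0
```

With `alpha=1.0` the old estimate is overwritten by the target, which mirrors the right-hand side of the Bellman optimality equation for one sampled transition.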
3.2.2. Enhanced Deep Reinforcement Learning
In this section, the proposed EDRL model for attack classification is discussed. The EDRL model builds on deep Q-network (DQN) theory and is an environment-free interaction model, in which the state of the succeeding phase is supplied directly by the dataset rather than by interaction with the environment. During each training iteration, a new training sample is drawn from the initial training set. It comprises the state $t_u$ at time $u$, the action $b_u$ at time $u$, the set of all actions $B$, and the state $t_{u+1}$ at the next time step. Initially, the set of actions $B$ and the state $t_u$ are input into the neural network at time $u$ to predict all possible Q-values $Q(t_u, b)$, $b \in B$. The maximum Q-value $\max_{b \in B} Q(t_u, b)$ is then selected by the policy algorithm $\pi$ for state $t_u$. The IFSA is further employed in choosing the actions. The architecture of the enhanced deep reinforcement learning model is shown in Figure 2.
The reward function consists of rewards for correct and incorrect predictions. The reward is computed by comparing the actual action value under state $t_u$ with the predicted action value $b_u$. The reward is 1 if the two are equal; otherwise, the reward is 0. The aim of the subsequent phase is to predict the Q-value function from the state $t_{u+1}$ at time $u+1$. Because the test samples in the dataset contain both normal and attack samples, an unsupervised learning technique is employed to predict their labels, which improves the ability of the model to identify attacks. The proposed method therefore first predicts the action to which each sample belongs by means of the deep autoencoder (AE) model. The model is initially trained on the labeled data to establish a baseline understanding of attack and normal classification. The deep AE processes the unlabeled data to construct feature representations, which are then used together with the labeled data to refine the Q-function and enhance policy selection through IFSA. As the model encounters new data during training, it can adaptively query for labels on instances and add the newly labeled data to the training set for repeated enhancement.
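The binary reward described above can be written directly in code. In this minimal sketch, the function names `reward` and `episode_rewards` and the string labels "normal" and "attack" are illustrative choices rather than identifiers from this work.

```python
def reward(predicted_action, actual_action):
    """Return 1 when the predicted class matches the actual class, otherwise 0."""
    return 1 if predicted_action == actual_action else 0

def episode_rewards(predicted, actual):
    """Reward for every (prediction, ground truth) pair in a batch of samples."""
    return [reward(p, a) for p, a in zip(predicted, actual)]

predicted = ["attack", "normal", "attack", "normal"]
actual = ["attack", "attack", "attack", "normal"]
rewards = episode_rewards(predicted, actual)
# The mean reward over a batch equals the classification accuracy on that batch.
mean_reward = sum(rewards) / len(rewards)
```

Because the reward is 1 only for a correct prediction, maximizing the accumulated reward amounts to maximizing the number of correctly classified samples.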
The flowchart of the semi-supervised workflow is provided in Figure 3. The process starts with raw network traffic data, which is separated into unlabeled and labeled datasets. Initially, the labeled data are employed to train the model, whereas the unlabeled data are processed through the deep AE to extract features. These extracted features are then used to approximate the Q-function, and IFSA performs optimal policy selection to classify traffic as attack or normal. The final outcome is the classification output, which can be further refined by continual retraining as new data become available.
Deep Autoencoder
Deep AE is a type of neural network architecture employed for dimensionality reduction and unsupervised learning. It is made up of an encoder and a decoder: the encoder compresses the input data into a lower-dimensional representation, called the latent space, and the decoder uses this representation to reconstruct the original input data [32]. Approximating Q-functions in an RL environment is one of the uses of deep AE. Q-functions in RL define the expected future reward of executing a specific action in a given state. In conventional RL approaches such as Q-learning, the Q-values for each state-action pair are stored in a Q-table. However, a Q-table becomes impractical for problems with continuous or very large state spaces owing to the vast number of possible states. In the proposed EDRL model, deep AE is used to approximate the Q-function efficiently in place of a Q-table. Deep AE is chosen mainly because it handles the high-dimensional and continuous nature of network traffic data, for which conventional Q-tables are infeasible owing to their inability to generalize and their exponential growth in size. Deep AEs are well suited to compressing complex input data into lower-dimensional representations, which helps in detecting new attacks and improves generalization and learning stability.
Deep AE helps lessen the overestimation bias frequently connected to Q-learning by offering more precise Q-value approximations depending on learned features instead of discrete values. Also, by facilitating fast learning through experience replay and breaking correlations in training data, the deep AE can be simply combined with other unsupervised approaches for improving intrusion detection ability without necessitating a lot of manual labeling. Q-tables, shallow neural networks, and linear function approximators are some other techniques used for estimating the Q-function. Owing to their size and lack of generalization, Q-tables are not viable for high-dimensional spaces. Shallow networks, as well as linear function approximators, have issues in capturing non-linear and complex relationships in data, which results in under-fitting. In contrast to other prevailing models, the deep AE is better for approximating the Q-function due to its overall scalability, flexibility, and robustness. Also, the feature extraction and representation learning ability of deep AE makes them more effective for Q-value estimation in complex environments such as attack detection. The state representation can be the input of deep AE. This can be a high-dimensional vector that resembles the current state of the environment. In the proposed method, the deep AE model combines multiple AE layers over one another with connections from one layer to the subsequent layer. In this connection, the output from a first encoder will be the input to a second encoder, where higher feature representations can be learned. The DAE model utilized in the proposed work for approximating the Q-function is displayed in
Figure 4.
An AE is a kind of unsupervised neural network that uses backpropagation to learn from unlabeled data. For an accurate reconstruction of the state in the decoder part, the model learns the following hypothesis function:
$$h_{w,c}(x) \approx x$$
where $w$ is the weight value, $c$ is the bias value, and $x$ is the input value of the model. The AE includes one encoder and one decoder, where the encoder is responsible for converting the input to a low-dimensional representation, and the decoder is responsible for reconstructing the input. The encoding process reduces the dimensionality of the data space by converting the input into a latent space representation. This can be achieved by imposing a constraint on the hidden neurons of the encoder model to compress the input. The constraint can be formulated as follows:
$$\hat{\rho}_j = \frac{1}{p} \sum_{i=1}^{p} a_j(x_i) \approx \rho$$
where $\hat{\rho}_j$ is the average activation of hidden neuron $j$, $\rho$ is the sparsity parameter set close to 0, $p$ is the number of training instances, and $a_j(x_i)$ is the activation of hidden neuron $j$ for the input $x_i$. If the activation value is 0, then the neuron is inactive; if the activation value is 1, then the neuron is considered active. The mean squared error (MSE) is chosen as the cost function of this network model, which can be mathematically expressed as follows:
$$E_{\mathrm{MSE}} = \frac{1}{p} \sum_{i=1}^{p} \left\| x_i - \hat{x}_i \right\|^2$$
where $\hat{x}_i = h_{w,c}(x_i)$ is the reconstruction of $x_i$. Along with this loss function, the L2 regularization or weight decay term is added to minimize overfitting. The L2 regularization term can be mathematically described as follows:
$$\Omega_{w} = \frac{1}{2} \sum_{l=1}^{L-1} \sum_{j=1}^{n} \sum_{i=1}^{k} \left( w_{ji}^{(l)} \right)^2$$
where $L$ specifies the total count of layers in the network, $n$ is the count of neurons in layer $l$, $k$ is the count of neurons in layer $l+1$, and $w_{ji}^{(l)}$ specifies the weight value. Sparsity regularization, or a penalty term, is added to the loss function using the Kullback–Leibler (KL) divergence. This adds a penalty to the values of $\hat{\rho}_j$ that diverge from the value of $\rho$. The penalty term can be described as follows:
$$\Omega_{s} = \sum_{j=1}^{m} \mathrm{KL}\left( \rho \,\|\, \hat{\rho}_j \right) = \sum_{j=1}^{m} \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \right]$$
where $m$ specifies the count of hidden neurons in the encoder. Therefore, by adding all the above terms, the overall loss function of the network model can be expressed as:
$$E = E_{\mathrm{MSE}} + \lambda\, \Omega_{w} + \beta\, \Omega_{s}$$
where $\lambda$ and $\beta$ are the terms that control the strength of the respective regularization terms. Moreover, the feature learning capability of a deep AE is potentially higher than that of a single AE, so the deep AE model is likely to result in an effective approximation of the Q-function. In addition, large or continuous state spaces in RL problems can be handled without a Q-table by using a deep AE to learn a compact representation of the state space and a separate neural network to approximate Q-values. This approach is useful in complex environments where conventional tabular approaches are impracticable.
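The way the three terms of the autoencoder loss combine can be illustrated numerically. The following is a sketch under stated assumptions, not the training code of this work: the function names, the default values of `rho`, `lam`, and `beta`, and the toy inputs are hypothetical, and hidden activations are assumed to lie strictly between 0 and 1 (for example, sigmoid outputs).

```python
import math

def kl_sparsity(rho, rho_hat):
    """Sum over hidden neurons of KL(rho || rho_hat_j) for Bernoulli activations."""
    return sum(rho * math.log(rho / r) + (1 - rho) * math.log((1 - rho) / (1 - r))
               for r in rho_hat)

def sparse_ae_loss(inputs, outputs, hidden, weights, rho=0.05, lam=1e-3, beta=1.0):
    """Reconstruction MSE + lam * L2 weight decay + beta * KL sparsity penalty."""
    p = len(inputs)
    # Mean squared reconstruction error over the p training instances.
    mse = sum(sum((x - y) ** 2 for x, y in zip(xi, yi))
              for xi, yi in zip(inputs, outputs)) / p
    # Half the sum of squared weights over all layers (weights[layer][row][col]).
    l2 = 0.5 * sum(w ** 2 for layer in weights for row in layer for w in row)
    # Average activation of each hidden neuron over the training instances.
    rho_hat = [sum(h[j] for h in hidden) / p for j in range(len(hidden[0]))]
    return mse + lam * l2 + beta * kl_sparsity(rho, rho_hat)

inputs = [[0.0, 1.0], [1.0, 0.0]]
# Perfect reconstruction, zero weights, and average activation equal to rho.
ideal = sparse_ae_loss(inputs, inputs, hidden=[[0.05], [0.05]],
                       weights=[[[0.0, 0.0]]])
# Same setting with non-zero weights: only the weight decay term contributes.
decayed = sparse_ae_loss(inputs, inputs, hidden=[[0.05], [0.05]],
                         weights=[[[1.0, 1.0]]])
```

The loss vanishes only when the reconstruction is exact, the weights are zero, and every hidden neuron has an average activation equal to the sparsity parameter; any departure in one of the three terms increases it.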
Improved Flamingo Search Algorithm (IFSA) for Optimal Policy Selection
Optimal policy selection is important in RL because it allows an agent to maximize its long-term reward by making the best decisions in an environment. Numerous algorithms have been designed for this purpose. However, existing policy selection algorithms have limitations such as high computational cost, which can make them unsuitable for real-time applications. Many also struggle with the trade-off between exploitation and exploration, which can lead to sub-optimal policies when the state space has not been sufficiently explored. In addition, existing algorithms can suffer from instability and convergence problems, particularly in continuous or high-dimensional state spaces. The flamingo search algorithm (FSA) [33,34] is a swarm intelligence optimization method that draws inspiration from the migration and feeding habits of flamingos. It uses the collective intelligence of flamingo populations, whose members exchange information about their positions and food availability, which lets the group explore the search area efficiently. One of the main advantages of FSA is its capability to balance global exploration and local exploitation, which improves convergence speed and search accuracy. The algorithm has been reported to be effective and robust on complicated optimization problems, outperforming alternative optimization methods over a range of test functions.
In addition, the adaptability of FSA demonstrates its practicality in real-world applications by making it appropriate for a variety of applications comprising engineering problems like network intrusion detection systems and path planning. However, sometimes the FSA has to local optima issues. The inclusion of chaotic tend mapping [
35] at the FSA initialization can help address the local optima problem. By employing chaos in the initial positioning of the flamingo population, the method maximizes diversity and exploration ability. This enables it to avoid local optima and increases the chances of discovering the global optimal solution. Therefore, in the proposed EDRL approach, the tent chaotic mapping is combined with FSA for optimal policy selection, and the result is termed the improved flamingo search algorithm (IFSA). To accomplish the task, the environment, action, reward, and state are initialized as the initial population, and then the search procedure is carried out. In FSA, the population members are initialized randomly according to the lower and upper bounds of the problem.
$$x_{j,k} = lb_k + \mathrm{Rand} \cdot (ub_k - lb_k), \quad j = 1, \ldots, O, \quad k = 1, \ldots, n$$
where $x_{j,k}$ represents the value of the kth variable specified by the jth candidate solution, n is the number of problem variables, O indicates the number of members of the population, $ub_k$ signifies the kth upper bound, $lb_k$ defines the kth lower bound, and Rand is a random number in the range $[0, 1]$.
In the IFSA, a chaotic tent map replaces the random number generator used in the FSA to initialize the population. The above equation is adapted as follows:
$$x_{j,k} = lb_k + z_l \cdot (ub_k - lb_k)$$
where $z_l$ indicates the outcome of the chaotic sequence at the lth iteration. During this phase, the positions of the candidates are initialized using the chaotic map, which improves the global search performance of the FSA. This adaptation helps overcome the local optimum problem in FSA.
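The tent-map initialization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the seed value `z0`, and the tent-map parameter `mu = 1.99` are assumptions (a value slightly below 2 is chosen here because the map with parameter exactly 2 tends to collapse to zero in floating-point arithmetic).

```python
def tent_sequence(length, z0=0.37, mu=1.99):
    """Generate `length` values of a tent-map chaotic sequence in [0, 1].

    z0 and mu are illustrative assumptions, not values from the paper.
    """
    seq = []
    z = z0
    for _ in range(length):
        # Tent map: rising branch below 0.5, falling branch above it.
        z = mu * z if z < 0.5 else mu * (1.0 - z)
        seq.append(z)
    return seq


def tent_init(pop_size, lower, upper, z0=0.37):
    """Place each candidate inside [lower_k, upper_k] using chaotic values
    in place of uniform random numbers."""
    n = len(lower)
    seq = tent_sequence(pop_size * n, z0)
    return [
        [lower[k] + seq[j * n + k] * (upper[k] - lower[k]) for k in range(n)]
        for j in range(pop_size)
    ]


# Example: 50 candidates in a 2-dimensional search space.
population = tent_init(50, [-1.0, 0.0], [1.0, 10.0])
```

Because the chaotic sequence is deterministic for a given `z0`, the initial population is reproducible, while consecutive values still spread across the whole interval.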
The foraging model of the algorithm includes three main behaviors: communicative, beak scanning, and bipedal mobile behaviors. Here, maximum accuracy is used as the fitness for policy selection. A solution is identified as optimal if its accuracy is the highest among the existing solutions. The mathematical formulation of the fitness for policy selection is as follows:
$$F = \max(\mathrm{Accuracy})$$
where F indicates the fitness value.
The FSA algorithm follows different strategies to identify the place with more food and also leads the group to that place to attain optimal solutions. In the proposed work, the same behavior is mimicked, and the fitness is evaluated in each iteration of the FSA to identify the optimal policy.
Foraging Model
Communicative behavior: The birds that have found the most food share their location information with the other birds in the group. This behavior strongly influences how the other birds in the group change their positions in the search space. The birds generally do not know in advance which area of the search space holds the most food, so the algorithm simulates birds that try to find the optimal solution with the limited information available. This behavior is simulated in the proposed EDRL, where the search agent tries to find the most optimal value in the search space through fitness evaluations. Here, the flamingo with the most food in the population in the jth dimension is denoted as $xb_j$.
Beak scanning behavior: When the birds forage in the water, they dip their heads downward, turn their mouths upside down, and eat food by discharging excess water and inedible dregs. The abundance of food in an area influences this kind of foraging technique. This behavior is simulated in the work where the mean values that are close to the attack are found to be abundant in a particular position in the search space. While this behavior is complex to implement, it can yield better solutions, as the most optimal rule can be determined by comparing fitness values. When the foraging area chosen by the flamingo is abundant in food, the flamingo carefully searches the area by increasing its scanning radius with the help of its neck and beak. Consider the ith flamingo in the population in the jth dimension, indicated as $x_{ij}$, and let $xb_j$ be the jth dimension of the flamingo with the most food. An error arises from the variability in the flamingo’s foraging choices, and this error can be simulated using a normal random distribution. Here, the bird’s beak scan has the highest probability of being aligned with the direction of a location with abundant food. The maximum distance of the beak scan of a bird can be given as $|G_1 \times xb_j + \varepsilon_2 \times x_{ij}|$, where $G_1$ is a random number under a normal distribution, and $\varepsilon_2$ is a random number between −1 and 1. The scanning range of the bird can be simulated as $G_2 \times |G_1 \times xb_j + \varepsilon_2 \times x_{ij}|$, where $G_2$ is a random number obeying a normal distribution.
Bipedal mobile behavior: When these birds forage following the scanning behavior, their claws move towards the area abundant with food. Based on this location, the distance traveled by these birds can be quantified as
$$\varepsilon_1 \times xb_j$$
where $\varepsilon_1$ is a random number between −1 and 1, which is mainly utilized to enhance the search range of the flamingos. This behavior is simulated in the proposed work for policy selection, where the search range of the search agents is enhanced. The distance traveled by the search agents in the search space is quantified to figure out the most optimal value. Also, this behavior helps maximize the coverage of the search space, thereby leading to better solutions. In the tth iteration, the movement step of a bird in the foraging procedure can be modeled as follows:
$$b_{ij}^{t} = \varepsilon_1 \times xb_j^{t} + G_2 \times \left| G_1 \times xb_j^{t} + \varepsilon_2 \times x_{ij}^{t} \right|$$
The location update of a bird during the foraging process can be modeled using the following formulation:
$$x_{ij}^{t+1} = \frac{x_{ij}^{t} + \varepsilon_1 \times xb_j^{t} + G_2 \times \left| G_1 \times xb_j^{t} + \varepsilon_2 \times x_{ij}^{t} \right|}{K}$$
where $x_{ij}^{t+1}$ indicates the position of the ith bird in the jth dimension at the (t+1)th iteration, $x_{ij}^{t}$ is the position of the ith bird in the jth dimension at the tth iteration, $xb_j^{t}$ indicates the jth dimension of the bird with the best fitness value at the tth iteration, $K$ is a diffusion factor following the chi-square distribution with n degrees of freedom, $G_1$ and $G_2$ are random numbers under a normal distribution, and $\varepsilon_2$, like $\varepsilon_1$, is a random number between −1 and 1.
Migrating Model
This model specifies the migration behavior of birds towards the most food-rich area when food is scarce in the current region. Similar behavior is followed in the proposed work, where the search agent travels toward the region with higher fitness values. This migratory behavior of FSA helps in the identification of better solutions. When several candidate solutions are identified, their fitness values are compared to select the global optimal solution. The migration behavior of the flamingos can be modeled mathematically as follows:
$$x_{ij}^{t+1} = x_{ij}^{t} + \omega \times \left( xb_j^{t} - x_{ij}^{t} \right)$$
where $x_{ij}^{t}$ is the position of the ith bird in the jth dimension at the tth iteration, $xb_j^{t}$ is the jth dimension of the bird with the best fitness value at the tth iteration, and $\omega$ indicates the Gaussian random number with n degrees of freedom, which is used to enlarge the search space during migration. In this way, the FSA is adapted to the optimal policy selection phase of the proposed framework to generate possible solutions for attack classification. In each iteration, the algorithm evaluates the fitness of each solution and finds the best solution in the search space. In the proposed work, this algorithm is executed to find the optimal policy for attack classification. Initially, the population of the FSA is initialized, and based on fitness, the best values are determined as output. The pseudocode of the IFSA for optimal policy selection is provided in Algorithm 1.
Algorithm 1. Pseudocode of the IFSA for optimal policy selection
Phase 1: Initialization
    Define fitness for each solution
Phase 2: Iterations
    Step 1: For t = 1 to maximum iterations do
        Consider the communicative behavior of FSA to identify the current optimal solution
    Step 2: Determine the scanning range of FSA using the beak scanning behavior
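The overall IFSA loop (tent-map initialization, foraging update, and migration update) can be sketched in Python as below. This is a simplified illustration under several stated assumptions rather than the authors' implementation: the share of migrating flamingos (`migrate_frac`), the choice to migrate the lowest-ranked individuals, the clipping of positions to the bounds, the tent-map parameter 1.99, the reading of the "Gaussian random number with n degrees of freedom" as a zero-mean Gaussian with variance n, and the toy fitness function are all hypothetical. In the paper, the fitness is the classification accuracy.

```python
import math
import random


def tent_init(pop_size, lower, upper, z=0.37, mu=1.99):
    """Tent-map initialization inside the bounds (z and mu are assumptions)."""
    pop = []
    for _ in range(pop_size):
        row = []
        for lo, hi in zip(lower, upper):
            z = mu * z if z < 0.5 else mu * (1.0 - z)
            row.append(lo + z * (hi - lo))
        pop.append(row)
    return pop


def ifsa(fitness, lower, upper, pop_size=50, max_iter=100,
         migrate_frac=0.1, seed=0):
    """Maximize `fitness` with a simplified IFSA-style search."""
    rng = random.Random(seed)
    n = len(lower)
    pop = tent_init(pop_size, lower, upper)
    best = list(max(pop, key=fitness))      # communicative behavior: best bird
    best_fit = fitness(best)
    n_migrate = int(migrate_frac * pop_size)

    for _ in range(max_iter):
        pop.sort(key=fitness)               # ascending, worst birds first
        for i, x in enumerate(pop):
            for j in range(n):
                if i < n_migrate:
                    # Migration: move relative to the best bird.
                    omega = rng.gauss(0.0, math.sqrt(n))  # assumed N(0, n)
                    new = x[j] + omega * (best[j] - x[j])
                else:
                    # Foraging: beak scanning plus bipedal movement.
                    g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
                    e1, e2 = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
                    # Chi-square(n) diffusion factor, guarded against zero.
                    k = max(rng.gammavariate(n / 2.0, 2.0), 1e-12)
                    scan = g2 * abs(g1 * best[j] + e2 * x[j])
                    new = (x[j] + e1 * best[j] + scan) / k
                x[j] = min(max(new, lower[j]), upper[j])  # keep in bounds
        cand = max(pop, key=fitness)
        cand_fit = fitness(cand)
        if cand_fit > best_fit:             # keep the best solution so far
            best, best_fit = list(cand), cand_fit
    return best, best_fit


def toy_fitness(x):
    """Hypothetical stand-in for accuracy: maximal at the origin."""
    return -sum(v * v for v in x)


best, best_fit = ifsa(toy_fitness, [-5.0, -5.0], [5.0, 5.0],
                      pop_size=20, max_iter=30)
```

The sketch keeps the best solution found so far in `best`, so the returned fitness can never be worse than that of the best initial candidate.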
3.2.3. Attack Classification Using SVM
After the approximation of the Q-function, the proposed EDRL model has employed the SVM [
36] classification model to classify attacks, owing to its efficiency in managing high-dimensional data and its capability to find the best hyperplane for separating distinct classes. SVM is an effective supervised machine learning model designed for binary classification. SVMs are notably robust against overfitting, particularly in cases where the number of features exceeds the number of samples, which is common in network traffic data. In addition, SVMs can handle non-linear boundaries through the use of kernel functions, making them adaptable to complex classification tasks. The existing models in intrusion detection systems for attack classification include random forests (RFs), decision trees (DTs), artificial neural networks (ANNs), k-nearest neighbors (KNNs), etc. ANNs necessitate large datasets and careful tuning. RF is more robust but can be harder to interpret, while DT is prone to overfitting and instability. KNN is sensitive to irrelevant features and computationally costly. Because these algorithms are more vulnerable to instability, overfitting, and difficulties in handling high-dimensional or unbalanced data, they commonly perform worse than SVMs, which excel in generalization and robustness in these situations. The proposed EDRL discriminates between the feature spaces, which are labeled as normal or attack accordingly. SVM tries to identify the optimal decision boundary, for which the margin between the boundary and the nearest data instances is maximized. The data instances that define the maximum margin are known as support vectors. The SVM model is displayed in
Figure 5.
Consider the input training instances of the model as $\{(x_i, y_i)\}_{i=1}^{N}$, where $y_i \in \{-1, +1\}$ is the true class of the input $x_i$. The decision boundary for the given data can then be represented as follows:
$$w^{\top} x + b = 0$$
where w is the weight vector and b is the bias. Constraints are also imposed in the model to prevent data instances from falling on the incorrect side, and these can be expressed as follows:
$$w^{\top} x_i + b \geq +1 \quad \text{if } y_i = +1$$
$$w^{\top} x_i + b \leq -1 \quad \text{if } y_i = -1$$
These two constraints can be combined into a single constraint as follows:
$$y_i \left( w^{\top} x_i + b \right) \geq 1$$
A well-known limitation of an SVM model is that its training time is high when the input size is large. This limitation is reduced in the proposed work by using a linear SVM with hinge loss as its loss function. This function can be defined as follows:
$$\ell\left(y_i, f(x_i)\right) = \max\left(0,\ 1 - y_i f(x_i)\right), \quad f(x_i) = w^{\top} x_i + b$$
where the target output $y_i$ is ±1. When the prediction made by $f(x_i)$ is correct, $y_i$ and $f(x_i)$ have the same sign, and if in addition $|f(x_i)| \geq 1$, the loss value is 0. The loss value increases linearly when $y_i$ and $f(x_i)$ have opposite signs, and also when $|f(x_i)| < 1$ even though the signs agree. In this way, the hinge loss penalizes incorrect classifications as well as correct predictions that fall within the margin $|f(x_i)| < 1$.
The objective function for minimization is formed from the loss function and a regularization term. Since the considered loss function is convex, convex optimizers can be adopted to minimize this function. The objective function can be mathematically expressed as follows:
$$\min_{w, b}\ \frac{\lambda}{2} \lVert w \rVert^{2} + \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\ 1 - y_i \left( w^{\top} x_i + b \right)\right)$$
where $\lambda$ controls the strength of the regularization term.
The above loss function is minimized iteratively using gradient descent, and the overall classification accuracy is maximized. Moreover, the proposed attack classification phase is binary, resulting in either a known or unknown attack as the output.
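As an illustration of training a linear SVM by gradient descent on the hinge loss, the following sketch uses plain Python. It is not the authors' implementation: the L2-regularized objective (regularization weight `lam`), the learning rate, the number of epochs, and the toy data are all assumptions made for the example.

```python
def hinge(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw decision score."""
    return max(0.0, 1.0 - y * score)


def decision(w, b, x):
    """Raw decision value w.x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Batch (sub)gradient descent on
    (lam / 2) * ||w||^2 + mean hinge loss  (assumed objective form)."""
    n_samples, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]      # gradient of the L2 term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * decision(w, b, xi) < 1.0:  # inside margin or misclassified
                for j in range(n_features):
                    grad_w[j] -= yi * xi[j] / n_samples
                grad_b -= yi / n_samples
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    return 1 if decision(w, b, x) >= 0.0 else -1


# Toy, linearly separable data (hypothetical): +1 = attack, -1 = normal.
X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
     [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

Only samples with a margin below 1 contribute to the gradient, which is what makes the hinge loss ignore points that are already classified correctly with sufficient margin.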
4. Results and Discussion
Several experiments are carried out to demonstrate the performance of the proposed EDRL model compared to other existing frameworks. The simulation scenario, the metrics utilized, and the performance outcomes obtained are detailed in the following sections.
4.1. Simulation Scenario
The proposed EDRL model includes different sequential phases to achieve classification and policy generation. Initially, pre-processing is performed to prepare the data for further analysis. The enhancement of DQL is carried out based on deep AE and IFSA. Three primary parts make up every DRL model, namely actions, states, and Q-values. In the proposed EDRL model, the state is represented by the pre-processed data. When it comes to labeling, an action corresponds to assigning a label to the data, such as deciding whether an instance is an attack or not. In this setting, the Q-value encodes the probability associated with the label. Training an AE with labeled data is the first step in the labeling process. This AE is then used to implement the Q-network. Within the Q-network, the role of the AE is to predict the Q-values (under supervision), updating its weights and biases depending on the features that are extracted. This update process allows the network to label unknown data (in an unsupervised manner). In essence, the AE assists the Q-network in identifying and predicting labels for new data points by learning from the labeled data on which it was initially trained. A crucial Q-learning parameter that affects the trade-off between exploration and exploitation is optimally updated throughout the learning process using IFSA. Optimizing this parameter helps improve the performance and learning efficiency of the proposed EDRL model. The complete implementation of the work is done in Python, and the system specification is presented in
Table 1. The system is equipped with an Intel® Core™ i5-3470 CPU operating at a clock speed of 3.20 GHz, featuring 4 cores and 4 logical processors, which makes it appropriate for handling moderate computational tasks effectively. It has 8 GB of RAM, with 7.85 GB usable for running resource-intensive applications such as the IDS simulation. The system runs on Microsoft Windows 10 Pro (version 10.0.19044, Build 19044), a 64-bit operating system that supports modern software and ensures compatibility with different tools. The processor architecture is x64-based, which optimizes performance for applications designed for 64-bit systems.
Several hyperparameters are tuned initially to perform the implementations in the proposed work as different algorithms are utilized. The hyperparameter tuning of the proposed work is presented in
Table 2. For efficiently approximating the Q-function with the deep AE, four encoder and decoder layers are used as an appropriate hyperparameter configuration. This permits the capturing of complex feature representations while ensuring precise reconstruction. The batch size is set to 64 to balance training stability and efficiency, and the model is trained for 100 epochs to permit adequate learning without overfitting. The ReLU activation function in the hidden layers encourages the learning of non-linear correlations, while a linear activation function in the output layer is suitable for predicting continuous Q-values. By efficiently approximating the Q-function, this configuration permits the AE to acquire significant representations of the state-action space, enhancing performance and generalization in reinforcement learning tasks. When the data are linearly separable, it is better to use a linear kernel function in SVM for attack classification. This lowers computational complexity and simplifies the model, resulting in faster inference and training while preserving interpretability and efficacy in differentiating between normal and attack classes. Also, the gamma parameter is set to 0.25, and it helps shape the decision boundary of the SVM. For the FSA, the initial population is set to 50, and the maximum number of iterations is 100. Setting the initial population to 50 yields a diversified representation of potential solutions, which is indispensable for efficient search space exploration. Typically, this size is sufficient for capturing various parts of the solution landscape without overwhelming computational resources. This enables the algorithm to improve its search and enhance the quality of the solution without excess computing expense.
4.2. Dataset Description
The evaluations of the framework are performed using the UNSW-NB15 dataset [
37,
38,
39]. This dataset is a very popular IoT-based intrusion detection dataset that mainly covers the features of diverse network attacks. It is a public source dataset and can be downloaded using the following URL [
https://www.kaggle.com/datasets/mrwellsdavid/unsw-nb15, accessed on 7 March 2023]. It was introduced by the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) in 2015 and is often used in the research community. This work evaluates the framework using the raw network packets generated with the IXIA PerfectStorm tool. The dataset includes nine attack categories (analysis, backdoor, DoS, exploits, fuzzers, generic, reconnaissance, shellcode, and worms) in addition to normal, benign traffic. Forty-nine network traffic features are extracted along with their class labels using the Argus and Bro-IDS programs with the assistance of 12 developed algorithms. All the extracted features are described in a separate .csv file named UNSW-NB15_features.csv.
This dataset includes 2,540,044 streams, of which 2,218,761 are benign streams and 321,283 are attack streams. All these data streams are placed in four separate .csv files named UNSWNB15_1.csv, UNSWNB15_2.csv, UNSWNB15_3.csv and UNSWNB15_4.csv. The ground truth table of this dataset is presented in the UNSWNB15_GT.csv file, and an event list of this dataset is presented in the UNSWNB15_LIST_EVENTS.csv file. The major attack features needed in the implementation phase are divided and placed in two separate training and testing files named UNSWNB15_trainingset.csv and UNSWNB15_testingset.csv. The training set of this dataset includes 175,341 data records, and the testing set includes 82,332 data records covering both normal and attack features. The dataset distribution based on different attack categories is presented in
Table 3.
4.3. Performance Metrics
Various performance metrics are utilized in this work to evaluate the performance of the proposed framework. These metrics include precision, recall, f-measure, accuracy, false positive rate (FPR), kappa score, and Matthews correlation coefficient (MCC). The mathematical formulations are as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \times P \times R}{P + R}$$
$$A = \frac{TP + TN}{TP + TN + FP + FN}$$
$$FPR = \frac{FP}{FP + TN}$$
where P indicates precision, R indicates recall, $F_1$ is the f-measure, A indicates accuracy, $FPR$ indicates the false positive rate, $TP$ is the number of true positives, $TN$ is the number of true negatives, $FP$ is the number of false positives, and $FN$ is the number of false negatives.
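The metrics above can be computed directly from the four confusion-matrix counts, as in the following sketch. The function name and the example counts are hypothetical. The kappa score and MCC are written in their standard binary-classification forms, which is an assumption about how they are computed in this work. The sketch does not guard against empty classes (zero denominators).

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute standard binary classification metrics from raw counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / total
    fpr = fp / (fp + tn)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (p_observed - p_expected) / (1 - p_expected)

    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "precision": precision, "recall": recall, "f1": f1,
        "accuracy": accuracy, "fpr": fpr, "kappa": kappa, "mcc": mcc,
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=40, tn=50, fp=5, fn=5)
```

Working from the raw counts rather than from rounded percentages avoids compounding rounding errors when several metrics are reported together.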
4.4. Baseline Methods
In order to assess the performance of the proposed EDRL model, existing methods such as random forest (RF), CNN, GRU, and DQN are implemented and simulated to make a fair comparison. The selected models represent a diverse set of machine learning and deep learning paradigms, each with distinct strengths in handling network data. The parameters utilized in the comparison experiment follow the settings stated in the original publications of the respective methods, to keep the comparison fair. These existing methods are summarized below:
RF: The RF classifier is an ensemble method that trains several decision trees in parallel using bootstrapping and aggregation (bagging). Several independent decision trees can be built simultaneously on various segments of the training samples by using different subsets of the available features. The variance of the RF is reduced since each decision tree within the random forest is distinct. The RF classifier generalizes well since it integrates many decision trees to reach its final decision. Also, it offers interpretability and robustness and is particularly well suited to structured network data.
CNN: Learning appropriate feature representations of the input data is the aim of CNN, a neural network variant. The two primary differences between CNNs and MLPs are weight sharing and pooling. Numerous convolution kernels, which are used to create various feature maps, can make up each layer of a CNN. A neuron in the feature map of the next layer is related to each region of nearby neurons. Moreover, the kernel is shared by all input spatial locations in order to create the feature map. One or more fully connected layers are utilized for classification after a few convolution and pooling layers. CNNs are useful for network traffic representations owing to their exceptional capability to capture spatial and hierarchical patterns.
GRU: The GRU is a memory cell that has been shown to work well in a range of applications. A GRU can perform similarly to an LSTM and can be seen as a simplification of the LSTM. The GRU has fewer gates than the LSTM because it lacks a cell state and merges the input and forget gates into an update gate. Therefore, the GRU has an advantage over the LSTM in terms of performance and convergence because it is structurally simpler and uses fewer parameters.
DQN: DQN is a multi-layered neural network that generates a vector of action values for a given state. The use of a target network and experience replay are two crucial components of the DQN algorithm. The target network, with parameters $\theta^{-}$, is identical to the online network except that its parameters are copied from the online network periodically (every $\tau$ steps) rather than being updated at each step. For experience replay, observed transitions are stored for some time in a memory bank and sampled uniformly from it in order to update the network. Both the target network and experience replay significantly enhance the algorithm’s performance.
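The two DQN components described above, experience replay and a periodically synchronized target network, can be sketched as follows. This is a minimal, framework-free illustration: the class and function names, the dictionary representation of network parameters, the buffer capacity, and the synchronization period `tau` are assumptions, not details taken from the compared DQN baseline.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=0):
        self.memory = deque(maxlen=capacity)  # oldest transitions drop out
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between updates.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def td_target(reward, next_q_values, done, gamma=0.99):
    """Bootstrapped target computed from the target network's Q-values."""
    return reward if done else reward + gamma * max(next_q_values)


def sync_target(online_params, target_params, step, tau):
    """Copy the online parameters into the target network every tau steps."""
    if step % tau == 0:
        target_params.clear()
        target_params.update(online_params)


# Hypothetical usage with toy transitions and scalar "parameters".
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(2)

online, target = {"w": 1.0}, {"w": 0.0}
sync_target(online, target, step=10, tau=10)
```

Keeping the target parameters fixed between synchronizations stabilizes the bootstrapped targets while the online network is being updated.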
4.5. Performance Analysis
The detailed analysis and the comparative outcomes obtained are detailed in this section. Different methods are also considered in the evaluations to prove the performance outcomes of the proposed approach. The methods considered for comparison include random forest (RF), convolutional neural network (CNN), gated recurrent unit (GRU), and classical deep Q-learning network (DQN). A more detailed analysis of the outcomes is presented below:
4.5.1. Confusion Matrix
A confusion matrix is a tabular representation used to assess the efficacy of a classification model, especially in terms of differentiating between normal and attack samples. In the context of IDS, the confusion matrix groups predictions into four sets: true negatives (TNs), true positives (TPs), false negatives (FNs), and false positives (FPs). TPs are samples that are correctly labeled as attacks, whereas TNs are samples that are correctly classified as normal. On the other hand, FNs occur if attacks are erroneously classified as normal, and FPs occur if normal instances are incorrectly labeled as attacks. Certain attack types, including zero-day exploits or advanced persistent threats (APTs), might cause errors in the confusion matrix because they show subtle patterns or have less visibility in network data, making them more difficult to detect. Also, misclassifications may result from attacks that closely resemble normal activity, like specific kinds of denial-of-service (DoS) attacks.
Figure 6 presents the confusion matrix of the proposed EDRL model for the classification phase. The proposed classification framework classifies the features into normal and attack instances based on the enhancement by DAE and IFSA. The confusion matrix shows that very few samples are misclassified and that the overall classification accuracy of the approach is high. Among the samples, 36,562 are accurately classified as normal, shown in blue in the figure, while 164 samples are misclassified. In the case of attack samples, 50,806 samples are accurately classified as attacks, shown in navy blue in the figure, and the remaining 138 samples are misclassified as normal instances. Overall, the misclassification rate is very low, and the classification performance is effective.
4.5.2. Analysis in Terms of Different Evaluation Metrics
When assessing the classification ability of an IDS, the various assessment metrics such as accuracy, precision, recall, kappa, FPR, and f-measure offer a thorough understanding of the model’s efficacy for differentiating between attack and normal instances. The performance of the IDS can be understood, and well-informed decisions about its deployment and refining can be made to improve network security by comprehensively examining these evaluation metrics.
Precision is considered one of the significant metrics since it provides information about the model’s ability to prevent false alarms by determining the percentage of accurately recognized attacks among all samples chosen as attacks.
Figure 7 presents the graphical representation of precision results for proposed and existing methods. The graph shows that the proposed approach is more effective in classification than the existing methods. The overall precision value of the proposed approach is 99.4%, whereas the precision values of RF, CNN, GRU, and DQN are 97.2%, 97.4%, 97.6%, and 98.2%, respectively.
Figure 8 presents a graphical comparison of the recall of the proposed and existing works. The graph shows that the proposed approach yielded a higher recall value than the other works. The proposed approach utilized better and more informative representations by using the DAE to approximate the Q-function and optimal policy selection for better classification. The DAE model effectively represents features without causing any loss of information. This specific advantage of the model helped the SVM classifier produce stable and effective outcomes. The overall recall value of the proposed approach is 99.1%, and the recall values scored by RF, CNN, GRU, and DQN are 97.3%, 97.5%, 97.6%, and 98.3%, respectively.
The f1-score comparison of proposed and existing works is graphically plotted in
Figure 9. The figure shows that the proposed approach results in a better f-measure value than the other methods. Since the precision and recall values of the proposed method are high, the f-measure of the proposed method is also high, indicating better classification. The overall f-measure value of the proposed method is 99.3%, whereas the f-measure values of RF, CNN, GRU, and DQN are 97.1%, 97.2%, 97.4%, and 98.1%, respectively.
Accuracy is considered a major parameter for determining the accurate classification of normal and attack samples.
Figure 10 presents the accuracy comparison of proposed and existing works regarding attack classification. The comparison shows that the proposed method’s accuracy is high compared to other works. A significant reason for the high accuracy is the use of a DAE model and IFSA to enhance the RL model and improve classification performance. The IFSA’s optimal policy selection effectively supports the SVM classifier in providing better classification results. Among the compared methods, the EDRL model achieved the highest accuracy score, and the RF classifier achieved the lowest. The overall accuracy value attained by the proposed model is 99.6%, whereas the accuracy values attained by RF, CNN, GRU, and DQN are 97.4%, 97.7%, 97.8%, and 98.5%, respectively.
The FPR comparison of the proposed and existing works is presented in
Figure 11. The figure shows that the FPR of the proposed work is lower than that of the other works. This indicates that the misclassifications produced by the proposed method are very low, resulting in effective classification. Among the compared methods, the FPR of the RF model is higher than that of the other three baseline models. The overall FPR value of the proposed approach is 0.003%, whereas the FPR values of the compared methods RF, CNN, GRU, and DQN are 0.028%, 0.026%, 0.021%, and 0.015%, respectively. Even though the model shows a low FPR, real-world deployments of such a system may require even stricter FPR thresholds in some use cases, particularly in sensitive environments like healthcare systems, financial institutions, or critical infrastructure, where false alarms could result in major resource wastage or operational disruptions. Combining the IFSA with a deep AE for approximating the Q-function and optimizing policy selection provides a more sophisticated way to manage the FPR. The likelihood of misclassifying normal traffic as attacks is minimized by using the deep AE to help the model learn typical traffic patterns and differentiate them from anomalies. Furthermore, the decision thresholds can be altered using the IFSA in accordance with the specific needs of the deployment environment. Owing to this flexibility, practitioners can establish more stringent thresholds if needed, since the model can be customized to adjust its sensitivity to false positives.
The comparison of the kappa score of the proposed and existing works is presented in
Figure 12. The kappa score is a useful metric for evaluating the agreement between the actual and predicted classifications when comparing the performance of the proposed EDRL model and existing solutions for classifying attack and normal instances in IDS. A higher kappa score indicates better agreement between actual and predicted classifications and therefore improved detection ability for both attack and normal instances, whereas a lower kappa value would indicate shortcomings of a model. In the graphical depiction, it is observed that the proposed EDRL model has achieved a maximum kappa value of 99.2%, which is higher than that of the existing methods. The MCC values obtained by the proposed EDRL and existing methods for attack detection are shown in
Figure 13. Compared to the existing methods, the proposed EDRL achieved a higher MCC value of 99.21%. The EDRL has successfully predicted the class labels based on the input parametric values, demonstrating its advantage over the existing methods. RF, CNN, GRU, and DQN achieved MCC values of 93.21%, 94.57%, 95.12%, and 96.43%, respectively.
The performance comparison of the proposed approach in terms of different metrics is presented in
Table 4. From the performance outcomes, it can be concluded that the proposed approach performs better than the other methods. The results are reported for the classification performed by the proposed EDRL approach. The overall simulation outcomes indicate that the proposed model is more stable and effective than the other existing frameworks.
4.5.3. Analysis of ROC
The proposed method uses the SVM classifier to label the input obtained from the EDRL model. The classification performance is enhanced using the feature representations obtained from the DAE model.
The receiver operating characteristic (ROC) curve of the proposed approach is presented in
Figure 14. From the curve, it can be observed that the true positive rate (TPR) offered by the proposed approach is considerably higher than its FPR. The area under the curve (AUC) value of the proposed approach is 0.997, which indicates that the proposed approach provides accurate classification results with a minimal FPR. When the FPR is low, the proposed approach produces very few false classifications and better accuracy. The proposed work uses the SVM model to classify normal and attack instances, and this model resulted in a high TPR during the entire classification phase.
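For reference, an ROC curve and its AUC can be computed from classifier scores as sketched below. This is a generic illustration of the technique, not the authors' evaluation code: the function names and the toy labels and scores are hypothetical, and the AUC is obtained with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR) obtained by sweeping the decision
    threshold from the highest score to the lowest.

    labels: 1 for attack (positive), 0 for normal (negative).
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    prev_score = None
    for score, label in pairs:
        # Emit a point each time the threshold moves to a new score value.
        if prev_score is not None and score != prev_score:
            points.append((fp / n_neg, tp / n_pos))
        if label == 1:
            tp += 1
        else:
            fp += 1
        prev_score = score
    points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2.0
    return area


# Toy example (hypothetical labels and decision scores).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(labels, scores))
```

For an SVM, the raw decision values (signed distances to the hyperplane) can serve as the scores, since only their ordering matters for the ROC curve.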
4.5.4. K-Fold Cross Validation
The suggested EDRL approach applies the K-fold cross-validation (K = 30) technique and analyzes performance metrics, such as accuracy for attack classification. The analysis of K-fold cross-validation is provided in
Table 5. A 30-fold cross-validation creates 30 random partitions (folds) of the data for training, validation, and testing. The EDRL approach is iterated thirty times, and in each iteration it is trained, validated, and tested on the employed dataset. Furthermore, the classifier is trained anew for every fold. Cross-validation involves splitting the input dataset into 30 folds. In the proposed IDS, the suggested EDRL approach outperforms the current approaches under 30-fold cross-validation. The performance differs from fold to fold, depending on the type of approach.
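The fold construction used in K-fold cross-validation can be sketched as follows. This is a minimal illustration: the function names, the random seed, and the `evaluate` callback (a stand-in for training the model on the training indices and scoring it on the held-out indices) are hypothetical and not taken from the paper.

```python
import random


def k_fold_splits(n_samples, k=30, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # random fold assignment
    folds = [indices[i::k] for i in range(k)]   # k nearly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validate(evaluate, n_samples, k=30):
    """Average the score returned by evaluate(train_idx, test_idx)."""
    scores = [evaluate(tr, te) for tr, te in k_fold_splits(n_samples, k)]
    return sum(scores) / len(scores), scores


# Hypothetical usage: the callback here only reports the held-out fraction.
mean_score, fold_scores = cross_validate(
    lambda train_idx, test_idx: len(test_idx) / 95, n_samples=95, k=30
)
```

Each sample is held out exactly once, so the averaged score reflects performance on data that the model did not see during the corresponding training run.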
4.6. Analysis with Additional CICIDS2017 and NSL-KDD
In this section, the effectiveness of the proposed EDRL model, focusing on efficient and accurate attack detection, is assessed with two additional datasets, CICIDS2017 and NSL-KDD. This analysis also helps assess the generalization ability of the proposed EDRL model and its suitability for diverse IDSs.
Table 6 presents the performance comparison of the proposed EDRL and existing works using CICIDS2017 and NSL-KDD. The table shows that the EDRL model achieves the highest accuracy, precision, recall, and f1-score among the compared methods. The accuracy, precision, recall, and f1-score values attained by the proposed EDRL model on the CICIDS2017 dataset are 99.93%, 99.78%, 99.57%, and 99.49%, respectively. Similarly, the proposed EDRL model attains accuracy, precision, recall, and f1-score values of 99.42%, 99.32%, 99.12%, and 99.15%, respectively, on the NSL-KDD dataset. These results indicate the generalizability and effectiveness of the proposed EDRL model in IDS.
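For reference, the reported metrics can be derived from the four counts of a binary confusion matrix, as in the minimal Python sketch below. The function name and the toy counts are assumptions made for this illustration and do not come from the paper's experiments; FPR and Cohen's kappa are included because they are also reported for the proposed model.

```python
def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (metric undefined)."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute common IDS metrics from confusion-matrix counts.

    tp: attacks flagged as attacks      fp: normal traffic flagged as attack
    tn: normal traffic passed as normal fn: attacks missed
    """
    total = tp + fp + tn + fn
    accuracy = _ratio(tp + tn, total)
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # identical to the true positive rate
    f1 = _ratio(2 * precision * recall, precision + recall)
    fpr = _ratio(fp, fp + tn)

    # Cohen's kappa: agreement beyond what chance alone would produce.
    p_expected = _ratio((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp), total ** 2)
    kappa = _ratio(accuracy - p_expected, 1 - p_expected)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fpr,
        "kappa": kappa,
    }


# Toy counts, made up for illustration only.
toy_metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
# accuracy 0.925, precision ~0.947, recall 0.9, f1 ~0.923, fpr 0.05, kappa 0.85
```

Because the f1-score is the harmonic mean of precision and recall, a per-class f1-score always lies between that class's precision and recall; scores averaged over several classes need not satisfy this.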
4.7. Comparison with State-of-the-Art Methods
In this section, the results of the EDRL model for attack classification are compared with state-of-the-art approaches to determine its strengths for intrusion detection. From the literature, state-of-the-art approaches such as SSDDQN [28], deep SARSA-based RL [29], GRU-GWO [30], LSTM-GWO [30], and CNN+BiLSTM [31] are considered. Other state-of-the-art models, such as micro reinforcement learning architecture (MRLC) [40], DRL with CNN (Soft DQN (SDQN)) [41], deformable vision transformer (DE-ViT) [42], and conditional tabular generative adversarial network (CTGAN) [43], are also included to assess the ability of the EDRL model. An IDS based on MRLC [40] is presented to improve IDS accuracy by means of a fine-grained learning framework. In [41], anomalous behavior in network systems is analyzed using a CNN architecture and DRL. The resulting system can generate patterns suitable for categorization using a large amount of informative data and the Markov decision technique. DQL is analyzed in the following variants: Soft DQL (SDQL), Double DQL, and Soft Double DQL. Among the examined models, SDQL attained the highest accuracy. In DE-ViT [42], a deformable attention module is included to focus on relevant areas, capture more valuable features, and avoid excessive memory and processing costs. In [43], a CTGAN is employed to address the problem of data imbalance in the dataset. Notable gains in network intrusion detection are attained by training three shallow binary classification algorithms (logistic regression, decision trees, and Gaussian naive Bayes) on both the original imbalanced dataset and the CTGAN-balanced data. In particular, the approach in [43] applies a final XGBoost meta-classifier through a new two-stage label-wise ensembling process.
Table 7 reports the accuracy of EDRL and the state-of-the-art approaches for detecting intrusions in the network. The table shows that the EDRL model achieves better attack classification results than the state-of-the-art approaches, owing to the DAE and IFSA that support the DQN model.
4.8. Discussion
The proposed EDRL model combines the benefits of the deep AE and IFSA to enhance the detection of normal and attack traffic. The deep AE in DRL is a powerful tool for extracting features and reducing dimensionality. It allows the model to learn complex patterns in both labeled and unlabeled data without extensive labeling. By reconstructing the input traffic features, the deep AE captures the structure of normal behavior, which is vital for distinguishing normal from malicious activities. Meanwhile, IFSA facilitates optimal policy selection by optimizing the Q-function. This enables the model to adapt its decision-making process to the representation learned by the AE. This not only enhances the model's capability to detect attacks but also improves its robustness in dynamic and diverse network environments. Moreover, the semi-supervised methodology of the proposed EDRL enables it to use a wider range of data, reducing the need for labeled instances while maintaining good detection performance. This makes it especially suitable for real-world applications where labeled data may be scarce or costly to obtain. Despite its effectiveness, the proposed EDRL model has drawbacks in scalability and training time on larger datasets because of the deep AE's complexity, and the optimization process using the IFSA can result in higher computational demands. In addition, adapting the model for real-time intrusion detection is difficult because it requires quick decision-making and the capability to handle high-velocity data streams. However, effective feature extraction through the AE lowers the dimensionality of the input data and speeds up both training and inference. Scalability and real-time application issues can be further addressed by tuning the IFSA's optimization to improve the model's responsiveness and maintain high performance even in dynamic circumstances.
As an advanced layer for threat classification and anomaly detection, the EDRL model can be readily integrated into current security infrastructures. Because it can learn from both labeled and unlabeled data, it can adapt to varying threat landscapes and continuously improve its detection capabilities as new attack patterns appear. In low-resource environments, such as Internet of Things (IoT) networks, the effective feature extraction of the deep AE lowers computational overhead, allowing the model to function well on devices with constrained processing power. In addition, the framework is suitable for deployment in a variety of environments, since the IFSA's optimization can be adjusted to prioritize resource conservation while retaining detection accuracy. Besides enhancing an organization's security posture, this flexibility increases the model's applicability across sectors with distinct threat detection challenges. Furthermore, the proposed EDRL model improves adaptability to hidden or complex attacks by learning generalized feature representations. To further address labeled data limitations, the model can use active learning to selectively query labels for uncertain samples and continual retraining to update the model with new data. This offers a reliable way to enhance detection in a dynamic environment and remain effective against evolving threats.
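The active learning idea mentioned above can be illustrated with a simple uncertainty-sampling rule: the samples whose classifier score lies closest to the decision boundary are sent to an analyst for labeling. The sketch below is a hypothetical example, not part of the proposed model; it assumes signed decision values (such as the distance to an SVM hyperplane), and the function name and the toy values are invented for illustration.

```python
def select_uncertain(decision_values, budget):
    """Return the indices of the `budget` samples nearest the decision boundary.

    decision_values: signed classifier scores, where the sign gives the
        predicted class and the magnitude reflects confidence (assumption).
    budget: number of samples an analyst can label in this round.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # A small absolute score means the classifier is unsure about the sample.
    ranked = sorted(range(len(decision_values)), key=lambda i: abs(decision_values[i]))
    return ranked[:budget]


# Toy scores for five unlabeled flows, made up for illustration only.
toy_decision_values = [2.1, -0.05, 0.4, -1.7, 0.02]
to_label = select_uncertain(toy_decision_values, budget=2)  # [4, 1]
```

After the selected samples are labeled, they would be added to the training set and the model retrained, which corresponds to the continual retraining step described above.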
5. Conclusions
In this work, an enhanced deep reinforcement learning (EDRL) based model is developed for attack classification (normal and attack) to address the problem of network attacks and their severity. The proposed model involves different phases, and effective algorithms are implemented in each for better performance. The proposed work is implemented in Python and evaluated using the popular UNSW-NB15 IoT intrusion detection dataset. The classification outcomes of the proposed EDRL are evaluated and compared with existing works. The results show that the proposed framework classifies more effectively owing to the application of DAE and IFSA. On UNSW-NB15, the proposed model achieved a precision of 99.4%, a recall of 99.1%, an f1-score of 99.3%, a kappa value of 99.2%, an accuracy of 99.6%, and an FPR of 0.03%. On the CICIDS2017 and NSL-KDD datasets, the maximum accuracy achieved by the EDRL model is 99.93% and 99.42%, respectively. The proposed EDRL framework thus attains higher accuracy than the existing classification frameworks, demonstrating the strength of the model. In the future, alternative metaheuristic algorithms or hybrid optimization techniques and adaptive learning policies will be explored to improve hyperparameter tuning and to update the model dynamically based on emerging attack patterns. Further, the focus will be on a hybrid model that combines supervised learning with RL policies and incorporates a transformer-based architecture to handle temporal and contextual aspects of network traffic (such as user behavior and network topology) more effectively. Moreover, the model can be implemented in complex real-time settings to demonstrate the practicality of the work.