1. Introduction
Network security ensures information confidentiality, integrity, and availability in today’s increasingly digital world. As gatekeepers to a network, firewalls play a crucial role in managing and filtering network traffic, thus protecting the system from potential threats [1]. However, manually configuring and updating firewall rules can be challenging, especially given the exponential growth of data traffic and the rapid evolution of cyber threats.
A firewall is a fundamental component of network security: it regulates inbound and outbound traffic by applying a defined set of rules or conditions. Its primary function is to safeguard an organization’s digital infrastructure from external threats. Firewalls are now also deployed inside organizational networks, beyond their conventional perimeter role. For example, they have been used to restrict employee access to sensitive internal systems such as those for financial management or human resource administration [2]. Five categories of firewall exist: packet filtering firewalls, circuit-level gateways, application-level gateways (commonly referred to as proxy firewalls), stateful firewalls, and the more recent Next Generation Firewall (NGFW). Beyond their principal role, firewall devices and services cover a broader spectrum of security functions. For instance, they offer intrusion detection or prevention systems (IDS/IPS), safeguards against Denial of Service (DoS) attacks, session monitoring, and an array of additional protective services, thereby strengthening the security architecture of the network [3].
Packet filtering firewalls are a widely used tool for network security. Network administrators define firewall rules to implement firewall policies. The firewall examines each packet, which contains user data and control information, and tests it against a predetermined set of rules. If the packet passes the test, the firewall allows it to reach its target; packets that fail the test are rejected. Firewalls test packets by examining rule sets, protocols, ports, and destination addresses [4]. To maintain a secure network, it is essential to understand how packet filtering firewalls work and how to set firewall rules properly [5].
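The rule-testing behavior described above can be sketched as a first-match rule table. This is a minimal illustration only: the `Rule` class, the `filter_packet` function, the sample rules, and the default-deny policy are all assumptions made for the example and are not taken from any particular firewall product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical packet-filter rule; a field set to None matches anything."""
    action: str                       # "allow" or "deny"
    protocol: Optional[str] = None
    dst_port: Optional[int] = None
    dst_addr: Optional[str] = None

    def matches(self, packet: dict) -> bool:
        # A rule matches when every field it specifies equals the packet's value.
        return all(
            wanted is None or packet.get(key) == wanted
            for key, wanted in (
                ("protocol", self.protocol),
                ("dst_port", self.dst_port),
                ("dst_addr", self.dst_addr),
            )
        )


def filter_packet(packet: dict, rules: list, default: str = "deny") -> str:
    """Return the action of the first matching rule, else the default policy."""
    for rule in rules:
        if rule.matches(packet):
            return rule.action
    return default


# Illustrative rule set (assumed values).
RULES = [
    Rule("deny", protocol="tcp", dst_port=23),                         # block Telnet
    Rule("allow", protocol="tcp", dst_port=443),                       # permit HTTPS
    Rule("allow", protocol="udp", dst_port=53, dst_addr="10.0.0.53"),  # one DNS server
]

if __name__ == "__main__":
    packet = {"protocol": "tcp", "dst_port": 443, "dst_addr": "10.0.0.8"}
    print(filter_packet(packet, RULES))
```

Rule order matters in a first-match design: a broad rule placed early can shadow a more specific rule below it, which is one reason manual rule maintenance becomes error-prone as rule sets grow.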
Firewall logs are valuable sources of evidence for network security, but analyzing them can be challenging [6]. Research in network intrusion detection often focuses on using anomaly detection to find attacks. Anomaly detection involves monitoring a network’s activity for deviations from normal profiles learned from benign traffic using machine learning tools [7]. To meet this challenge, artificial intelligence, machine learning, and deep learning have emerged as powerful tools for building security infrastructures that can respond quickly to complex cyberattacks. Various machine learning methods, including multiclass Support Vector Machines (SVMs) and Artificial Neural Networks (ANNs), have been applied successfully to packet classification and filtering within firewall systems, demonstrating the potential of these techniques for strengthening cybersecurity measures [8,9,10,11]. ANNs offer a promising solution among these techniques. Inspired by the functioning of the human brain, ANNs are designed to learn from data and improve their performance over time [12], making them well-suited for detecting patterns in network traffic and enhancing firewall efficiency [13].
The purpose of this research is multifaceted. It addresses critical aspects of firewall packet classification within the network security framework. The principal objective is to introduce and scrutinize a novel methodology incorporating ANNs to efficiently classify network packets within firewall systems, responding to the escalating complexities and diversities in network traffic and the intensifying spectrum of cyber threats.
The research is motivated by two overarching goals:
Data Preparation and Structuring: The first goal is to delineate a coherent and systematic approach for the preprocessing and structuring of raw network traffic data, making it amenable to the rigorous training of machine learning models. The study aims to elucidate the intricacies of data transformation, offering a structured pathway to convert raw network traffic data into an optimal format for training machine learning models.
Model Evaluation and Result Interpretation: The second goal is to delve into the interpretative aspects of the model’s outcomes, emphasizing seminal classification metrics such as precision, recall, F1-Score, and accuracy to ascertain the model’s proficiency and reliability in classifying network packets.
This research acknowledges the challenges posed by class imbalances, particularly in the ‘reset-both’ category, and addresses these issues through the strategic integration of advanced data balancing techniques such as the Synthetic Minority Over-sampling Technique (SMOTE), ADASYN, and BorderlineSMOTE. By incorporating these methods, the study significantly enhances the model’s predictive accuracy and reliability across all classes, thereby advancing the overall effectiveness of firewall packet classification.
In exploring the potential of Artificial Neural Networks (ANNs) within the domain of network packet classification, this research underscores the critical role of data resampling techniques in refining model performance. The study presents a forward-thinking, data-driven approach that bridges the fields of machine learning and network security, aligning with contemporary advancements in the field. It contributes to the evolution of network security by promoting more automated and precise methodologies for firewall packet classification.
This research stands as a seminal contribution by integrating ANNs with SMOTE, ADASYN, and BorderlineSMOTE to develop a robust and efficient framework for firewall packet classification. By combining these advanced methodologies, the study transcends traditional approaches, offering improved solutions to the increasingly complex challenges inherent in network security. This structured approach not only refines the classification process but also enhances the broader understanding and application of machine learning techniques in cybersecurity.
Addressing critical issues such as the exponential growth of data traffic, the rapid evolution of cyber threats, and the inherent challenges of data imbalance, the proposed framework significantly improves the accuracy, reliability, and real-time processing capabilities of firewall packet classification systems. Leveraging the adaptive learning capabilities of ANNs alongside the class-balancing strengths of SMOTE, ADASYN, and BorderlineSMOTE, this research provides comprehensive methodologies for data preprocessing, model training, and evaluation, ensuring solutions that are both practical and scalable. Ultimately, this work underscores the importance of merging theoretical advancements with practical implementations, setting a new benchmark for firewall packet classification and contributing to a more secure and resilient network environment.
2. Related Work
The intricate confluence of machine learning and network security has spurred extensive research, underscoring the paramountcy of ANNs in network traffic classification. This area’s burgeoning interest necessitates a granular examination of related works to delineate our study’s contextual fabric and underscore its significance.
Garcia-Teodoro et al. [14] spearheaded investigations in this domain, providing seminal insights into leveraging diverse machine learning techniques, including neural networks, for intrusion detection systems. This research is pivotal, laying the foundational understanding of the synergy between machine learning and network security and guiding our endeavor to explore advanced techniques for enhanced efficacy. Similarly, Sommer and Paxson [7] highlighted the pressing need for robust machine-learning models to navigate the multifaceted landscape of network traffic data. Their emphasis on resilient models informs our approach, focusing on developing ANNs to address the highlighted complexities [7].
Each subsequent study illuminates distinct facets of this domain. For instance, Ucar and Ozhan [15] concentrated on the adeptness of various data processing methodologies in analyzing firewall log data. On the other hand, Ertam and Kaya [9] and AL-Tarawneh and Bani-Salameh [8] explored classification approaches, providing comparative evaluations of classifiers’ efficacy using different metrics. These studies enriched our comprehension of classification efficacy, steering our selection and evaluation of techniques.
Naseer et al. [16] contrasted deep learning methodologies with conventional machine learning methodologies for intrusion detection, highlighting the former’s superiority. This understanding motivated our deep exploration into advanced learning methodologies for optimizing anomaly detection. Moreover, a comprehensive review by Salman et al. [17] unraveled the multifaceted applications and challenges inherent in internet traffic classification, illuminating the nuances of machine learning-based approaches. Zhao et al. [18] and AL-Behadili [19] probed into the enhancement of firewall systems through diverse machine learning models. Their findings corroborate the transformative potential of machine learning models in fortifying firewall operations and network security.
Andalib and Babamir [20] and Enosh Shaoul and Sonare [21] accentuated the formidable capabilities of machine learning algorithms coupled with big data in detecting anomalies in firewall policies and identifying complex Denial of Service (DoS) and Distributed Denial of Service (DDoS) attacks, respectively. Integrating big data with machine learning in these studies informed our methodology, as it emphasized the value of leveraging diverse datasets and sophisticated algorithms for precise anomaly detection.
The work of Enosh Shaoul and Sonare [21] is pivotal: they used a convolutional neural network (CNN) model to detect complex DoS and DDoS attacks, achieving 98% accuracy. This research not only underscores the potential of CNNs in recognizing various attack patterns but also sets a benchmark for our study. In a comparative study, Zhang et al. [22] introduced a new neural network architecture, MODLSTM. Their approach demonstrated substantial potential in detecting DoS attacks and contributed insights for enhancing the reliability of firewall systems against diverse malicious traffic, which is closely related to our research focus. Building on these insights, Al-Haija and Ishtaiwi [23] engineered an intelligent classification model for firewall systems using a shallow neural network (SNN). Their model, attaining 98.5% accuracy, surpassed several contemporary firewall classification systems, emphasizing the significance of incorporating advanced models in firewall systems, a concept central to our study. Further, Marques et al. [24] developed a dataset based on real DNS logs and applied various machine-learning methods. Their findings, especially the superior performance of the CART algorithm, highlighted the importance of effective feature selection methods and strongly influenced our approach to data classification and feature selection.
Rahman et al. [25] emphasized the significance of measurement data in enhancing classifier performance, achieving a harmonic mean F1-Score with 99% accuracy using a random forest (RF) technique. Their methodology and results provided critical insights into optimizing classifier performance, steering our study toward developing robust and accurate classification models.
As cloud computing becomes increasingly central due to data security concerns and cyber threats, the development of practical intrusion detection systems (IDSs) has gained importance. Machine learning-based IDSs are proficient at detecting network threats, yet they experience performance degradation with large datasets. Bakro et al. [26] have proposed a hybrid approach that addresses the challenges of imbalanced datasets by integrating SMOTE with various feature selection techniques. This approach, tested using an RF model, achieved 98% and 99% accuracy rates on the UNSW-NB15 and Kyoto datasets, respectively, demonstrating superior performance compared to similar studies.
Bakro et al. [27] presented an advanced IDS for cloud environments in their study. This system enhances classification accuracy by optimizing feature selection and weighting the ensemble model using the crow search algorithm (CSA). The ensemble classifier incorporates machine and deep learning models, including RF, SVM, XGBoost, and Decision Tree (DT). Tested on NSL-KDD, Kyoto, and CSE-CIC-IDS-2018 datasets, the ensemble model achieved high accuracy, recall, and precision levels. The study particularly highlights the superiority of the ensemble model in terms of the false positive rate (FPR) and false negative rate (FNR).
In recent years, the evolution of deep learning has introduced novel methodologies like the Deep Packet method [28,29,30]. Due to its capacity to uncover hidden network patterns, this method offers a blueprint for our study, allowing us to delve deeper into network behavior analysis.
In summary, the existing body of research demonstrates significant progress in the application of machine learning techniques for firewall packet classification, with models like Random Forests (RFs), Support Vector Machines (SVMs), and K-Nearest Neighbors (KNNs) showing high accuracy across various datasets. However, a common limitation across many of these studies is the insufficient focus on addressing class imbalance, a critical issue that can skew results and compromise the reliability of classification models. While methods such as under-sampling and basic preprocessing have been applied, they often fall short of effectively balancing the data. Our study builds upon these foundational works by integrating advanced data balancing techniques—SMOTE, ADASYN, and BorderlineSMOTE—into an Artificial Neural Network (ANN) framework. This approach not only enhances the accuracy and reliability of classification across all classes, including minority ones, but also sets a new benchmark in the field of firewall packet classification, addressing the gaps identified in previous research.
Table 1 presents a comparative analysis of the success rates in using ANNs and various machine learning techniques for firewall packet classification across different datasets and feature sets.
In conclusion, our study makes a significant contribution to the field of network security by addressing the critical challenge of firewall packet classification through the innovative combination of Artificial Neural Networks (ANNs) and advanced data balancing techniques, including SMOTE, ADASYN, and BorderlineSMOTE. Unlike previous studies, such as those by Aljabri et al. [6] and AL-Tarawneh and Bani-Salameh [8], which primarily relied on traditional machine learning models like Random Forests (RFs) and K-Nearest Neighbors (KNNs) without specifically addressing class imbalance, our approach directly tackles this issue, which is often a limiting factor in achieving high accuracy and reliable classification performance.
The primary objective of our research was to improve the accuracy and reliability of classifying firewall packets, particularly in the face of class imbalances that can compromise the effectiveness of security systems. While studies like those by Ertam and Kaya [9] and Ucar and Ozhan [15] demonstrated high accuracy with methods like Support Vector Machines (SVMs) and KNNs, they did not incorporate strategies to mitigate the impact of imbalanced datasets. By carefully analyzing network traffic data and implementing these advanced balancing techniques, we developed a robust model that achieves enhanced classification performance across all packet classes, including those that are traditionally underrepresented.
This study not only advances the current understanding and application of machine learning in network security, where prior works such as those by AL-Behadili [19] and Rahman et al. [25] showed the efficacy of Decision Trees (DTs) and Random Forests, but also establishes a solid foundation for future research. By explicitly integrating data balancing methods that adaptively generate synthetic samples based on the difficulty of classifying minority classes, as demonstrated in our approach, we aspire to set a new standard for firewall packet classification methodologies, addressing the shortcomings of previous studies that often overlooked the complexities introduced by class imbalance. This research, therefore, builds upon and significantly contributes to existing work in the field, offering a more comprehensive and effective solution to the challenges of network security.
3. Materials and Methods
3.1. Firewall Data
The dataset used in the analysis, summarized in Table 2, comprises a variety of attributes pertinent to network traffic. These attributes include the source port, destination port, Network Address Translation (NAT) source port, NAT destination port, and metrics related to data flow, such as the volume of bytes and the number of packets sent and received. Additionally, the dataset captures the duration of network activities, providing a comprehensive overview of network traffic characteristics essential for effective firewall packet classification. Furthermore, the dataset records the action taken in response to the network traffic: allow, deny, drop, or reset-both [36].
Here is a brief description of each of the fields in the dataset:
Source Port: this designates the port number on the originating machine from which the network traffic is initiated.
Destination Port: this signifies the port number on the recipient machine to which the network traffic is directed.
NAT Source Port: NAT, an acronym for Network Address Translation, is a methodology firewalls use to correlate an internal (private) IP address to a distinctive public IP address. This field represents the source port number post the application of NAT.
NAT Destination Port: this constitutes the destination port number after the NAT implementation.
Bytes: this quantifies the cumulative number of bytes constituting the data in the network packet.
Bytes Sent: this represents the volume of data bytes dispatched from the source to the destination.
Bytes Received: this demarcates the quantity of data bytes that the source accrues from the destination.
Packets: this denotes the overall count of network packets embodied in the network traffic.
Elapsed Time (s): this signifies the total duration of the network communication.
pkts_sent: this quantifies the network packets transmitted from the source to the destination.
pkts_received: this enumerates the network packets the source garners from the destination.
Action: the “Action” field in a firewall denotes the prescribed response to network packets and acts as a categorical variable, including ‘allow’, ‘deny’, ‘drop’, and ‘reset-both’ as potential values. ‘Allow’ permits passage of the packet through the network; ‘deny’ blocks it; ‘drop’ silently discards it without notifying the source; and ‘reset-both’ sends a reset packet to both source and destination, terminating the connection. As demonstrated in Figure 1, the distribution of actions is as follows: ‘allow’ at 37,640 instances, ‘deny’ at 14,987, ‘drop’ at 12,851, and ‘reset-both’ at 54, reflecting the varying frequencies of each response type in the analyzed dataset.
3.2. Limitations
The dataset contains a large number of instances classified as “allow” (37,640 occurrences), indicating a prevalence of packets allowed by the firewall. There are also substantial numbers of samples classified as “deny” (14,987 occurrences) and “drop” (12,851 occurrences), representing packets that were denied or dropped by the firewall, respectively. However, the “reset-both” class is severely underrepresented in the dataset, with only 54 occurrences. This class imbalance raises concerns about the model’s ability to predict the “reset-both” class accurately. Moreover, the small number of “reset-both” instances might limit the model’s capacity to generalize and to capture the complexity of real-world situations, particularly those where the firewall resets the connection. Furthermore, within the methodological constraints of the study, the analysis was conducted solely using ANNs and SMOTE. The sequential process followed in this study is depicted in Figure 2.
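The scale of the imbalance follows directly from the class counts quoted above. The short sketch below only restates that arithmetic; the variable names are illustrative, and the "majority baseline" is the accuracy a trivial classifier would reach by always predicting the most frequent class.

```python
# Class counts as reported for the 'Action' field of the dataset.
ACTION_COUNTS = {"allow": 37640, "deny": 14987, "drop": 12851, "reset-both": 54}

total = sum(ACTION_COUNTS.values())
shares = {action: count / total for action, count in ACTION_COUNTS.items()}

# Ratio between the largest and the smallest class.
imbalance_ratio = max(ACTION_COUNTS.values()) / min(ACTION_COUNTS.values())

# Accuracy of always predicting the most frequent class.
majority_baseline = max(shares.values())

if __name__ == "__main__":
    print(f"total samples: {total}")
    for action, share in shares.items():
        print(f"{action:>10}: {share:.4%}")
    print(f"largest/smallest class ratio: {imbalance_ratio:.0f}:1")
    print(f"majority-class baseline accuracy: {majority_baseline:.2%}")
```

Because ‘reset-both’ makes up well under 0.1% of the samples, misclassifying every one of them barely moves overall accuracy, which is why per-class metrics are needed alongside it.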
3.3. Data Analysis
The aim is to train an ANN model to predict the ‘Action’ taken given the other network features. The model is implemented in Python using the Keras library with the TensorFlow backend.
The method used in the analysis can be outlined as follows:
Data Preparation: The dataset is first loaded into a pandas DataFrame. The target variable ‘Action’ is separated from the input features. The input features are normalized using the StandardScaler from the sklearn library, which transforms the data such that their distribution will have a mean value of 0 and a standard deviation of 1. The target variable is label-encoded and then one-hot-encoded to transform into a binary class matrix for use with the categorical cross-entropy loss function during training.
Data Splitting: The dataset is subjected to a partitioning procedure to create distinct training and testing subsets. Specifically, 80% of the dataset is allocated for training purposes, establishing the foundation for the model’s learning. The remaining 20% is reserved for testing, providing an unbiased evaluation of the model’s predictive capability on unseen data.
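The preparation and splitting steps above can be illustrated without any third-party library. The sketch below is not the study's code: it re-implements, in plain Python, what standardization, label encoding with one-hot encoding, and an 80/20 split do. The function names, the fixed seed, and the shuffling strategy are assumptions made for the example, and rounding the test-set size up is an assumption chosen because it reproduces the 13,107 test samples reported in the results.

```python
import math
import random
from statistics import fmean, pstdev


def standardize(column):
    """Scale one feature column to mean 0 and (population) standard deviation 1."""
    mu, sigma = fmean(column), pstdev(column)
    return [(value - mu) / sigma if sigma else 0.0 for value in column]


def one_hot(labels):
    """Label-encode (sorted class order) and then one-hot-encode the labels."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[1 if index[label] == j else 0 for j in range(len(classes))]
              for label in labels]
    return classes, matrix


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the row indices and reserve a fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_fraction)  # assumed: round the test size up
    return indices[n_test:], indices[:n_test]


if __name__ == "__main__":
    print(standardize([2.0, 4.0, 6.0]))
    print(one_hot(["drop", "allow", "deny", "allow"]))
    train_idx, test_idx = train_test_split_indices(65532)
    print(len(train_idx), len(test_idx))
```

With only 54 ‘reset-both’ rows, a purely random split leaves roughly 11 of them in the test set on average, so a stratified split may be worth considering; whether the study stratified is not stated.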
3.4. Artificial Neural Networks (ANNs)
ANNs epitomize a subset of machine learning models that derive their foundational concept from biological neural networks’ structural and operational paradigm. ANNs encompass a network of interconnected artificial neurons systematically organized in layers. These layers are instrumental in processing and disseminating information via weighted connections [37]. The artificial neuron mimics a biological neuron’s input, processing, and output properties. Figure 3 shows the results produced by the network. The net input obtained by multiplying the information entered into the network by its weights (W) is processed with the transfer function and taken from the output layer [38,39,40].
Classification employing ANNs necessitates training a neural network model to categorize data into distinct classes or categories. ANNs have been extensively applied to classification tasks due to their ability to learn intricate patterns and relationships in the data. The model encompasses an input layer, a hidden layer equipped with ten neurons, a subsequent hidden layer endowed with five neurons, and an output layer having a neuron count equivalent to the number of classes within the classification task. The Rectified Linear Unit (ReLU) is utilized in the hidden layers, while the sigmoid activation function is employed in the output layer.
Multilayer Perceptrons can learn relationships between inputs and outputs that are not linear. Therefore, this method was used in the study.
Model Architecture: A sequential ANN model is constructed using the Keras deep learning framework. This model comprises an input layer, two hidden layers, and an output layer. The input layer has as many neurons as there are input features. The two hidden layers have 10 and 5 neurons, respectively, and apply the Rectified Linear Unit (ReLU) as their activation function. The output layer has as many neurons as there are classes in the target variable and employs the Sigmoid activation function, providing a score between 0 and 1 for each class as the output.
Model Compilation: The model is compiled with the Adam optimization algorithm and the categorical cross-entropy loss function. Categorical cross-entropy is a standard choice for multi-class classification problems because it measures the discrepancy between the predicted probabilities and the actual class labels [41].
Model Training: the model is trained on the training data for a specified number of epochs (50). The batch size (50) is also specified.
Model Evaluation: the model’s performance on the test set is evaluated, and the accuracy of the predictions is reported.
Results Visualization: A confusion matrix and a classification report are constructed to delineate the model’s performance on the test set. The confusion matrix provides a graphical representation of the model’s predictive success and failure for each class, enumerating correct and incorrect predictions. In parallel, the classification report offers salient classification metrics such as precision, recall, and F1-Score for each class. These tools collectively facilitate a comprehensive understanding of the model’s class-wise performance.
The Sigmoid and ReLU activation functions used in the ANNs are calculated as follows:
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
\[ \mathrm{ReLU}(x) = \max(0, x) \]
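The two activation functions and the layer layout described in this section can be sketched as a plain-Python forward pass. This is an illustration under stated assumptions rather than the study's Keras model: the input width of 11 assumes one input per non-target field listed in Section 3.1, the four outputs correspond to the four actions, the weights are random placeholders instead of trained values, and all function names are hypothetical.

```python
import math
import random


def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W . x + b) for each neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


# Assumed sizes: 11 input features -> 10 -> 5 -> 4 classes.
LAYER_SIZES = [11, 10, 5, 4]


def build_network(seed=0):
    """Create placeholder (untrained) weights and zero biases for every layer."""
    rng = random.Random(seed)
    network = []
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        network.append((weights, [0.0] * n_out))
    return network


def forward(network, features):
    """ReLU on the hidden layers, sigmoid on the output layer."""
    activations = features
    for depth, (weights, biases) in enumerate(network):
        is_output = depth == len(network) - 1
        activations = dense(activations, weights, biases, sigmoid if is_output else relu)
    return activations


def parameter_count(network):
    """Total number of weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in network)


if __name__ == "__main__":
    net = build_network()
    print(parameter_count(net))
    print(forward(net, [0.0] * 11))
```

With sigmoid outputs each class score is computed independently, so the four scores need not sum to 1; the predicted class is simply the index of the largest score.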
3.5. Synthetic Minority Over-Sampling Technique (SMOTE)
In machine learning classification, imbalanced data refers to datasets where the number of instances belonging to different classes is uneven, leading to a potential bias in the classifier’s performance. SMOTE, introduced by Chawla et al. [42], is an oversampling technique that aims to overcome the imbalance problem by generating synthetic examples for the minority class. The method works by randomly selecting a member from the minority class and identifying its K-Nearest Neighbors (k-NNs). From these neighbors, SMOTE creates new synthetic examples along the line segment connecting the minority class member and its neighbors [42].
Generating synthetic examples involves selecting a minority class instance and calculating the feature-wise differences between it and one of its neighbors. SMOTE then randomly chooses a number between 0 and 1 and multiplies the feature-wise differences by this value. The resulting values are added to the selected minority instance, producing a new synthetic example that represents the minority class but differs slightly in its feature values. Applying SMOTE makes the class distribution more balanced, as synthetic samples are introduced to augment the minority class. This helps to alleviate the bias caused by imbalanced data and allows the classifier to learn from a more representative dataset [43]. The minority class member generation diagram based on SMOTE and k-NN algorithms is shown in Figure 4 below.
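The interpolation step described above can be sketched in a few lines of plain Python. This is a simplified illustration, not a substitute for a library implementation: the function names, the Euclidean distance, the brute-force neighbor search, and the fixed seed are all assumptions made for the example.

```python
import math
import random


def nearest_neighbors(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean), excluding itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating toward a neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)                             # random minority member
        neighbor = rng.choice(nearest_neighbors(base, minority, k))
        gap = rng.random()                                      # random number in [0, 1)
        # new = base + gap * (neighbor - base), computed feature by feature
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


if __name__ == "__main__":
    minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for sample in smote(minority, n_synthetic=3, k=2):
        print(sample)
```

Every synthetic point lies on a segment joining two existing minority samples, so the method never extrapolates outside the region the minority class already occupies.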
3.6. BorderlineSMOTE
BorderlineSMOTE is an advanced oversampling technique developed to improve upon the traditional SMOTE method, which generates synthetic samples for minority classes using their nearest neighbors [42]. A key limitation of SMOTE is its tendency to cause overlap between minority and majority class samples, leading to potential misclassification. To address this, Han et al. introduced BorderlineSMOTE, which focuses specifically on minority class samples near the decision boundary, those most susceptible to misclassification [45].
The BorderlineSMOTE algorithm identifies the K-Nearest Neighbors of each minority class sample. If a sample is surrounded predominantly by majority class neighbors, it is placed in a DANGER set, as it is more likely to be misclassified. Synthetic samples are then generated for these DANGER set instances by interpolating between the original sample (P_d) and its nearest neighbors (P_k). The new synthetic sample (P_new) is calculated using the following formula [46,47]:
\[ P_{new} = P_d + \mathrm{rand}(0, 1) \times (P_k - P_d) \]
Here, rand(0, 1) is a random number between 0 and 1, ensuring that the synthetic sample is located along the line segment between the original sample and its nearest neighbor.
By focusing on generating synthetic samples near the decision boundary, BorderlineSMOTE reduces the risk of class overlap and improves the classifier’s ability to distinguish between classes, particularly in datasets with significant class imbalance.
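The selection of the DANGER set can be sketched as follows. The threshold used here (at least half, but not all, of the m nearest neighbors belong to the majority class, with all-majority neighborhoods treated as noise and skipped) is the rule commonly attributed to the original BorderlineSMOTE formulation; it is stated as an assumption because the text above only says "predominantly". The function names and the toy points are hypothetical.

```python
import math


def majority_neighbor_count(sample, minority, majority, m):
    """Count how many of the m nearest neighbors of `sample` are majority points."""
    labeled = [(math.dist(sample, p), False) for p in minority if p is not sample]
    labeled += [(math.dist(sample, p), True) for p in majority]
    labeled.sort(key=lambda pair: pair[0])
    return sum(is_majority for _, is_majority in labeled[:m])


def danger_set(minority, majority, m=5):
    """Minority samples with m/2 <= (majority neighbors) < m: the borderline cases."""
    danger = []
    for sample in minority:
        count = majority_neighbor_count(sample, minority, majority, m)
        if m / 2 <= count < m:   # count == m is treated as noise, count < m/2 as safe
            danger.append(sample)
    return danger


def interpolate(p_d, p_k, r):
    """P_new = P_d + r * (P_k - P_d), with r drawn from [0, 1)."""
    return tuple(d + r * (k - d) for d, k in zip(p_d, p_k))


# Toy data (assumed): a safe cluster, two borderline points, and one isolated point.
MINORITY = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (1.0, 0.0), (1.04, 0.0), (5.0, 0.0)]
MAJORITY = [(1.1, 0.0), (1.2, 0.0), (0.9, 0.1),
            (5.1, 0.0), (5.2, 0.0), (4.9, 0.0), (5.0, 0.1)]

if __name__ == "__main__":
    print(danger_set(MINORITY, MAJORITY, m=3))
    print(interpolate((1.0, 0.0), (1.04, 0.0), 0.5))
```

In this toy layout only the two points sitting next to the majority cluster are selected; the isolated point surrounded entirely by majority samples is skipped as noise, and the tight minority cluster is considered safe.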
3.7. Adaptive Synthetic Sampling (ADASYN)
ADASYN, introduced by He et al. [48], is an adaptive oversampling technique designed to address class imbalance by generating synthetic samples for the minority class. Unlike the traditional SMOTE method, which generates an equal number of synthetic samples for each minority instance, ADASYN prioritizes those minority samples that are harder to classify, thereby producing more synthetic data around these challenging instances. This approach reduces the learning bias introduced by imbalanced data and enhances the classifier’s ability to correctly classify minority class instances [49].
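How ADASYN distributes synthetic samples can be sketched with the allocation step alone. In this simplified, assumed form, each minority sample receives a difficulty score equal to the fraction of majority points among its k nearest neighbors; the scores are normalized to sum to 1 and multiplied by the total number of samples to generate. The function name, the `beta` balance parameter, the rounding, and the toy points are illustrative choices, not the study's implementation.

```python
import math


def adasyn_allocation(minority, majority, k=5, beta=1.0):
    """Return how many synthetic samples to create around each minority sample."""
    # Total to generate: beta = 1.0 aims for a fully balanced class distribution.
    total_to_generate = (len(majority) - len(minority)) * beta

    ratios = []
    for sample in minority:
        labeled = [(math.dist(sample, p), 0) for p in minority if p is not sample]
        labeled += [(math.dist(sample, p), 1) for p in majority]
        labeled.sort()
        # Difficulty: share of majority points among the k nearest neighbors.
        ratios.append(sum(flag for _, flag in labeled[:k]) / k)

    norm = sum(ratios)
    if norm == 0:                 # no minority sample has a majority neighbor
        return [0] * len(minority)
    return [round(r / norm * total_to_generate) for r in ratios]


if __name__ == "__main__":
    minority = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (1.0, 0.0)]
    majority = [(1.1, 0.0), (1.2, 0.0), (0.9, 0.1),
                (3.0, 0.0), (3.1, 0.0), (3.2, 0.0), (3.3, 0.0), (3.4, 0.0)]
    print(adasyn_allocation(minority, majority, k=3))
```

The minority point embedded among majority samples receives the largest share, while the points in the tight minority cluster receive fewer; because of rounding, the allocations need not sum exactly to the target total.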
3.8. Model Performance and Evaluation
The TP, TN, FP, and FN confusion matrix metrics in Table 3 provide values for correct or incorrect classification of packets at the firewall. Recall, also known as sensitivity, measures the model’s ability to identify positive cases correctly. It is computed as the ratio of true positive predictions to the total number of actual positive cases. Precision evaluates the proportion of relevant instances among those retrieved by the model, defined as the ratio of true positive predictions to all positive predictions made by the model. Accuracy represents the overall effectiveness of a model in making correct predictions, calculated as the proportion of correct predictions (both true positives and true negatives) to the total number of cases. F-Score, or F1-Score, integrates precision and recall into a single metric by calculating their harmonic mean. This metric is useful when seeking a balance between precision and recall, providing a more holistic evaluation of model performance. These values were used to calculate precision, recall, F-measure, and accuracy metrics as follows:
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]
\[ \mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
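The metrics described above can be computed per class from a multi-class confusion matrix. The sketch below uses a small hypothetical matrix, not the study's results, chosen to show how overall accuracy can stay high while a rare class scores zero. Rows are actual classes and columns are predicted classes; the function name and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def per_class_metrics(confusion, labels):
    """Return overall accuracy and {label: (precision, recall, f1)}.

    confusion[i][j] is the number of samples of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(labels)))
    report = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                   # actual i, predicted otherwise
        fp = sum(row[i] for row in confusion) - tp    # predicted i, actually otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = (precision, recall, f1)
    return correct / total, report


# Hypothetical counts for illustration only.
LABELS = ["allow", "deny", "drop", "reset-both"]
CONFUSION = [
    [990, 10, 0, 0],    # actual allow
    [0, 480, 20, 0],    # actual deny
    [0, 0, 490, 0],     # actual drop
    [2, 8, 0, 0],       # actual reset-both: never predicted
]

if __name__ == "__main__":
    accuracy, report = per_class_metrics(CONFUSION, LABELS)
    print(f"accuracy: {accuracy:.3f}")
    for label, (p, r, f1) in report.items():
        print(f"{label:>10}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In the multi-class setting, overall accuracy reduces to the sum of the diagonal divided by the total number of samples, which is why it is dominated by the large classes.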
4. Results
In this investigation, the proficiency of the ANN model in classifying firewall packets based on their attributes is scrutinized. A range of widely accepted performance metrics, encompassing the confusion matrix, accuracy, precision, recall, and F1-Score, are employed to comprehensively evaluate the model’s performance.
As presented in Table 4, the initial experiments involved using ANNs without SMOTE to classify firewall packets. The results revealed that for the “allow”, “deny”, and “drop” classes, the model achieved perfect precision, recall, and F1-Score values (1.00), indicating flawless performance. However, for the “reset-both” class, all these metrics were at 0.00, indicating that the model could not accurately predict any instance of this class. Despite the issue with the “reset-both” class, the model’s overall accuracy was still extremely high (1.00), as measured across a total of 13,107 samples.
The confusion matrix delineated in Figure 5 illuminates the proficiency of the ANNs in categorizing network actions into ‘allow’, ‘deny’, and ‘drop’. The ANN exhibits high accuracy with 7521 correct predictions for ‘allow’ and a complete accuracy for ‘drop’ actions at 2562 predictions. While the ‘deny’ class has a notable accuracy with 2939 correct classifications, there is a small fraction of misclassifications, with 55 instances erroneously predicted as ‘drop’. The model notably underperforms in the ‘reset-both’ category, misclassifying all instances into other categories, indicating a need for model refinement or reconsideration of the feature set for this particular class to improve its predictive capability. The results highlight the strengths and areas for improvement in the ANN’s capacity to discern between classes, particularly when handling imbalanced datasets or underrepresented classes.
Figure 6 presents the graphs depicting the model’s training and validation accuracy and its training and validation loss.
The accuracy graph in Figure 6 shows a rapid increase during the initial epochs, with both training and validation accuracy quickly reaching and stabilizing around 99%. This high level of accuracy, maintained consistently across epochs, indicates that the model is well-trained and highly capable of generalizing to unseen data, with minimal discrepancies between the training and validation sets.
The loss graph complements this observation, showing a steep decline in both training and validation losses early in the training process. The losses stabilize at low values, reflecting the model’s strong learning capability and its ability to minimize errors. Although there are minor fluctuations, particularly a noticeable spike in the validation loss around epoch 30, the overall trend suggests that the model effectively converges, with both losses remaining low and closely aligned.
These results underscore the model’s robustness and efficiency, highlighting its ability to achieve near-perfect accuracy with minimal overfitting. The minor variations in the validation loss suggest that the model handles the complexity of the data well, maintaining a high level of performance throughout the training process.
The SMOTE algorithm was applied to balance the dataset and address the shortcomings of the initial model. The results of the subsequent experiments using this balanced dataset are presented in
Table 5. Here, we see that the model now performed well for all classes, including the previously problematic “reset-both” class, which showed a precision of 0.77, a recall of 0.87, and an F1-Score of 0.82. The “allow” and “drop” classes maintained their perfect scores, while the “deny” class showed a slight drop but still had high scores. When tested on the set of 13,107 samples in
Table 4, the model’s overall accuracy was 1.00, indicating a high level of predictive capability across all classes.
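The core idea of SMOTE, generating a synthetic minority sample by interpolating between an existing minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified NumPy illustration under stated assumptions, not the implementation used in the experiments; the function name `smote_sketch`, the parameter `k`, and the toy points are all hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other points
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)           # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(n)                   # pick a minority sample
        j = neighbours[i, rng.integers(k)]    # pick one of its k nearest neighbours
        lam = rng.random()                    # interpolation factor in [0, 1)
        synthetic[s] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synthetic

# Toy minority samples (hypothetical two-feature points).
X_reset = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_sketch(X_reset, n_new=6, k=2)
print(new_points.shape)
```

Every synthetic point lies on a line segment between two real minority samples, which is why SMOTE enlarges the minority class without simply duplicating records.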
The confusion matrix presented in
Figure 7 underscores the elevated level of predictive precision achieved by the ANN after applying SMOTE to a balanced training dataset. The ANN demonstrated exceptional accuracy in the classification of ‘allow’, ‘deny’, and ‘drop’ directives, with the classification of ‘drop’ actions being particularly noteworthy due to its flawless accuracy. Notwithstanding a slight misclassification in the ‘reset-both’ category, such instances were negligible in the grand scope of the data, emphasizing the model’s resilience. The findings corroborate the effectiveness of employing SMOTE with an ANN to amplify the training dataset’s balance, thereby significantly boosting the model’s discriminative capacity across the various classes within the dataset.
Figure 8 presents the graphs depicting the model’s training and validation accuracy and its training and validation loss.
The training and validation graphs presented provide critical insights into the performance of the ANN model when combined with SMOTE, as reflected in the numerical results of
Table 5. The accuracy graph (left) shows that while training accuracy steadily increases, validation accuracy demonstrates fluctuations across epochs. These fluctuations indicate potential overfitting, where the model might be learning patterns specific to the training data but not generalizing well to unseen data. This behavior is consistent with the precision, recall, and F1-Scores observed, particularly for the minority classes like “deny” and “reset-both”, where the recall values are lower, leading to a weighted average accuracy of 90%. The loss graph (right) further corroborates these observations, with the validation loss not following the smooth decrease seen in the training loss. This divergence suggests that while the model is minimizing error on the training data, it struggles to achieve the same level of performance on the validation set, reinforcing the need for further tuning or alternative approaches to improve generalization.
The results presented in
Table 6 demonstrate the performance of the ANN model when applying the ADASYN function for data balancing. The overall accuracy is 85%, indicating a decrease compared to the SMOTE-enhanced model. This reduction is primarily attributable to the “deny” class, which shows a recall of 60% and an F1-Score of 68%, highlighting the model’s struggle to accurately identify instances in this class. While the “allow” and “drop” classes exhibit high precision and recall, the “reset-both” class also encounters moderate challenges, as evidenced by its F1-Score of 79%. The weighted and macro averages, both at 85%, reflect a more balanced yet less robust performance across all classes compared to the SMOTE results. These findings suggest that while ADASYN contributes to addressing class imbalance, it may not be as effective as SMOTE in enhancing the model’s overall classification accuracy and stability across all classes. Further optimization or the exploration of alternative methods may be necessary to achieve higher performance levels, particularly for the more challenging minority classes.
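ADASYN differs from SMOTE mainly in where it places synthetic samples: minority points whose neighbourhoods contain more samples from other classes receive a larger share of the new samples. The sketch below computes only this allocation step. It is a simplified illustration: treating all non-minority classes as "other" is an assumption made here for the multi-class case, and the function name `adasyn_allocation` and the toy data are hypothetical.

```python
import numpy as np

def adasyn_allocation(X, y, minority_label, n_new, k=5):
    """Return how many synthetic samples to generate per minority point."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)
    # Distances from each minority point to every point in the dataset.
    dists = np.linalg.norm(X[minority_idx, None, :] - X[None, :, :], axis=2)
    # Exclude each minority point from its own neighbour list.
    dists[np.arange(len(minority_idx)), minority_idx] = np.inf
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # r_i: fraction of the k nearest neighbours that belong to other classes.
    r = (y[neighbours] != minority_label).mean(axis=1)
    if r.sum() == 0:
        # No minority point has foreign neighbours: fall back to uniform weights.
        weights = np.full(len(r), 1.0 / len(r))
    else:
        weights = r / r.sum()
    # Rounding means the total may differ slightly from n_new.
    return np.rint(weights * n_new).astype(int)

# Toy one-feature data: two isolated minority points and one near the majority.
X = [[0.0], [0.1], [0.9], [1.0], [1.1], [1.2]]
y = ["reset-both", "reset-both", "reset-both", "allow", "allow", "allow"]
alloc = adasyn_allocation(X, y, "reset-both", n_new=10, k=2)
print(alloc.tolist())
```

Because the new samples concentrate near class boundaries, ADASYN can blur the separation between neighbouring classes, which is one plausible reading of the “deny” and “reset-both” confusion reported here.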
The confusion matrix in
Figure 9, coupled with the performance metrics in
Table 6, offers a detailed insight into the model’s classification capabilities following the application of the ADASYN function. The model exhibits outstanding accuracy in predicting the “allow” and “drop” classes, with near-perfect precision and recall values, as evidenced by minimal misclassifications. However, the matrix reveals a notable challenge in distinguishing between the “deny” and “reset-both” classes, where a significant number of “deny” instances are incorrectly classified as “reset-both”. This misclassification is reflected in the lower recall and F1-Scores for the “deny” class, indicating that while the model is generally effective, there is a clear need for further refinement to enhance its ability to accurately differentiate between these classes. These results suggest that while the model performs well overall, targeted improvements are necessary to address these specific classification challenges and improve its robustness in more complex scenarios.
Figure 10 illustrates the training and validation accuracy and loss curves across 50 epochs for the ANN model enhanced with the ADASYN function. The accuracy graph shows a steady increase in both training and validation accuracy, eventually converging around 85%, indicating that the model achieves a high level of generalization without overfitting. The close alignment of the training and validation curves further suggests that the model maintains consistency in its performance across both the training and validation datasets.
On the loss graph, both training and validation losses decrease significantly during the initial epochs, stabilizing around a similarly low value after approximately 20 epochs. The minimal gap between the training and validation loss curves, along with their convergence, indicates that the model is learning effectively, with no signs of significant overfitting or underfitting. Overall, these graphs confirm that the ANN model, when combined with ADASYN, performs robustly, achieving a balanced and reliable classification across the imbalanced dataset.
Table 7 summarizes the performance metrics for the ANN model enhanced with the BorderlineSMOTE function, highlighting its effectiveness in classifying across different categories. The model achieves exemplary results, particularly in the “allow” and “reset-both” classes, with F1-Scores of 1.00 and 0.98, respectively, reflecting nearly perfect precision and recall. The “deny” class also performs strongly, with an F1-Score of 0.95, showcasing the model’s ability to handle this category with high accuracy despite the inherent complexity. The “drop” class, though with a lower support, still attains a notable F1-Score of 0.92, supported by a perfect recall, demonstrating the model’s robustness even with fewer instances.
The overall accuracy of 0.97, combined with the high weighted and macro averages of 0.96 and 0.97, respectively, indicates that the model not only addresses class imbalance effectively but also sustains a high level of accuracy across all classes. These results underscore the model’s capability to deliver balanced and reliable classification performance, making it a highly suitable approach for tasks that demand both precision and sensitivity, particularly in imbalanced datasets.
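The difference between the macro and weighted averages quoted above is that the macro average treats every class equally, whereas the weighted average scales each class score by its support. A minimal sketch follows; the scores and supports are hypothetical and are not taken from Table 7.

```python
def macro_and_weighted(scores, supports):
    """Return (macro average, support-weighted average) of per-class scores."""
    macro = sum(scores) / len(scores)
    weighted = sum(s * n for s, n in zip(scores, supports)) / sum(supports)
    return macro, weighted

# Hypothetical per-class F1-Scores and supports, for illustration only.
f1_scores = [1.00, 0.90, 0.80, 0.70]
supports = [700, 200, 50, 50]
macro_f1, weighted_f1 = macro_and_weighted(f1_scores, supports)
print(f"macro={macro_f1:.3f} weighted={weighted_f1:.3f}")
```

When the two averages are close, as reported for this model, performance does not depend strongly on class size.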
Figure 11 illustrates the confusion matrix for the ANN model combined with the BorderlineSMOTE function, showcasing the model’s classification performance across the four classes. The model demonstrates exceptional accuracy, particularly in the “allow” and “reset-both” classes, with a high number of correctly classified instances (6001 and 5867, respectively) and minimal misclassifications. The “deny” class also performs well, although there is a slight decrease in precision, as indicated by 384 instances being misclassified as “reset-both” and 160 as “allow”. Notably, the “drop” class is perfectly classified, with all 2122 instances accurately identified.
This confusion matrix highlights the model’s robustness, especially in handling the “allow” and “reset-both” classes, and its effectiveness in maintaining high classification accuracy despite class imbalances. The minor misclassifications in the “deny” class suggest an area for potential improvement, but overall, the model exhibits outstanding performance, providing reliable predictions across all categories. These results underscore the efficacy of integrating ANNs with the BorderlineSMOTE function in enhancing the model’s ability to accurately differentiate between various classes.
Figure 12 displays the accuracy and loss graphs for both the training and validation phases in the ANN model enhanced with the BorderlineSMOTE function. The accuracy graph shows that both training and validation accuracy improve steadily, stabilizing around 97%, indicating that the model has achieved high generalization capability. The close alignment between the training and validation accuracy curves, with only minor fluctuations, suggests that the model is not overfitting and is performing consistently across both datasets.
The loss graph further corroborates this observation, with both training and validation losses decreasing over time. The training loss stabilizes early, while the validation loss continues to fluctuate slightly, which is expected given the complexity of the data and the nature of the BorderlineSMOTE function. Despite a few spikes in the validation loss, the overall downward trend and the eventual convergence of the training and validation losses indicate that the model is effectively learning and generalizing.
These results affirm that the ANN model, when combined with BorderlineSMOTE, not only handles class imbalances effectively but also maintains strong predictive performance with minimal overfitting, making it a robust choice for complex classification tasks.
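BorderlineSMOTE restricts oversampling to minority samples near the class boundary. Following the usual formulation, a minority sample is treated as "noise" if all of its m nearest neighbours belong to other classes, as "danger" if at least half (but not all) do, and as "safe" otherwise; only "danger" samples seed new synthetic points. The sketch below shows this categorisation step only. The function name `borderline_categories`, the value of `m`, and the toy data are hypothetical and do not reflect the configuration used in the experiments.

```python
import numpy as np

def borderline_categories(X, y, minority_label, m=5):
    """Label each minority point as 'safe', 'danger', or 'noise'."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)
    # Distances from each minority point to every point in the dataset.
    dists = np.linalg.norm(X[minority_idx, None, :] - X[None, :, :], axis=2)
    dists[np.arange(len(minority_idx)), minority_idx] = np.inf  # skip self
    neighbours = np.argsort(dists, axis=1)[:, :m]
    # Number of the m nearest neighbours that belong to other classes.
    n_other = (y[neighbours] != minority_label).sum(axis=1)
    categories = []
    for count in n_other:
        if count == m:
            categories.append("noise")    # surrounded entirely by other classes
        elif count >= m / 2:
            categories.append("danger")   # on the border: used for oversampling
        else:
            categories.append("safe")     # well inside the minority region
    return categories

# Toy one-feature data: six minority points followed by six majority points.
X = [[0.0], [0.1], [0.2], [0.8], [0.9], [1.5],
     [1.0], [1.1], [1.2], [1.4], [1.6], [1.7]]
y = ["reset-both"] * 6 + ["deny"] * 6
cats = borderline_categories(X, y, "reset-both", m=2)
print(cats)
```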
The real-time performance metrics in
Table 8 reveal notable differences in latency and throughput among the various ANN models used for firewall packet classification. The baseline ANN model, without any data balancing techniques, exhibits the highest latency at 0.355833 s per prediction and the lowest throughput at 360.58 samples per second, indicating slower processing and less efficiency in handling large volumes of data in real-time scenarios.
On the other hand, the ANN models integrated with data balancing techniques, such as SMOTE, BorderlineSMOTE, and ADASYN, demonstrate significant improvements in real-time performance. The ANN + BorderlineSMOTE model, in particular, achieves the highest throughput of 510.53 samples per second with a markedly lower latency of 0.107312 s per prediction. This suggests that BorderlineSMOTE not only enhances the model’s classification accuracy but also optimizes its efficiency in real-time processing.
Similarly, the ANN + ADASYN model achieves the lowest latency among the tested models, at 0.089748 s per prediction, along with a robust throughput of 444.19 samples per second. The ANN + SMOTE model also shows considerable gains, with a latency of 0.107679 s and a throughput of 406.59 samples per second, demonstrating the effectiveness of SMOTE in improving both classification performance and processing speed.
Overall, these findings underscore the benefits of integrating advanced data balancing techniques with ANNs. Not only do these techniques enhance the accuracy of the models when handling imbalanced datasets, but they also significantly improve real-time processing capabilities, making the models more suitable for deployment in time-sensitive network security environments.
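Latency and throughput figures of this kind can be obtained by timing repeated prediction calls. The sketch below shows one possible measurement procedure; the exact procedure behind Table 8 is not described in this section, so the batch size, the number of repeats, and the stand-in `dummy_predict` function (a random linear layer used in place of a trained ANN) are all assumptions.

```python
import time
import numpy as np

def measure_latency_throughput(predict, X, repeats=5):
    """Return (mean seconds per predict call, samples processed per second)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(X)
        durations.append(time.perf_counter() - start)
    latency = sum(durations) / repeats
    # Guard against a zero reading on very coarse timers.
    throughput = len(X) / latency if latency > 0 else float("inf")
    return latency, throughput

# Stand-in for a trained model: a fixed random linear layer with argmax.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))          # 8 hypothetical features, 4 classes

def dummy_predict(X):
    return np.argmax(X @ W, axis=1)

X_batch = rng.normal(size=(256, 8))  # hypothetical batch of 256 packets
latency, throughput = measure_latency_throughput(dummy_predict, X_batch)
print(f"latency={latency:.6f} s per call, throughput={throughput:.1f} samples/s")
```

Note that when latency is measured per call on a batch, throughput is the batch size divided by that latency, so the two metrics need not be reciprocals of each other.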
Table 9 provides a comparative analysis of various studies on firewall packet classification, highlighting the performance of different models across a range of datasets and feature sets. In comparison to other studies, the ANN model combined with BorderlineSMOTE in our investigation demonstrates a high level of accuracy at 97%, with a weighted average of 96%. This performance is notably strong, especially when considering the complex nature of the data and the challenges associated with class imbalance, which were effectively mitigated using BorderlineSMOTE.
The comparison shows that our model’s performance is competitive with other advanced classifiers, such as Random Forest (RF) and Support Vector Machine (SVM), which are commonly used in similar contexts. For instance, the Random Forest model in AL-Tarawneh and Bani-Salameh’s [
8] study achieved a slightly higher accuracy of 99.7%, though this model’s recall was lower at 0.85, indicating that while it was highly precise, it may not have been as effective at identifying all relevant instances. Similarly, the study by Rahman et al. [
25] reported a 99% accuracy using SVM, RF, k-NN, and Logistic Regression, which is slightly higher than our results but was obtained without addressing the specific challenges of imbalanced datasets that our study confronts.
Furthermore, the diversity of models used across the different studies, such as Decision Trees (DTs), SVMs with various activation functions, and ensemble methods, highlights the complexity of achieving high accuracy in firewall packet classification. Despite this complexity, our study’s results are robust, showcasing the efficacy of integrating ANNs with data balancing techniques like BorderlineSMOTE to achieve a balanced, high-performance model. This positions our approach as a strong contender in the domain, particularly in tasks requiring both high accuracy and the ability to handle class imbalance effectively.
6. Discussion
The findings of this study offer valuable insights into the application of Artificial Neural Networks (ANNs) combined with advanced data balancing techniques—specifically SMOTE, ADASYN, and BorderlineSMOTE—for the classification of firewall packets. The results not only demonstrate the robustness of ANNs in this context but also emphasize the critical importance of addressing class imbalance to enhance model performance across all categories.
Our analysis revealed that integrating ANNs with BorderlineSMOTE resulted in the most effective model, achieving an accuracy of 97% with a weighted average precision and recall of 96%. This model’s strong and consistent performance across all classes, including minority ones, is further underscored by its real-time efficiency, as evidenced by the latency and throughput metrics in
Table 8. With a latency of 0.107312 s per prediction and a throughput of 510.53 samples per second, the ANN + BorderlineSMOTE model not only achieves high classification accuracy but also demonstrates superior real-time processing capabilities, making it a robust tool for practical network security applications.
In comparison to prior studies, such as the Random Forest (RF) model used by AL-Tarawneh and Bani-Salameh [
8], which achieved slightly higher accuracy at 99.7%, our approach offers a more balanced performance across all classes, particularly those that are underrepresented. The RF model’s lower recall (0.85) indicates potential struggles with accurately identifying minority classes, a challenge our model effectively addresses through the use of BorderlineSMOTE.
Ertam and Kaya [
9] explored Support Vector Machines (SVMs) with different activation functions, achieving a high recall of 0.985 with the sigmoid function, but their results were less consistent across metrics. In contrast, our model with BorderlineSMOTE maintained a more balanced performance, achieving high precision, recall, and F1-Scores across all classes. Furthermore, our model exhibited lower latency and higher throughput compared to traditional methods, making it more suitable for real-time applications.
The study by AL-Behadili [
19], which reported a 99.839% accuracy using Decision Trees (DTs), highlights the necessity of addressing class imbalance, an area that our research specifically targeted. While their model achieved high accuracy, it did not address the complexities of imbalanced data, which are critical for ensuring robust model performance. Our findings indicate that the integration of data balancing techniques like BorderlineSMOTE can significantly improve the model’s ability to handle such challenges, as reflected in both the classification accuracy and real-time performance.
Similarly, Rahman et al. [
25] achieved 99% accuracy using a combination of SVM, Random Forest, k-NN, and Logistic Regression. However, their lack of focus on class imbalance limits the generalizability of their findings to datasets with underrepresented classes. Our approach, which integrates ANNs with data balancing techniques, particularly BorderlineSMOTE, provides a more comprehensive solution, ensuring balanced classification across all classes and demonstrating efficiency in real-time processing.
Sharma et al. [
31] employed various classifiers, achieving 99.8% accuracy with a stacking ensemble. However, the direct handling of class imbalance through techniques like BorderlineSMOTE, as demonstrated in our study, offers a more targeted approach to ensuring consistent performance across all classes, particularly the minority ones. Additionally, our model’s favorable real-time metrics highlight its practicality in real-world security scenarios where quick and accurate decision-making is essential.
The comparison with prior studies underscores the importance of not only achieving high accuracy but also ensuring that models perform consistently across all classes, particularly in imbalanced datasets. Our findings suggest that the integration of ANNs with targeted data balancing techniques like BorderlineSMOTE offers a robust solution for firewall packet classification, especially in applications where the correct classification of minority classes is critical. The added benefit of superior real-time performance further strengthens the case for using such integrated approaches in operational environments.
Future research could explore further combinations of ANNs with other advanced techniques, such as ensemble methods or hybrid approaches that incorporate different data balancing strategies. Additionally, expanding the scope of datasets to include more diverse network traffic scenarios could enhance the generalizability of these findings. Addressing these aspects will contribute to the development of more sophisticated and effective network security solutions.
In conclusion, while our study’s accuracy is slightly lower than some of the highest-reported figures in the literature, the balanced and consistent performance across all metrics, combined with the enhanced real-time processing capabilities, makes it a compelling approach for firewall packet classification. This study highlights the importance of addressing class imbalance and provides a foundation for future exploration in this area, contributing to the broader field of cybersecurity and machine learning.