1. Introduction
A network intrusion is a brief interruption in regular network activity. Some disruptions are produced maliciously by attackers, such as a denial of service (DoS) attack on an internet protocol (IP) network, while others are entirely accidental, such as an overpass collapsing onto a busy road network [1]. Rapid detection is required to launch a rapid response, such as dispatching an ambulance following a traffic accident or sounding an alert when a surveillance network detects an intruder. Network monitoring equipment collects data at a rapid pace. As a result, developing an efficient anomaly detection system entails extracting important information from enormous amounts of noisy data. It is also critical to develop distributed algorithms, since networks have bandwidth and power restrictions and communication costs must be kept to a minimum. As technology, software, and network topologies grow more sophisticated, cyber intrusions continue to evolve. To protect networks from hostile cybersecurity threats, intrusion detection systems (IDS) are highly recommended [1]. Firewalls and other rule-based security mechanisms are commonly used to safeguard contemporary data centers against network threats. Large distributed multi-cloud systems, on the other hand, would need a large number of complicated rules, which might be costly, time-consuming, and error-prone [2]. Furthermore, developments in computing have enabled attackers to scale attacks, as in the emergence of distributed DoS (DDoS) attacks, which are seldom detected by ordinary firewalls [3]. As a result, a firewall alone is insufficient to offer total system security in multi-cloud environments.
Deep neural networks can learn complicated patterns of abnormalities directly from network traffic data; therefore, machine learning approaches have attracted a lot of attention in recent years [4]. Real-world traffic data, on the other hand, are large-scale, noisily labeled, and class-imbalanced. In other words, the traffic data contain millions of unevenly distributed samples, with infrequent anomalies and an overwhelming share of normal traffic. The majority of current network datasets do not fulfill real-world requirements and are therefore unsuitable for modern networks. Conventional datasets, such as kddcup99 [5] and UNSW-NB15 [6], have nevertheless received a lot of attention in the literature, and methods evaluated on them have reported high performance. In this study, we therefore focus on the problem of large-scale (millions of samples) and highly imbalanced traffic data by training, validating, and testing the proposed solution using the ZYELL dataset [7,8].
In this paper, the data were collected using IoT technology, which can access data from different cloud environments, including vehicular networks and numerous corporations. The collected data were therefore massive in volume; they were transmitted via IoT technology to the data preprocessing layer without delay or lag and were then saved in blockchain-secured clouds.
The proposed model uses machine learning algorithms to distinguish network meddling from normal communication transactions. It uses blockchain technology to mitigate security threats to the trained models and the preprocessed data. With the help of machine learning and blockchain technology, the proposed model achieved the highest detection accuracy among the studies compared in this work.
2. Literature Review
A great number of research studies on IDS employing machine learning approaches have been published in the literature. For example, the authors of [9,10] employed support vector machines (SVMs) to find abnormalities in the KDD dataset [5]. The authors of [11] constructed IDS models for anomaly detection using artificial neural networks on the same dataset. The author of [12] employed cascade classifiers to detect and classify anomalies in the KDD dataset, even when they were unevenly distributed. References [13,14] suggested using decision trees (DT) and random forest (RF) to discover anomalies. Furthermore, the authors of [15] presented hybrid approaches that make use of two or more machine learning algorithms; their results suggest that hybrid strategies outperform individual models. For further information on machine learning methodologies in IDS, readers can refer to the surveys in [16,17,18].
The SVM is used for classification and regression problems [19]. It belongs to the family of generalized linear classifiers and has the special property of simultaneously maximizing the geometric margin and minimizing the empirical classification error; for this reason, the SVM is also known as the maximum margin classifier. The SVM maps its input vectors into a higher-dimensional space in which a maximally separating hyperplane is constructed. Two parallel hyperplanes are constructed, one on each side of the separating hyperplane, and the separating hyperplane is chosen to maximize the distance between these two parallel hyperplanes. The SVM considers the data points $\{(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), \ldots, (x_n, y_n)\}$, where each $x_i$ is an input vector and $y_i$ is its class label.
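As a sketch of the maximum-margin idea described above, the standard hard-margin formulation for linearly separable data can be written as follows. The symbols $w$ (normal vector of the hyperplane), $b$ (offset), and the label convention $y_i \in \{-1, +1\}$ are textbook assumptions introduced here for illustration; they do not appear in the original text, and the SVM trained later in this paper uses a cubic kernel with a box constraint rather than this linear, hard-margin form.

```latex
\begin{aligned}
& \min_{w,\,b} \ \tfrac{1}{2}\,\lVert w \rVert^{2} \\
& \text{subject to } \ y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \dots, n .
\end{aligned}
```

Under these assumptions the two parallel hyperplanes are $w^{\top} x + b = +1$ and $w^{\top} x + b = -1$, and the distance between them is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin.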
KNN is a case-based (instance-based) learning technique [20] that makes classification decisions using all of the training data. Because it is a lazy learning method that defers all computation to classification time, it is unsuitable for many applications, such as dynamic web mining over a large collection. One strategy to improve its efficiency is to find a small set of representatives for the training data; that is, to build an inductive learning model from the training dataset and use this model (the representatives) for classification. Many algorithms, such as DT and neural networks (NN), were designed to produce such a model, and efficiency is one of the factors by which such algorithms are assessed. We chose these two machine learning algorithms because the SVM is a parametric linear solver whose success depends largely on tuning parameters that reduce the overall generalization error, whereas KNN is cost-effective for classification and, being non-parametric, can handle large datasets on its own.
The authors of [21] employed a four-layer classification algorithm on the KDD dataset to detect four types of attacks. The findings indicate a low classification error and a low overall error rate. In addition, the authors reported that reducing the number of features in the original dataset lowered complexity and improved accuracy. However, the authors reported no misclassification of one attack type as another. The same dataset was examined with typical unsupervised and supervised learning techniques in [22]. Several attacks were misclassified, resulting in lower overall accuracy than in the research presented in [21].
The KDD dataset has been extensively used to train machine learning models for anomaly detection and classification. It contains four different types of attacks with radically different traffic characteristics. Reference [21] presents a technique for categorizing attacks using KDD that achieved remarkably low misclassification errors. However, such models may not be suitable for today’s multi-cloud environments, where attacks keep evolving and are closely related to one another. Furthermore, the KDD dataset is aging and may no longer reflect current real-time traffic patterns and network threats [23]. Moreover, a newly launched attack would most likely go undetected, resulting in a high error rate, as documented in [22].
Machine learning has been used to improve network security in a variety of settings. Buczak et al. [24] and Hodo et al. [25] focused on malware detection in cyber networks using supervised and unsupervised learning. DaCosta et al. [26] examined intrusion detection in the context of IoT using machine learning applications. Ucci et al. [27], Tahsien et al. [28], and Hussain et al. [29] addressed the possible dangers of the IoT and proposed machine learning methods to address these issues. Gibert et al. [30] used supervised and unsupervised learning to identify and classify malware in a Windows system. Nassif et al. [31] investigated cloud network threats and how to defend the cloud network using supervised learning.
In addition to detecting anomalous activity, machine learning may be used in network administration to avoid possible attacks. Jin et al. [32] used reinforcement learning to determine the appropriate scheduling method for managing intranet traffic while keeping security in mind. Every user has a reputation value that indicates how reliable their traffic is. The reinforcement learning state is described by the available bandwidth of the connections and the flows waiting to be scheduled. In the suggested approach, an action is assigned to each flow, and each action consists of the bandwidth allocated to that flow. Link utilization, queue length, latency, and the user’s trust level all contribute to the scheduler’s performance.
Table 1 shows the methodologies and limitations of previous studies. Jin et al. [32] used reinforcement learning to detect insider threats using a public dataset; the approach achieved 98% accuracy, but data preprocessing and imbalanced classes were its limitations. Hamamato et al. [33] used k-means to detect DDoS and DoS attacks using a public dataset; the approach achieved 96.53% accuracy, but it performed worse in a supervised setting and with imbalanced classes. Gu et al. [34] used k-means to detect DDoS attacks using a public dataset; the approach achieved 98.9% accuracy, but high-recall detection and a lack of classification were its limitations. Alauthman et al. [35] used a neural network to detect botnet attacks using a public dataset; the approach achieved 98.3% accuracy, but data preprocessing and imbalanced classes were its limitations. Smadi et al. [36] used a neural network to detect phishing threats using a public dataset; the approach achieved 98.6% accuracy, but machine learning and imbalanced classes were its limitations. Xu et al. [37] used Q-learning to detect general network threats using a public dataset and achieved 95% accuracy; however, data preprocessing and imbalanced classes were its limitations. Sethi et al. [38] used reinforcement learning to detect DoS threats using a public dataset and achieved 97.8% accuracy; however, feature preprocessing and imbalanced classes were its limitations. Rashid et al. [39] used numerous machine learning approaches, including SVM, DT, KNN, RF, and LR, to detect cyber-attacks in IoT-based smart city applications with the UNSW-NB15 and CICID2017 datasets. Their system achieved more than 95% cyber-attack detection accuracy, but data fuzzing and feature engineering were its limitations. Ofori et al. [40] used machine learning to predict threats at early stages with the help of LR, DT, NB, and RF, achieving 70% performance in predicting cyber threats; however, data control and attributions were its limitations.
The major contributions of this study are as follows:
The proposed framework employed machine learning approaches to detect network attacks.
The proposed model used blockchain technology to secure the trained models and communicated data.
The proposed study used numerous statistical parameters to evaluate the performance and authenticity of models.
3. Materials and Methods
3.1. Blockchain Module
A blockchain block comprises a large amount of transaction information, including the block ID, a hash of the previous block, transaction information, a null byte, and headers. In a blockchain system, where miners search for the appropriate hash with which to add a block, contenders check for a current block before looking for a new one. The proof-of-work process is used to determine the legitimacy of a given block of transactions. The phases that follow illustrate the essential parts of blockchain technology. Any node in a computerized health system that is connected to the internet must communicate with a storage database as well as with settlers on a private blockchain. Unprocessed transactions are held in the blockchain until a new block is assigned for authentication. Many transactions are examined before being quickly processed by the Merkle tree, a binary hash tree. Blockchain technology will be utilized to develop a new ecosystem of connected smart medical systems since it is diversified and compatible with internet of medical things applications. The blockchain module provides the proposed model with the following protection.
Onset: Before the data protection parameters can be applied, the cryptographic algorithms must first be initialized.
Encrypting: The user defines the block index and the proceeding detail, where the block index serves as the box index and the proceeding detail as the plaintext. This gives the proposed model the value Encrypt(index, text).
Encampment: The proposed model contains a box index unit, after which the machine learning model delivers security keys .
Defy: It only allows one to complete the full iteration, initiate (create), and send two messages (text1, text2) for each . After completing the iteration and calculation, if it has 0, then , or else it repeats the full iteration.
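The block structure described in Section 3.1 (block ID, hash of the previous block, transactions summarized by a Merkle tree, and a proof-of-work check) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the field names, the use of SHA-256, the `nonce` counter, and the "leading zero hex digits" difficulty rule are all hypothetical choices, and the Onset, Encrypting, Encampment, and Defy steps listed above are not modeled.

```python
import hashlib
import json


def sha256_hex(data):
    """Return the SHA-256 digest of a bytes object as a hex string."""
    return hashlib.sha256(data).hexdigest()


def merkle_root(transactions):
    """Fold transaction hashes pairwise (binary hash tree) into one root."""
    if not transactions:
        return sha256_hex(b"")
    level = [sha256_hex(tx.encode()) for tx in transactions]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [sha256_hex((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]


def block_hash(block):
    """Hash only the header fields, so the stored hash is not self-referential."""
    header = {key: block[key]
              for key in ("block_id", "prev_hash", "merkle_root", "nonce")}
    return sha256_hex(json.dumps(header, sort_keys=True).encode())


def mine_block(block_id, prev_hash, transactions, difficulty=2):
    """Proof of work: increment the nonce until the hash has `difficulty` leading zeros."""
    block = {
        "block_id": block_id,
        "prev_hash": prev_hash,
        "merkle_root": merkle_root(transactions),
        "transactions": list(transactions),
        "nonce": 0,
    }
    target = "0" * difficulty
    while not block_hash(block).startswith(target):
        block["nonce"] += 1
    block["hash"] = block_hash(block)
    return block


def chain_is_valid(chain, difficulty=2):
    """Check hashes, proof of work, Merkle roots, and links between blocks."""
    target = "0" * difficulty
    for i, block in enumerate(chain):
        if block["hash"] != block_hash(block):
            return False
        if not block["hash"].startswith(target):
            return False
        if block["merkle_root"] != merkle_root(block["transactions"]):
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True


if __name__ == "__main__":
    genesis = mine_block(0, "0" * 64, ["model-v1 stored", "batch-1 stored"])
    nxt = mine_block(1, genesis["hash"], ["batch-2 stored"])
    print(chain_is_valid([genesis, nxt]))
```

In this sketch, altering any stored transaction changes the recomputed Merkle root, so `chain_is_valid` rejects the chain; this is the tamper-evidence property that motivates storing trained models and preprocessed data behind a blockchain layer.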
3.2. Anomaly Detection Module
Figure 1 depicts the overall picture of general network meddling detection, empowered with machine learning and entangled with blockchain technology. The proposed model consists of four phases. In the first phase, the proposed model uses the Internet of Things (IoT) to collect data from the vehicular network and, using PyShark, from general network communication transactions; it then passes the raw data to the data preprocessing phase. In this second phase, the proposed model applies data cleaning techniques: removing unwanted outliers, fixing structural errors in the data, removing redundant data to save the model's time costs, cleaning out erroneous or null values, and fixing duplicate and missing values. Right after data preprocessing, the proposed model applies feature extraction with the help of embedded methods and then stores the preprocessed data in blockchain-entangled private clouds for further processing. In the third phase, the training layer imports the training data from a private cloud to train the machine learning algorithms (SVM and KNN) and checks the performance of the trained model. If the learning criteria are met, the model is stored in a private cloud, Z; otherwise, the model is retrained. In the final phase, the testing layer imports the trained model from the private cloud, Z, imports the testing data from the private cloud, and starts the testing process. If network meddling is detected, the data are processed as spam; otherwise, the data are passed on for further communications and transactions. To evaluate the overall testing performance, the proposed model applies various statistical parameters [40,41,42,43,44,45,46,47,48,49,50], i.e., detection accuracy (DA), sensitivity, F1-score, specificity, likelihood negative ratio (LNR), false positive rate (FPR), false negative rate (FNR), misclassification rate (MCR), likelihood positive ratio (LPR), positive predicted value (PPV), and negative predicted value (NPV), to check the performance of the proposed framework. All equations are explained below.
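For reference, the statistical parameters named above have the following standard definitions in terms of the confusion matrix counts. The symbols TP, TN, FP, and FN (true positives, true negatives, false positives, false negatives) and the convention that an attack is the positive class are assumptions made here for illustration; the text does not state them explicitly.

```latex
\begin{aligned}
\mathrm{DA}   &= \frac{TP + TN}{TP + TN + FP + FN}, &
\mathrm{MCR}  &= \frac{FP + FN}{TP + TN + FP + FN}, \\[4pt]
\mathrm{Sen}  &= \frac{TP}{TP + FN}, &
\mathrm{Spec} &= \frac{TN}{TN + FP}, \\[4pt]
\mathrm{PPV}  &= \frac{TP}{TP + FP}, &
\mathrm{NPV}  &= \frac{TN}{TN + FN}, \\[4pt]
\mathrm{FPR}  &= \frac{FP}{FP + TN} = 1 - \mathrm{Spec}, &
\mathrm{FNR}  &= \frac{FN}{FN + TP} = 1 - \mathrm{Sen}, \\[4pt]
\mathrm{LPR}  &= \frac{\mathrm{Sen}}{1 - \mathrm{Spec}}, &
\mathrm{LNR}  &= \frac{1 - \mathrm{Sen}}{\mathrm{Spec}}, \\[4pt]
\text{F1-score} &= \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{Sen}}{\mathrm{PPV} + \mathrm{Sen}} .
\end{aligned}
```

Note that LPR and LNR are ratios rather than percentages, whereas the remaining quantities are usually reported as percentages.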
The computational complexities of the two classification algorithms utilized in this work are briefly discussed below:
The KNN algorithm does not build a model from the training data in advance; rather, it performs the nearest neighbor calculation online. The algorithm calculates the distance between the test sample and each training sample. Once all distances are computed, they are sorted in ascending order to identify the nearest neighbors. If there are $m$ attributes in the training data, each training sample is a point in an $m$-dimensional attribute space. The computational complexity of the KNN algorithm depends on the number of features ($m$), the number of training samples ($n$), and the value of $k$, which is chosen to be an odd number. The following Euclidean distance formula, written here for two points $A = (y_1, y_2, \ldots, y_m)$ and $B = (z_1, z_2, \ldots, z_m)$, is most frequently used to calculate distances:
$$d(A, B) = \sqrt{\sum_{i=1}^{m} (y_i - z_i)^2}$$
Since $m$ multiplications are required to calculate the distance between each training sample and the test sample, the complexity of Euclidean distance-based KNN is $O(mn)$. Sorting the distances adds to the cost; the merge sort algorithm, for instance, requires $O(n \log n)$. Therefore, the overall complexity is $O(mn + n \log n)$. In an $m$-dimensional feature space, if $A$ and $B$ are two points with feature vectors $A = (y_1, y_2, \ldots, y_m)$ and $B = (z_1, z_2, \ldots, z_m)$, respectively, the cosine similarity between them is given by
$$\cos(A, B) = \frac{\sum_{i=1}^{m} y_i z_i}{\sqrt{\sum_{i=1}^{m} y_i^2}\,\sqrt{\sum_{i=1}^{m} z_i^2}}$$
It takes $m$ multiplications to compute the dot product in the numerator. Computing the Euclidean length of each vector in the denominator requires a further $m$ multiplications per vector. Therefore, $3m$ multiplications are required to calculate the similarity. When the complexity of the sorting algorithm is taken into account, the total complexity is $O(3mn + n \log n)$ [51].
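The procedure analyzed above (compute the distance from the test sample to every training sample, sort the distances, and take a vote among the k nearest) can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used in this study; the function names, the toy data, and the majority-vote tie handling are assumptions.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance: m multiplications for m-dimensional vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity: about 3m multiplications (dot product, two norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(train_x, train_y, query, k=3, distance=euclidean):
    """Classify `query` by majority vote among its k nearest training samples."""
    # O(mn): one distance computation per training sample.
    scored = [(distance(x, query), label) for x, label in zip(train_x, train_y)]
    # O(n log n): sort distances in ascending order.
    scored.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical two-feature toy data, for illustration only.
    train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
    train_y = ["normal", "normal", "normal", "attack", "attack", "attack"]
    print(knn_predict(train_x, train_y, (0.2, 0.1), k=3))
    print(knn_predict(train_x, train_y, (5.5, 5.2), k=3))
```

Choosing an odd k, as the text suggests, avoids tied votes in a two-class problem. Sorting all n distances is the simplest way to match the stated complexity; selecting only the k smallest would be cheaper but is not what the analysis above describes.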
The SVM is frequently used to effectively address both regression and classification problems. The SVM accuracy is generally greater than KNN accuracy. However, both the complexity and training time of the SVM are greater than those of KNN. Abdiansah et al. give the complexity of the SVM as $O(n^3)$ in [52].
5. Simulation and Results
In this research, the proposed framework employed ML, empowered with blockchain technology, for the detection of network meddling. The simulation, including training and testing, was run on a 2017 MacBook Pro with a Core i5 processor, 16 GB of RAM, a 512 GB SSD, and an embedded GPU. To remove the major discrepancies in the data, the proposed model applied numerous data preprocessing techniques and stored the preprocessed data in private clouds. The proposed framework employed the SVM and KNN machine learning algorithms to train and test on network meddling data. All phases of the proposed model were discussed in this research. The proposed framework applied various statistical performance parameters to measure the performance and authenticity of the model.
Figure 2 shows the training performance of the SVM. It depicts the minimum classification error (MCE) of the SVM over 30 iterations; the optimized SVM uses a box constraint level of 0.0010088 with a cubic kernel function and achieves an MCE of 0.0091778 at the 13th iteration, which was the best among all models.
Table 3 shows the training confusion matrix of the SVM. The proposed model correctly detected 27,190 attack instances and 2290 normal instances; it wrongly detected 120 instances.
Table 4 shows the statistical parameter results of training the SVM; the proposed model achieved 99.59%, 0.41%, 99.96%, 95.42%, 99.78%, 99.60%, 99.57%, 4.58%, 0.04%, 21.81%, 0.00, and 99.78% of DA, MCR, Sen, Spec, F1-score, PPV, NPV, FPR, FNR, LPR, LNR, and FMI, respectively.
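As a worked check, the SVM training values can be recomputed from confusion matrix counts with a short Python sketch. Two things here are assumptions rather than statements from the text: the split of the 120 wrongly detected instances into 110 false positives and 10 false negatives is inferred from the reported specificity and sensitivity, and FMI is taken to be the Fowlkes-Mallows index, the geometric mean of PPV and sensitivity. The function name and dictionary keys are illustrative only.

```python
import math


def confusion_metrics(tp, fn, fp, tn):
    """Compute the evaluation parameters from binary confusion matrix counts.

    Attack is treated as the positive class (an assumption). Degenerate
    cases such as an empty class are not handled in this sketch.
    """
    total = tp + fn + fp + tn
    sen = tp / (tp + fn)    # sensitivity (true positive rate)
    spec = tn / (tn + fp)   # specificity (true negative rate)
    ppv = tp / (tp + fp)    # positive predicted value
    npv = tn / (tn + fn)    # negative predicted value
    fpr = fp / (fp + tn)    # false positive rate
    fnr = fn / (fn + tp)    # false negative rate
    return {
        "DA": (tp + tn) / total,
        "MCR": (fp + fn) / total,
        "Sen": sen,
        "Spec": spec,
        "F1": 2 * ppv * sen / (ppv + sen),
        "PPV": ppv,
        "NPV": npv,
        "FPR": fpr,
        "FNR": fnr,
        "LPR": sen / fpr,   # a ratio, not a percentage
        "LNR": fnr / spec,  # a ratio, not a percentage
        "FMI": math.sqrt(ppv * sen),
    }


if __name__ == "__main__":
    # SVM training counts: 27,190 attacks and 2290 normal instances detected
    # correctly; the 110/10 split of the 120 errors is inferred, not reported.
    metrics = confusion_metrics(tp=27190, fn=10, fp=110, tn=2290)
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

With these inferred counts, the sketch yields a detection accuracy of about 99.59%, a specificity of about 95.42%, and an LPR of about 21.81, in line with the values reported for Table 4.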
Table 5 shows the training confusion matrix of KNN. The proposed model detected 27,190 attack instances and 2200 normal instances; it wrongly detected 220 instances.
Table 6 shows the statistical parameter results of training KNN; the proposed model achieved 99.29%, 0.71%, 99.63%, 95.24%, 99.62%, 99.60%, 99.65%, 4.76%, 0.37%, 20.92%, 0.00, and 99.62% of DA, MCR, Sen, Spec, F1-score, PPV, NPV, FPR, FNR, LPR, LNR, and FMI, respectively.
Table 7 shows the testing confusion matrix of SVM. The proposed framework detected 6400 attack instances and 930 normal instances; the proposed model wrongly detected 70 instances.
Table 8 shows the statistical parameter results of testing the SVM; the proposed model achieved 99.05%, 0.95%, 99.84%, 93.94%, 99.46%, 99.07%, 98.94%, 6.06%, 0.16%, 16.47%, 0.00, and 99.46% of DA, MCR, Sen, Spec, F1-score, PPV, NPV, FPR, FNR, LPR, LNR, and FMI, respectively.
Table 9 shows the testing confusion matrix of KNN. The proposed model detected 6400 attack instances and 850 normal instances; it wrongly detected 150 instances.
Table 10 shows the statistical parameter results of testing KNN; the proposed model achieved 97.97%, 2.03%, 99.84%, 85.86%, 98.84%, 97.86%, 98.84%, 14.14%, 0.16%, 7.06%, 0.00, and 98.85% of DA, MCR, Sen, Spec, F1-score, PPV, NPV, FPR, FNR, LPR, LNR, and FMI, respectively.
Table 11 compares the current research with previous studies. Jin et al. [32] used reinforcement learning to detect insider threats using a public dataset, achieving 98% accuracy. Hamamato et al. [33] used k-means to detect DDoS and DoS attacks using a public dataset, achieving 96.53% accuracy. Gu et al. [34] used k-means to detect DDoS attacks using a public dataset, achieving 98.9% accuracy. Alauthman et al. [35] used a neural network to detect botnet attacks using a public dataset, achieving 98.3% accuracy. Smadi et al. [36] used a neural network to detect phishing threats using a public dataset, achieving 98.6% accuracy. Xu et al. [37] used Q-learning to detect general network threats using a public dataset, achieving 95% accuracy. Sethi et al. [38] used reinforcement learning to detect DoS threats using a public dataset, achieving 97.8% accuracy. Rashid et al. [39] used numerous machine learning approaches (including SVM, DT, KNN, RF, and LR) to detect cyber-attacks in IoT-based smart city applications with the UNSW-NB15 and CICID2017 datasets; their system achieved more than 95% cyber-attack detection accuracy. Ofori et al. [40] used machine learning to predict threats at early stages with the help of LR, DT, NB, and RF, achieving 70% accuracy in predicting cyber threats. The proposed model used machine learning algorithms to detect general network meddling and achieved 99.05% DA, the highest among the compared studies. The proposed model also provides a blockchain-secured cloud for the models and communication data.