CN114048795B - Service type identification method based on PCA and XGBoost fusion - Google Patents
- Publication number
- CN114048795B
- Authority
- CN
- China
- Prior art keywords
- xgboost
- dimensional
- data set
- parameter
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2134—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a service type identification method based on the fusion of PCA and XGBoost, comprising the following steps: step S1, collecting a labeled network traffic data set whose service types include HTTP, NTP, DNS, QQ, WeChat, video and mail; step S2, performing data cleaning and feature extraction on the network traffic data set of step S1 to obtain a network traffic data set containing multi-dimensional features; step S3, reducing the multi-dimensional features of step S2 to low-dimensional features by principal component analysis (PCA) feature dimension reduction to obtain a labeled low-dimensional data set; step S4, obtaining a trained extreme gradient boosting (XGBoost) classification model; and step S5, inputting the network traffic data set to be tested into the XGBoost classification model of step S4 to obtain a service type classification result. The invention both reduces the complexity of the identification method and improves the accuracy of service type identification.
Description
Technical Field
The invention relates to the technical field of communication networks, in particular to a service type identification method based on PCA and XGBoost fusion.
Background
With the continuous development of information technology, the scale of internet traffic is increasing year by year, and new network services keep emerging. These services have greatly promoted social development and have attracted a large number of customers for telecom operators. However, the network now carries many kinds of encrypted traffic, which strongly affects both the underlying traffic model and the upper-layer application model of the network. To improve network management and network service and to keep the network environment secure, effectively identifying the encrypted traffic of various application services, and thereby building an operable and manageable network, has become a key research direction.
Conventional service type identification methods include port-based traffic identification and deep packet inspection (DPI)-based traffic identification. Port-based traffic identification classifies network traffic by the well-known port numbers in the TCP/UDP packet header. It was initially efficient and easy to implement for real-time traffic classification, but many network applications now avoid well-known ports to escape detection, and some use dynamic port numbers. Port-based classification therefore no longer gives reliable results, and its accuracy is low. DPI-based traffic identification is essentially a packet filtering technology: besides parsing the headers of the data link layer (L2), the network layer (L3) and the transport layer (L4), DPI also analyzes the application-layer (L7) payload, so that various application types and their contents can be identified. However, because most traffic now uses encryption that prevents inspection of packet payloads, the classification accuracy of DPI is also limited. The current trend is to use machine learning methods for IP traffic classification.
In recent years, artificial intelligence technology based on machine learning has attracted wide attention in computer vision, natural language processing, speech recognition, medical imaging and other areas, and has far outperformed conventional solutions in many fields. Machine learning has proven sound and effective for classification tasks, and machine learning and data mining technologies are increasingly applied in cyberspace security, so machine learning also makes it possible to solve the encrypted traffic classification problem that traditional methods cannot solve. Traditional service type identification methods cannot identify encrypted traffic and suffer from low identification accuracy.
Disclosure of Invention
The invention aims to solve the technical problem of overcoming the defects of the prior art and providing a service type identification method based on the fusion of PCA and XGBoost, which not only can reduce the complexity of the identification method, but also can improve the accuracy of service type identification.
The invention adopts the following technical scheme for solving the technical problems:
The invention provides a service type identification method based on PCA and XGBoost fusion, which comprises the following steps:
Step S1, collecting a labeled network traffic data set, wherein the service types of the network traffic data set include HTTP, NTP, DNS, QQ, WeChat, video and mail;
Step S2, performing data cleaning and feature extraction on the network traffic data set of step S1, thereby obtaining a network traffic data set containing multi-dimensional features;
Step S3, reducing the multi-dimensional features of step S2 to low-dimensional features by principal component analysis (PCA) feature dimension reduction, to obtain a labeled low-dimensional data set;
Step S4, selecting, from the labeled low-dimensional data set of step S3, key performance indicators that are highly relevant to the service type, these indicators forming a labeled data set; dividing this data set into a training set and a test set; inputting the training set into an extreme gradient boosting (XGBoost) classification model for training; tuning the learning rate γ and the regularization parameter λ of the XGBoost classification model with an improved parameter tuning method to obtain the γ and λ best suited to the network traffic data set; and testing the tuned XGBoost classification model, thereby obtaining a trained XGBoost classification model;
The method for tuning the learning rate γ and the regularization parameter λ specifically comprises the following steps:
Step S4.1, the objective function of the extreme gradient boosting (XGBoost) classification model is:
$$Obj = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j+\lambda} + \gamma T$$
where T is the number of leaf nodes, Obj is the objective function, $G_j$ is the first-derivative term of the Taylor expansion of the objective function at the j-th leaf node, and $H_j$ is the second-derivative term of the Taylor expansion of the objective function at the j-th leaf node;
Step S4.2, tuning the learning rate γ and the regularization parameter λ of step S4.1:
Step S4.2.1, setting the search space Φ and the search step size μ of the learning rate γ and of the regularization parameter λ as follows:
$$\gamma = (\gamma_{\text{start}},\ \gamma_{\text{end}},\ \mu_\gamma)$$
$$\lambda = (\lambda_{\text{start}},\ \lambda_{\text{end}},\ \mu_\lambda)$$
where $\gamma_{\text{start}}$ and $\gamma_{\text{end}}$ are the two boundaries of the search space $\Phi_\gamma$ of the learning rate γ, and $\mu_\gamma$ is the search step size of the learning rate γ; $\lambda_{\text{start}}$ and $\lambda_{\text{end}}$ are the two boundaries of the search space $\Theta_\lambda$ of the regularization parameter λ, and $\mu_\lambda$ is the search step size of the regularization parameter λ;
Step S4.2.2, generating a two-dimensional search parameter set matrix $H_S$ from the search spaces and search step sizes set above, whose entries are the parameter sets
$$H_S = \left[\,(\gamma_{\text{start}} + p\,\mu_\gamma,\ \lambda_{\text{start}} + q\,\mu_\lambda)\,\right]$$
where p and q are integers indexing the steps within the respective search spaces;
Step S4.2.3, for each parameter set in $H_S$ of step S4.2.2, evaluating the average classification accuracy of the XGBoost classification model under that parameter set, and selecting the parameter set with the highest average classification accuracy; if exactly one parameter set attains the highest average classification accuracy, it is the selected parameter set; if several parameter sets attain it, the one with the smallest $\lambda_{\text{start}} + q\,\mu_\lambda$ among them is selected;
Step S4.2.4, the values $\gamma_{\text{start}} + p\,\mu_\gamma$ and $\lambda_{\text{start}} + q\,\mu_\lambda$ of the parameter set selected in step S4.2.3 are the optimal learning rate γ and regularization parameter λ of the XGBoost classification model for the labeled low-dimensional data set of step S3;
Step S5, inputting the network traffic data set to be tested into the extreme gradient boosting (XGBoost) classification model of step S4 to obtain a service type classification result.
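The tuning procedure of steps S4.2.1 to S4.2.4 can be sketched as a plain grid search. This is a minimal illustration, not the patented implementation: the helper names `build_axis`, `tune` and `toy_accuracy` are hypothetical, the grid is assumed to run upward from the start value to the end value in fixed steps, and `toy_accuracy` is a stand-in for training the classifier and measuring its average classification accuracy.

```python
from typing import Callable, List, Tuple


def build_axis(start: float, end: float, step: float) -> List[float]:
    """Values start, start + step, ... up to end (assumes start < end, step > 0)."""
    count = int(round((end - start) / step))
    # Rounding keeps grid values free of floating-point drift such as 0.30000000000000004.
    return [round(start + i * step, 10) for i in range(count + 1)]


def tune(
    gammas: List[float],
    lambdas: List[float],
    evaluate: Callable[[float, float], float],
) -> Tuple[float, float, float]:
    """Scan every (gamma, lambda) pair; on ties keep the pair with the smallest lambda."""
    best_gamma = best_lambda = None
    best_score = float("-inf")
    for gamma in gammas:
        for lam in lambdas:
            score = evaluate(gamma, lam)
            is_better = score > best_score
            is_tie_with_smaller_lambda = score == best_score and lam < best_lambda
            if is_better or is_tie_with_smaller_lambda:
                best_gamma, best_lambda, best_score = gamma, lam, score
    return best_gamma, best_lambda, best_score


def toy_accuracy(gamma: float, lam: float) -> float:
    """Hypothetical stand-in for 'train the model and return its average accuracy'."""
    return 0.9 if abs(gamma - 0.1) < 1e-9 else 0.8


gammas = build_axis(0.1, 0.3, 0.1)    # search space of gamma
lambdas = build_axis(0.5, 1.5, 0.5)   # search space of lambda
best_gamma, best_lambda, best_score = tune(gammas, lambdas, toy_accuracy)
```

In a real pipeline `evaluate` would train the classifier on the training set and return its mean accuracy. Note that the `xgboost` Python package exposes `learning_rate`, `gamma` and `reg_lambda` as separate parameters; which of them corresponds to the γ and λ of this document is an assumption the reader has to settle.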
As a further optimization of the service type identification method based on the fusion of PCA and XGBoost according to the invention, the multi-dimensional feature F in step S2 is expressed as follows:
$$F = [f_1, f_2, f_3, \dots, f_d]$$
F is a vector containing d features, $f_i$ denotes the i-th key feature index, $1 \le i \le d$, and each $f_i$ is normalized by its maximum value as follows:
$$\tilde{f}_i = \frac{f_i}{\max(f_i)}$$
where $\max(f_i)$ is the maximum value taken by the i-th key feature index, and $\tilde{f}_i$ is the i-th key feature index after normalization.
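As a small illustration of the maximum-value normalization described above, the following sketch divides every feature column by the largest value observed in that column. The function name `max_normalize` and the sample values are hypothetical, and leaving an all-zero column unchanged is an assumption made only to avoid division by zero.

```python
from typing import List


def max_normalize(samples: List[List[float]]) -> List[List[float]]:
    """Divide each feature (column) by the maximum value observed for that feature."""
    if not samples:
        return []
    d = len(samples[0])
    col_max = [max(row[i] for row in samples) for i in range(d)]
    # Assumption: a column whose maximum is zero is left unchanged.
    return [
        [row[i] / col_max[i] if col_max[i] != 0 else row[i] for i in range(d)]
        for row in samples
    ]


# Hypothetical samples: [packet length, destination port]
packets = [[60.0, 443.0], [1500.0, 80.0], [750.0, 53.0]]
normalized = max_normalize(packets)
```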
As a further optimization of the service type identification method based on the fusion of PCA and XGBoost according to the invention, step S3 reduces the multi-dimensional features to low-dimensional features by principal component analysis feature dimension reduction; the original multi-dimensional features are assumed to be d-dimensional and the reduced low-dimensional features k-dimensional, where k < d;
Step S3.1, forming a matrix X of d rows and n columns from the network traffic data set containing multi-dimensional features of step S2, wherein this data set has n samples and each sample has d-dimensional features;
Step S3.2, centering the matrix X of d rows and n columns of step S3.1 to obtain the centered matrix X';
Step S3.3, computing the covariance matrix Cov of the matrix X' of step S3.2:
$$Cov = \frac{1}{n} X' X'^{T}$$
where the superscript T denotes the transpose;
Step S3.4, computing the eigenvalues and the corresponding eigenvectors of the covariance matrix;
Step S3.5, arranging the eigenvectors as the rows of a matrix, from top to bottom in descending order of their eigenvalues, and taking the first k rows to form the matrix P;
Step S3.6, Y = PX, where Y is the low-dimensional feature obtained by reducing the dimension to k by principal component analysis feature dimension reduction.
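Steps S3.1 to S3.6 can be sketched in plain Python as follows. This is an illustration under stated assumptions: the covariance is taken as X'X'ᵀ/n, the eigenvectors are found by power iteration with deflation (a technique swapped in here for brevity, not named in the source), and all function names and the sample matrix are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def center_rows(X: Matrix) -> Matrix:
    """Step S3.2: subtract each feature's mean (rows are features, columns are samples)."""
    return [[v - sum(row) / len(row) for v in row] for row in X]


def covariance(Xc: Matrix) -> Matrix:
    """Step S3.3: Cov = Xc * Xc^T / n for a centered d x n matrix."""
    d, n = len(Xc), len(Xc[0])
    return [[sum(Xc[a][s] * Xc[b][s] for s in range(n)) / n for b in range(d)]
            for a in range(d)]


def mat_vec(C: Matrix, v: List[float]) -> List[float]:
    return [sum(c * x for c, x in zip(row, v)) for row in C]


def top_k_eigenvectors(C: Matrix, k: int, iters: int = 500) -> Matrix:
    """Steps S3.4 and S3.5: leading k eigenvectors, largest eigenvalue first."""
    d = len(C)
    C = [row[:] for row in C]  # work on a copy; deflation modifies it
    rows: Matrix = []
    for _ in range(k):
        start = [float(i + 1) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in start))
        v = [x / norm for x in start]
        for _ in range(iters):
            w = mat_vec(C, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, mat_vec(C, v)))
        rows.append(v)
        # Deflation: remove the component just found before searching for the next one.
        C = [[C[a][b] - eigenvalue * v[a] * v[b] for b in range(d)] for a in range(d)]
    return rows


def project(P: Matrix, X: Matrix) -> Matrix:
    """Step S3.6: Y = P X, a k x n matrix."""
    n = len(X[0])
    return [[sum(p * X[a][s] for a, p in enumerate(row)) for s in range(n)] for row in P]


# Hypothetical data: d = 2 features, n = 4 samples; the second feature is twice the first.
X = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
cov = covariance(center_rows(X))
P = top_k_eigenvectors(cov, k=1)
Y = project(P, X)
```

Power iteration is adequate for a small feature count; for larger problems a library eigen-solver would be the usual choice.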
As a further optimization of the service type identification method based on the fusion of PCA and XGBoost according to the invention, an actual unlabeled data set is captured with the network packet capture tool Wireshark and input into the extreme gradient boosting (XGBoost) classification model with PCA dimension reduction, thereby obtaining a service type classification result.
As a further optimization of the service type identification method based on the fusion of PCA and XGBoost according to the invention, in step S4 the highly relevant key performance indicators include source IP, destination IP, source port number, destination port number, protocol type and data length.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
The method solves the problems that traditional service type identification methods cannot identify encrypted traffic and have low identification accuracy; by combining principal component analysis with the XGBoost algorithm, it balances test-set accuracy against algorithm complexity, so that efficient and reliable service type identification can be achieved.
Drawings
Fig. 1 is a flow chart of a service type identification system based on PCA and XGBoost provided by the present invention.
Fig. 2 is a schematic diagram of feature dimension reduction using Principal Component Analysis (PCA) algorithm.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
The invention applies the ideas of feature dimension reduction and deep learning to the service type identification scenario. Large volumes of network traffic data are first captured with the Wireshark network packet capture tool, and a service type identification method based on the fusion of PCA and XGBoost is proposed. Following the idea of feature dimension reduction, complex multi-dimensional features are reduced to low-dimensional features with the principal component analysis (PCA) algorithm, which lowers the complexity of model training; an extreme gradient boosting (XGBoost) model is then trained on a large, reliable labeled data set; finally, service type identification is performed on actually captured data traffic. By combining PCA and XGBoost, the method reduces algorithm complexity and improves service type identification.
Based on the service type identification application scenario, the invention provides a service type identification method based on the fusion of PCA and XGBoost, as shown in figure 1, the method comprises the following steps:
Step S1, collecting a labeled network traffic data set with Wireshark, the data set covering service types such as HTTP, NTP, DNS, QQ, WeChat, video and mail; the features of the data set are represented as follows:
$$F = [f_1, f_2, f_3, \dots, f_d]$$
F is a vector containing d features, and $f_i$ denotes the i-th key feature index.
Step S2, performing data cleaning and feature extraction on the labeled data set collected in step S1, removing duplicate samples and invalid samples with incomplete data, and normalizing each key feature by its maximum value, as follows:
$$\tilde{f}_i = \frac{f_i}{\max(f_i)}$$
where $\tilde{f}_i$ is the i-th key feature index after normalization and $\max(f_i)$ is the maximum value taken by the i-th key feature index.
Step S3, before training the classification model, the dimensionality of the data is reduced while preserving the accuracy of the classification model: the complex multi-dimensional features of step S2 are reduced to low-dimensional features by principal component analysis feature dimension reduction, that is, the original d-dimensional data set is converted into a k-dimensional data set, where k < d, so that the most important characteristics of the data are retained. The dimension reduction proceeds as follows:
Step S3.1, forming a matrix X of d rows and n columns from the network traffic data set containing multi-dimensional features of step S2, wherein this data set has n samples and each sample has d-dimensional features;
Step S3.2, centering the matrix X of d rows and n columns of step S3.1 to obtain the centered matrix X';
Step S3.3, computing the covariance matrix Cov of the matrix X' of step S3.2:
$$Cov = \frac{1}{n} X' X'^{T}$$
where the superscript T denotes the transpose;
Step S3.4, computing the eigenvalues and the corresponding eigenvectors of the covariance matrix;
Step S3.5, arranging the eigenvectors as the rows of a matrix, from top to bottom in descending order of their eigenvalues, and taking the first k rows to form the matrix P;
Step S4, selecting, from the labeled low-dimensional data set of step S3, key performance indicators that are highly relevant to the service type, these indicators forming a labeled data set; dividing this data set into a training set and a test set; inputting the training set into the extreme gradient boosting model (XGBoost) for training; tuning the parameters of the classifier; and testing the tuned model. The extreme gradient boosting model (XGBoost) takes the following form:
Step S4.1, defining the objective function:
$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k)$$
where $l(y_i, \hat{y}_i)$ is the loss function, which measures the difference between the predicted value $\hat{y}_i$ and the target value $y_i$; in this invention it is the difference between the predicted service type label and the actual service type label.
$\Omega(f_k)$ is the regularization term, defined as:
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
where T is the number of leaf nodes, λ is the regularization parameter, γ is the learning rate, and $w_j$ is the predicted value of the j-th leaf node. Since XGBoost is a forward stagewise additive algorithm, the result of round t is the result of the previous t-1 rounds plus the current weak classifier $f_t$. Each iteration therefore looks for the CART tree that reduces the loss function the most, so the objective function can be rewritten as:
$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + const$$
Here const expresses that, at round t, the regularization terms of the previous t-1 iterations can be regarded as constant. Approximating the objective function by its second-order Taylor expansion gives:
$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + const$$
where $g_i = \partial_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right)$ and $h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right)$. Removing the terms that are constant in the t-th iteration
yields the objective function:
$$Obj^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
The objective function depends only on the first and second derivatives of the error function at each data point. After expanding the regularization term, the objective function can be rewritten as:
$$Obj^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^2 \right] + \gamma T$$
where:
$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$
$I_j$ is the index set of the samples assigned to leaf node j.
Assuming that the structure of the decision tree has been determined, the predicted value on each leaf node is obtained by setting the derivative of the objective with respect to $w_j$ to zero, which gives:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$
Thus, the final objective function can be written as:
$$Obj = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j+\lambda} + \gamma T$$
where T is the number of leaf nodes, Obj is the objective function, $G_j$ is the first-derivative term of the Taylor expansion of the objective function at the j-th leaf node, and $H_j$ is the second-derivative term of the Taylor expansion of the objective function at the j-th leaf node;
Step S4.2, tuning the learning rate γ and the regularization parameter λ of step S4.1:
Step S4.2.1, setting the search space Φ and the search step size μ of the learning rate γ and of the regularization parameter λ as follows:
$$\gamma = (\gamma_{\text{start}},\ \gamma_{\text{end}},\ \mu_\gamma)$$
$$\lambda = (\lambda_{\text{start}},\ \lambda_{\text{end}},\ \mu_\lambda)$$
where $\gamma_{\text{start}}$ and $\gamma_{\text{end}}$ are the two boundaries of the search space $\Phi_\gamma$ of the learning rate γ, and $\mu_\gamma$ is the search step size of the learning rate γ; $\lambda_{\text{start}}$ and $\lambda_{\text{end}}$ are the two boundaries of the search space $\Theta_\lambda$ of the regularization parameter λ, and $\mu_\lambda$ is the search step size of the regularization parameter λ;
Step S4.2.2, generating a two-dimensional search parameter set matrix $H_S$ from the search spaces and search step sizes set above, whose entries are the parameter sets
$$H_S = \left[\,(\gamma_{\text{start}} + p\,\mu_\gamma,\ \lambda_{\text{start}} + q\,\mu_\lambda)\,\right]$$
where p and q are integers indexing the steps within the respective search spaces;
Step S4.2.3, for each parameter set in $H_S$ of step S4.2.2, evaluating the average classification accuracy of the XGBoost classification model under that parameter set, and selecting the parameter set with the highest average classification accuracy; if exactly one parameter set attains the highest average classification accuracy, it is the selected parameter set; if several parameter sets attain it, the one with the smallest $\lambda_{\text{start}} + q\,\mu_\lambda$ among them is selected;
Step S4.2.4, the values $\gamma_{\text{start}} + p\,\mu_\gamma$ and $\lambda_{\text{start}} + q\,\mu_\lambda$ of the parameter set selected in step S4.2.3 are the optimal learning rate γ and regularization parameter λ of the XGBoost classification model for the labeled low-dimensional data set of step S3;
Step S5, inputting the network traffic data set to be tested into the extreme gradient boosting (XGBoost) classification model of step S4 to obtain a service type classification result.
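To make the role of $G_j$, $H_j$, λ and γ in step S4.1 concrete, the sketch below evaluates a fixed tree structure. It assumes the standard XGBoost closed forms, namely the leaf value $w_j = -G_j/(H_j+\lambda)$ and the structure score $Obj = -\frac{1}{2}\sum_j G_j^2/(H_j+\lambda) + \gamma T$; the function names and the numeric values are hypothetical.

```python
from typing import List


def leaf_weight(G: float, H: float, lam: float) -> float:
    """Optimal predicted value of one leaf: w = -G / (H + lambda)."""
    return -G / (H + lam)


def structure_score(Gs: List[float], Hs: List[float], lam: float, gamma: float) -> float:
    """Objective of a fixed tree: -1/2 * sum(G_j^2 / (H_j + lambda)) + gamma * T."""
    T = len(Gs)  # number of leaf nodes
    return -0.5 * sum(G * G / (H + lam) for G, H in zip(Gs, Hs)) + gamma * T


# Hypothetical per-leaf sums of first and second derivatives for a tree with two leaves.
Gs = [2.0, -1.0]
Hs = [3.0, 1.0]
weights = [leaf_weight(G, H, lam=1.0) for G, H in zip(Gs, Hs)]
score = structure_score(Gs, Hs, lam=1.0, gamma=0.5)
```

A larger λ shrinks every leaf value toward zero, while a larger γ penalizes each additional leaf, which is why both act as complexity controls in this objective.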
To illustrate the effectiveness of the method of the present invention, an example is given. The example data are captured with the network packet capture tool Wireshark in the system of Fig. 1; the protocol and service types obtained are mainly the HTTP protocol, the NTP protocol, the DNS protocol, QQ, WeChat, messenger video, e-mail and the like. In addition, 8 key feature indexes are considered: timestamp, source IP, destination IP, packet length, intermediate protocol, source port number, destination port number, and whether ACK/SYN is contained.
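One way to hold the eight key feature indexes listed above as a numeric vector is sketched here. Everything in it is an assumption for illustration: the class name `PacketRecord`, the field names, the integer encoding of IP addresses and the `PROTOCOL_CODES` table are hypothetical and are not taken from the source.

```python
import ipaddress
from dataclasses import dataclass
from typing import List

# Hypothetical numeric codes for the protocol field.
PROTOCOL_CODES = {"TCP": 0, "UDP": 1}


@dataclass
class PacketRecord:
    timestamp: float
    src_ip: str
    dst_ip: str
    length: int
    protocol: str
    src_port: int
    dst_port: int
    has_ack_syn: bool

    def to_vector(self) -> List[float]:
        """Encode the eight fields as floats so they can form one column of the data matrix."""
        return [
            self.timestamp,
            float(int(ipaddress.ip_address(self.src_ip))),  # IP address as an integer
            float(int(ipaddress.ip_address(self.dst_ip))),
            float(self.length),
            float(PROTOCOL_CODES[self.protocol]),
            float(self.src_port),
            float(self.dst_port),
            1.0 if self.has_ack_syn else 0.0,
        ]


record = PacketRecord(0.5, "192.168.0.1", "10.0.0.2", 60, "UDP", 5353, 53, False)
vector = record.to_vector()
```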
Step 1, collecting a labeled network traffic data set with Wireshark, and performing data cleaning and feature extraction on it to remove duplicate samples and invalid samples with incomplete data. The features of the data set are represented as follows:
$$F = [f_1, f_2, f_3, \dots, f_d]$$
Step 2, to facilitate analysis, normalizing the historical data as follows:
$$\tilde{f}_i = \frac{f_i}{\max(f_i)}$$
where $\tilde{f}_i$ is the i-th key feature index after normalization and $\max(f_i)$ is the maximum value taken by the i-th key feature index.
Step 3, reducing the complex multi-dimensional features to low-dimensional features by principal component analysis feature dimension reduction, that is, converting the original d-dimensional data set into a k-dimensional data set so that the most important characteristics of the data set are retained. The PCA feature dimension reduction proceeds as follows:
(1) Form the original data set into a matrix X of d rows and n columns.
(2) Center the original data set, i.e. $x_i \leftarrow x_i - \frac{1}{n}\sum_{j=1}^{n} x_j$, where $x_i$ is one sample.
(3) Compute the covariance matrix of the centered data set: $Cov = \frac{1}{n} X' X'^{T}$, where X' is the centered matrix.
(4) Compute the eigenvalues and the corresponding eigenvectors of the covariance matrix.
(5) Arrange the eigenvectors as the rows of a matrix, from top to bottom in descending order of their eigenvalues, and take the first k rows to form the matrix P.
(6) Y = PX is the data after the dimension is reduced to k.
Finally, the principal component analysis algorithm selects the 6 feature indexes most relevant to the service type, as shown in Fig. 2, which reduces the complexity of the subsequent model training.
Step 4, dividing the labeled data set into a training set and a test set, inputting the training set into the extreme gradient boosting model (XGBoost) for training, tuning the parameters of the classifier, and testing the tuned model. The objective function of the extreme gradient boosting model (XGBoost) is as follows:
$$Obj = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j+\lambda} + \gamma T$$
where:
$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$
The optimal split is then searched for with a greedy algorithm. The basic idea is to start from the root node, split one leaf node at a time, and choose among the candidate splits according to how each one changes the objective. XGBoost has a specific criterion for selecting the best split. Substituting the predicted values of the leaf nodes into the loss function gives the minimum value of the loss function.
Step 5, inputting the actually captured unlabeled data set into the XGBoost classification model for testing, to obtain the service type identification result.
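The greedy split search mentioned in step 4 can be sketched as follows. The sketch assumes the split gain commonly used by XGBoost, $\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$, which the source does not state explicitly; the function names and sample derivatives are hypothetical.

```python
from typing import List, Optional, Tuple


def leaf_term(G: float, H: float, lam: float) -> float:
    """Contribution G^2 / (H + lambda) of one leaf to the structure score."""
    return G * G / (H + lam)


def best_split(
    g: List[float], h: List[float], lam: float, gamma: float
) -> Tuple[Optional[int], float]:
    """Scan split positions over samples sorted by one feature.

    Returns (k, gain): samples [0, k) go left and [k, n) go right.
    k is None when no split has positive gain.
    """
    G_total, H_total = sum(g), sum(h)
    best_k, best_gain = None, 0.0
    G_left = H_left = 0.0
    for k in range(1, len(g)):
        G_left += g[k - 1]
        H_left += h[k - 1]
        G_right, H_right = G_total - G_left, H_total - H_left
        gain = 0.5 * (
            leaf_term(G_left, H_left, lam)
            + leaf_term(G_right, H_right, lam)
            - leaf_term(G_total, H_total, lam)
        ) - gamma
        if gain > best_gain:
            best_k, best_gain = k, gain
    return best_k, best_gain


# Hypothetical first and second derivatives of four samples, sorted by feature value.
g = [-1.0, -1.0, 1.0, 1.0]
h = [1.0, 1.0, 1.0, 1.0]
split_index, split_gain = best_split(g, h, lam=1.0, gamma=0.0)
```

A full tree builder would repeat this scan for every feature and every leaf, keeping the split with the largest positive gain.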
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention.
Claims (5)
1. The service type identification method based on the fusion of PCA and XGBoost is characterized by comprising the following steps:
Step S1, collecting a labeled network traffic data set, wherein the service types of the network traffic data set include HTTP, NTP, DNS, QQ, WeChat, video and mail;
Step S2, performing data cleaning and feature extraction on the network traffic data set of step S1, thereby obtaining a network traffic data set containing multi-dimensional features;
Step S3, reducing the multi-dimensional features of step S2 to low-dimensional features by principal component analysis (PCA) feature dimension reduction, to obtain a labeled low-dimensional data set;
Step S4, selecting, from the labeled low-dimensional data set of step S3, key performance indicators that are highly relevant to the service type, these indicators forming a labeled data set; dividing this data set into a training set and a test set; inputting the training set into an extreme gradient boosting (XGBoost) classification model for training; tuning the learning rate γ and the regularization parameter λ of the XGBoost classification model with an improved parameter tuning method to obtain the γ and λ best suited to the network traffic data set; and testing the tuned XGBoost classification model, thereby obtaining a trained XGBoost classification model;
The method for tuning the learning rate γ and the regularization parameter λ specifically comprises the following steps:
Step S4.1, the objective function of the extreme gradient boosting (XGBoost) classification model is:
$$Obj = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j+\lambda} + \gamma T$$
where T is the number of leaf nodes, Obj is the objective function, $G_j$ is the first-derivative term of the Taylor expansion of the objective function at the j-th leaf node, and $H_j$ is the second-derivative term of the Taylor expansion of the objective function at the j-th leaf node;
Step S4.2, parameter tuning is performed on the learning rate gamma and the regularization parameter lambda in step S4.1:
Step S4.2.1, setting a search space Φ and a search step size μ of the learning rate γ and the regularization parameter λ, where the settings are as follows:
γ=(γstart,γend,μγ)
λ=(λstart,λend,μλ)
wherein, gamma start and gamma end are the upper boundary and the lower boundary of the search space phi γ of the learning rate gamma, and mu γ is the search step length of the learning rate gamma; lambda start and lambda end are the upper and lower boundaries of the search space Θ λ of the regularization parameter lambda, respectively, and mu λ is the search step size of the regularization parameter lambda;
Step S4.2.2, generating a two-dimensional search parameter set matrix H_S from the set search spaces and search step sizes, defined as follows:
wherein p and q are integers that index the grid points of γ and λ, respectively;
Step S4.2.3, for each parameter set in H_S of step S4.2.2, evaluating the average classification precision of the XGBoost classification model under that parameter set, and selecting the parameter set with the highest average classification precision; if exactly one parameter set attains the highest average classification precision, that parameter set is selected; if several parameter sets attain it, the one with the smallest λ_start + qμ_λ among them is selected;
Step S4.2.4, the values γ_start + pμ_γ and λ_start + qμ_λ in the parameter set selected in step S4.2.3 are the optimal learning rate γ and regularization parameter λ, respectively, of the XGBoost classification model for the labeled low-dimensional data set of step S3;
Step S5, inputting the network traffic data set to be tested into the trained extreme gradient boosting (XGBoost) classification model of step S4 to obtain a service type classification result.
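The tuning procedure of steps S4.2.1 to S4.2.4 can be sketched as a plain grid search. The sketch below is illustrative only: the function names `build_search_matrix` and `select_best`, the inclusive treatment of both boundaries, and the `evaluate` callback are assumptions that do not appear in the claims. `evaluate(gamma, lam)` stands in for whatever routine returns the average classification precision of the XGBoost model under one parameter set.

```python
import math
from typing import Callable, List, Tuple

ParamSet = Tuple[float, float]  # (gamma, lambda)


def build_search_matrix(g_start: float, g_end: float, mu_g: float,
                        l_start: float, l_end: float, mu_l: float
                        ) -> List[List[ParamSet]]:
    """Build H_S: row p, column q holds (g_start + p*mu_g, l_start + q*mu_l)."""
    # Assumption: both boundaries are inclusive; the epsilon absorbs float error.
    n_p = math.floor((g_end - g_start) / mu_g + 1e-9)
    n_q = math.floor((l_end - l_start) / mu_l + 1e-9)
    return [[(g_start + p * mu_g, l_start + q * mu_l) for q in range(n_q + 1)]
            for p in range(n_p + 1)]


def select_best(h_s: List[List[ParamSet]],
                evaluate: Callable[[float, float], float],
                tol: float = 1e-12) -> ParamSet:
    """Pick the best-scoring parameter set; break ties by the smallest lambda."""
    # Score every parameter set in the matrix (step S4.2.3).
    scored = [(evaluate(g, lam), g, lam) for row in h_s for g, lam in row]
    best = max(score for score, _, _ in scored)
    # Collect every parameter set that attains the best score.
    tied = [(g, lam) for score, g, lam in scored if best - score <= tol]
    # Tie-break on the smallest lambda, as step S4.2.3 prescribes.
    return min(tied, key=lambda pair: pair[1])
```

If several tied parameter sets also share the smallest λ, this sketch returns the first one encountered; the claims do not say how that case is resolved. In practice `evaluate` might wrap a cross-validated run of an XGBoost classifier, which is likewise left open here.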
2. The method for identifying a service type based on the fusion of PCA and XGBoost as set forth in claim 1, wherein the multidimensional feature F in step S2 is specifically expressed as follows:
F = [f_1, f_2, f_3, …, f_d]
wherein F is a vector containing d features, f_i is the i-th key feature index, 1 ≤ i ≤ d, and each f_i is normalized by its maximum value in the following specific manner:
wherein max(f_i) is the maximum value attained by the i-th key feature index, and the result is the i-th key feature index after normalization.
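The normalization formula itself is not reproduced in this text. From the description (each f_i divided by the largest value it attains), it can be written as follows; the hat notation for the normalized index is an assumption, since the original symbol is missing:

$$
\hat{f}_i = \frac{f_i}{\max(f_i)}, \qquad 1 \le i \le d
$$

For example, under this reading a data-length feature whose largest observed value is 1500 would map an observation of 750 to 0.5.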
3. The method for identifying a service type based on the fusion of PCA and XGBoost as claimed in claim 1, wherein in step S3 the feature dimension reduction by principal component analysis reduces the multi-dimensional features to low-dimensional features, the original multi-dimensional features being d-dimensional and the reduced low-dimensional features being k-dimensional, where k < d, through the following steps:
Step S3.1, forming a matrix X of d rows and n columns from the network traffic data set containing the multi-dimensional features of step S2, wherein the data set has n samples and each sample has d-dimensional features;
Step S3.2, centering the matrix X of d rows and n columns of step S3.1 to obtain a centered matrix X';
Step S3.3, computing the covariance matrix Cov of the matrix X' of step S3.2:
wherein the superscript T denotes the transpose;
Step S3.4, obtaining the eigenvalues and corresponding eigenvectors of the covariance matrix;
Step S3.5, arranging the obtained eigenvectors as the rows of a matrix, from top to bottom in descending order of their corresponding eigenvalues, and taking the first k rows to form a matrix P;
Step S3.6, Y = PX, wherein Y is the low-dimensional feature obtained by reducing the dimension to k by the principal component analysis feature dimension reduction method.
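Steps S3.1 to S3.6 can be sketched with NumPy as follows. This is a minimal illustration, not the patented implementation: the function name `pca_reduce` is hypothetical, and the covariance is taken as Cov = (1/n) X' X'^T, which is an assumption because the formula of step S3.3 is not reproduced in this text (a 1/(n-1) factor would give the same eigenvectors).

```python
import numpy as np


def pca_reduce(X: np.ndarray, k: int):
    """Reduce a d x n matrix (columns are samples) to k x n, following S3.1-S3.6."""
    d, n = X.shape
    # S3.2: centre each feature (row) by subtracting its mean over the n samples.
    X_centered = X - X.mean(axis=1, keepdims=True)
    # S3.3: covariance matrix (d x d); the 1/n factor is an assumption.
    cov = X_centered @ X_centered.T / n
    # S3.4: eigen-decomposition; eigh suits symmetric matrices and
    # returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # S3.5: sort eigenvectors by descending eigenvalue and keep the first k as rows.
    order = np.argsort(eigvals)[::-1][:k]
    P = eigvecs[:, order].T  # shape k x d
    # S3.6: Y = P X as written in the claim. Projecting X_centered instead
    # would differ only by a constant offset per component.
    Y = P @ X
    return Y, P
```

A call such as `Y, P = pca_reduce(X, k=2)` returns the k x n reduced features together with the projection matrix P, which can then be reused to project unseen samples.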
4. The method for identifying a service type based on the fusion of PCA and XGBoost as claimed in claim 1, wherein an unlabeled actual data set is captured with the network packet capture tool Wireshark and, after PCA dimension reduction, is input into the extreme gradient boosting (XGBoost) classification model to obtain a service type classification result.
5. The method for identifying a service type based on the fusion of PCA and XGBoost as defined in claim 1, wherein in step S4, the key performance indicators with high correlation include source IP, destination IP, source port number, destination port number, protocol type and data length.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111202293.8A CN114048795B (en) | 2021-10-15 | 2021-10-15 | Service type identification method based on PCA and XGBoost fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114048795A (en) | 2022-02-15 |
CN114048795B (en) | 2024-11-08 |
Family ID: 80205105
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111202293.8A Active CN114048795B (en) | 2021-10-15 | 2021-10-15 | Service type identification method based on PCA and XGBoost fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114048795B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114978585B (en) * | 2022-04-12 | 2024-02-27 | 国家计算机网络与信息安全管理中心 | Deep learning symmetric encryption protocol identification method based on flow characteristics |
CN115277585B (en) * | 2022-07-08 | 2023-07-28 | 南京邮电大学 | Multi-granularity business flow identification method based on machine learning |
CN116975401A (en) * | 2023-09-19 | 2023-10-31 | 杭州美创科技股份有限公司 | Database field identification method, device, computer equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107682216A (en) * | 2017-09-01 | 2018-02-09 | 南京南瑞集团公司 | A kind of network traffics protocol recognition method based on deep learning |
CN111586728A (en) * | 2020-04-30 | 2020-08-25 | 南京邮电大学 | Small sample characteristic-oriented heterogeneous wireless network fault detection and diagnosis method |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109309630B (en) * | 2018-09-25 | 2021-09-21 | 深圳先进技术研究院 | Network traffic classification method and system and electronic equipment |
CN109639481B (en) * | 2018-12-11 | 2020-10-27 | 深圳先进技术研究院 | Deep learning-based network traffic classification method and system and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN114048795A (en) | 2022-02-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114048795B (en) | Service type identification method based on PCA and XGBoost fusion | |
CN112564974B (en) | Deep learning-based fingerprint identification method for Internet of things equipment | |
CN113037730B (en) | Network encryption traffic classification method and system based on multi-feature learning | |
CN113705712B (en) | Network traffic classification method and system based on federal semi-supervised learning | |
CN112671757B (en) | Encryption flow protocol identification method and device based on automatic machine learning | |
CN112163594A (en) | Network encryption traffic identification method and device | |
CN113079069B (en) | Mixed granularity training and classifying method for large-scale encrypted network traffic | |
CN114172688B (en) | Method for automatically extracting key nodes of network threat of encrypted traffic based on GCN-DL (generalized traffic channel-DL) | |
CN113989583A (en) | Method and system for detecting malicious traffic of internet | |
CN116260642A (en) | Knowledge distillation space-time neural network-based lightweight Internet of things malicious traffic identification method | |
CN109299185B (en) | Analysis method for convolutional neural network extraction features aiming at time sequence flow data | |
CN108319672A (en) | Mobile terminal malicious information filtering method and system based on cloud computing | |
CN114553722A (en) | VPN and non-VPN network flow classification method based on multi-view one-dimensional convolution neural network | |
Xue et al. | Classification and identification of unknown network protocols based on CNN and T-SNE | |
Yang et al. | Deep learning-based reverse method of binary protocol | |
CN116451138A (en) | Encryption traffic classification method, device and storage medium based on multi-modal learning | |
CN114826776B (en) | Weak supervision detection method and system for encrypting malicious traffic | |
CN113935398B (en) | Network traffic classification method and system based on small sample learning in Internet of things environment | |
Ding et al. | Network attack detection method based on convolutional neural network | |
CN107239787A (en) | A kind of utilization multi-source data have the Image classification method of privacy protection function | |
CN114979017B (en) | Deep learning protocol identification method and system based on original flow of industrial control system | |
Wei | Deep learning model under complex network and its application in traffic detection and analysis | |
Wu et al. | Identifying potential standard essential patents based on text mining and generative topographic mapping | |
CN112367325B (en) | Unknown protocol message clustering method and system based on closed frequent item mining | |
He et al. | Identification of SSH applications based on convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |