CN114048795B

CN114048795B - Service type identification method based on PCA and XGBoost fusion

Info

Publication number: CN114048795B
Application number: CN202111202293.8A
Authority: CN
Inventors: 刘旭; 胡俊华; 朱晓荣; 杨龙祥; 朱洪波; 江婷
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2021-10-15
Filing date: 2021-10-15
Publication date: 2024-11-08
Anticipated expiration: 2041-10-15
Also published as: CN114048795A

Abstract

The invention discloses a service type identification method based on the fusion of PCA and XGBoost, comprising the following steps: step S1, collecting a labeled network traffic dataset, wherein the service types of the dataset comprise HTTP, NTP, DNS, QQ, WeChat, video and mail; step S2, performing data cleaning and feature extraction on the network traffic dataset of step S1 to obtain a network traffic dataset containing multidimensional features; step S3, reducing the multidimensional features of step S2 to low-dimensional features by principal component analysis (PCA) feature dimensionality reduction to obtain a labeled low-dimensional dataset; step S4, obtaining a trained extreme gradient boosting (XGBoost) classification model; and step S5, inputting the network traffic dataset to be tested into the XGBoost classification model of step S4 to obtain a service type classification result. The invention both reduces the complexity of the identification method and improves the accuracy of service type identification.

Description

Service type identification method based on PCA and XGBoost fusion

Technical Field

The invention relates to the technical field of communication networks, in particular to a service type identification method based on PCA and XGBoost fusion.

Background

With the continuous development of information technology, internet traffic grows year by year and new network services emerge endlessly. These new services have greatly promoted social development and have attracted a large number of customers to telecom operators. However, the network now carries many kinds of encrypted traffic, which strongly affects both the underlying traffic model and the upper-layer application model of the network. To improve network management and network services and to keep the network environment secure, effectively identifying the encrypted traffic of various application services, and thereby building an operable and manageable network, has become a key research direction.

Conventional traffic type identification methods include port-based traffic identification and Deep Packet Inspection (DPI)-based traffic identification. Port-based traffic identification classifies network traffic by the well-known port numbers in the TCP/UDP packet header. It was initially very efficient and easy to implement for real-time traffic classification, but today many network applications avoid well-known ports in order to escape detection, and some use dynamic port numbers. Port-based traffic classification therefore no longer produces reliable results, and its classification accuracy is low. DPI-based traffic identification is essentially a packet filtering technology: besides parsing the headers of the L2 data link layer, the L3 network layer and the L4 transport layer, DPI also analyzes the payload of the L7 application layer, so that application types and their contents can be identified. However, since most traffic is now encrypted, which prevents inspection of packet payloads, the classification accuracy of DPI is also limited. The current trend is to use machine learning methods for IP traffic classification.
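The weakness of port-based identification described above can be illustrated with a minimal sketch. The port table and the function name `classify_by_port` are illustrative assumptions, not part of the patent; the table lists only a few well-known port assignments.

```python
# Hypothetical lookup table: a few well-known ports mapped to service labels.
WELL_KNOWN_PORTS = {53: "DNS", 80: "HTTP", 123: "NTP", 25: "mail"}


def classify_by_port(src_port: int, dst_port: int) -> str:
    """Return a service label from well-known ports, or 'unknown'."""
    # Check the destination port first, then the source port.
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    # Applications on dynamic or non-standard ports cannot be labeled this way.
    return "unknown"
```

A flow between two dynamic ports falls through to `"unknown"`, which is the failure mode that motivates feature-based machine learning classification.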

In recent years, artificial intelligence technology based on machine learning has attracted wide attention in computer vision, natural language processing, speech recognition, medical imaging and other areas, and in many fields it far outperforms conventional solutions. Machine learning has proven sound and effective for classification tasks, and machine learning and data mining technologies are increasingly applied in the field of cyberspace security. Machine learning therefore offers a way to solve the encrypted traffic classification problem that traditional methods cannot solve: traditional service type identification methods cannot identify encrypted traffic and suffer from low identification accuracy.

Disclosure of Invention

The invention aims to solve the technical problem of overcoming the defects of the prior art and providing a service type identification method based on the fusion of PCA and XGBoost, which not only can reduce the complexity of the identification method, but also can improve the accuracy of service type identification.

The invention adopts the following technical scheme for solving the technical problems:

The invention provides a service type identification method based on PCA and XGBoost fusion, which comprises the following steps:

Step S1, collecting a labeled network traffic dataset, wherein the service types of the network traffic dataset comprise HTTP, NTP, DNS, QQ, WeChat, video and mail;

Step S2, performing data cleaning and feature extraction on the network traffic dataset of step S1, thereby obtaining a network traffic dataset containing multidimensional features;

Step S3, reducing the multidimensional features of step S2 to low-dimensional features by principal component analysis (PCA) feature dimensionality reduction, thereby obtaining a labeled low-dimensional dataset;

Step S4, selecting, from the labeled low-dimensional dataset of step S3, the key performance indicators that are highly relevant to the service type, wherein the key performance indicators form a labeled dataset that is divided into a training set and a test set; inputting the training set into an extreme gradient boosting (XGBoost) classification model for training; tuning the learning rate γ and the regularization parameter λ of the XGBoost classification model with an improved parameter tuning method to obtain the learning rate γ and regularization parameter λ best suited to the network traffic dataset; and testing the tuned XGBoost classification model, thereby obtaining a trained XGBoost classification model;

The learning rate γ and the regularization parameter λ are tuned specifically as follows:

Step S4.1, the extreme gradient boosting (XGBoost) classification model is:

$$\text{Obj} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda} + \gamma T$$

wherein T is the number of leaf nodes, Obj is the objective function, G_j is the first-derivative term of the Taylor expansion of the objective function at the j-th leaf node, and H_j is the second-derivative term of the Taylor expansion of the objective function at the j-th leaf node;
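As a small numerical illustration of the quantities named in step S4.1, the sketch below evaluates the standard XGBoost structure score, Obj = -1/2 * sum_j G_j^2 / (H_j + λ) + γT, for given per-leaf sums. The function name `structure_score` and the sample numbers are assumptions made for illustration only.

```python
def structure_score(G, H, lam, gamma):
    """Evaluate Obj = -1/2 * sum_j G_j^2 / (H_j + lam) + gamma * T.

    G and H hold one first-derivative sum and one second-derivative sum
    per leaf, so T (the number of leaves) is len(G).
    """
    if len(G) != len(H):
        raise ValueError("G and H must have one entry per leaf")
    T = len(G)
    return -0.5 * sum(g * g / (h + lam) for g, h in zip(G, H)) + gamma * T
```

For two leaves with G = [2, -4], H = [1, 3], λ = 1 and γ = 0.5, the sum is 4/2 + 16/4 = 6, so the score is -3 + 1 = -2. A lower score indicates a better tree structure.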

Step S4.2, parameter tuning is performed on the learning rate gamma and the regularization parameter lambda in step S4.1:

Step S4.2.1, setting the search space Φ and the search step size μ of the learning rate γ and the regularization parameter λ as follows:

γ＝(γ_start,γ_end,μ_γ)

λ＝(λ_start,λ_end,μ_λ)

wherein γ_start and γ_end are the boundaries of the search space Φ_γ of the learning rate γ, and μ_γ is the search step size of the learning rate γ; λ_start and λ_end are the boundaries of the search space Θ_λ of the regularization parameter λ, and μ_λ is the search step size of the regularization parameter λ;

Step S4.2.2, generating a two-dimensional search parameter set matrix H_S from the set search spaces and search step sizes, defined as follows:

$$H_S = \Big[\big(\gamma_{\text{start}} + p\mu_\gamma,\ \lambda_{\text{start}} + q\mu_\lambda\big)\Big]_{p,q}$$

where p is an integer with 0 ≤ p ≤ (γ_end − γ_start)/μ_γ, and q is an integer with 0 ≤ q ≤ (λ_end − λ_start)/μ_λ;

Step S4.2.3, for each parameter set in H_S of step S4.2.2, evaluating the average classification accuracy of the XGBoost classification model under that parameter set, and selecting the parameter set with the highest average classification accuracy; if exactly one parameter set attains the highest average classification accuracy, it is the selected parameter set; if several do, the one with the smallest λ_start + qμ_λ among them is selected;

Step S4.2.4, the values γ_start + pμ_γ and λ_start + qμ_λ in the parameter set selected in step S4.2.3 are the optimal learning rate γ and regularization parameter λ of the XGBoost classification model for the labeled low-dimensional dataset of step S3;
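Steps S4.2.1 to S4.2.4 amount to an exhaustive grid search with a tie-break on the smallest λ. The following is a minimal sketch under stated assumptions: `evaluate` is a hypothetical callback that returns the average classification accuracy for one (γ, λ) pair, the start value is treated as the lower bound of each search space, and any tie that remains after the λ rule goes to the smallest γ (the source does not specify this case).

```python
from typing import Callable, List, Tuple


def build_grid(start: float, end: float, step: float) -> List[float]:
    """Values start, start + step, ... that do not exceed end."""
    if step <= 0 or end < start:
        raise ValueError("need step > 0 and end >= start")
    # The small tolerance guards against floating-point error in the division.
    count = int((end - start) / step + 1e-9) + 1
    return [round(start + i * step, 10) for i in range(count)]


def tune(evaluate: Callable[[float, float], float],
         gamma_space: Tuple[float, float, float],
         lambda_space: Tuple[float, float, float]) -> Tuple[float, float]:
    """Return the (gamma, lambda) pair with the highest score.

    Ties on the score are resolved in favor of the smallest lambda.
    """
    gammas = build_grid(*gamma_space)
    lambdas = build_grid(*lambda_space)
    # Every (gamma, lambda) pair corresponds to one entry of the matrix H_S.
    candidates = [(evaluate(g, l), g, l) for g in gammas for l in lambdas]
    best_score = max(score for score, _, _ in candidates)
    tied = [(g, l) for score, g, l in candidates if score == best_score]
    # min() keeps the first pair with the smallest lambda; pairs are ordered
    # by ascending gamma, so remaining ties go to the smallest gamma.
    return min(tied, key=lambda pair: pair[1])
```

In practice `evaluate` would train and cross-validate a classifier for each pair, so the cost grows with the product of the two grid sizes.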

Step S5, inputting the network traffic dataset to be tested into the trained XGBoost classification model of step S4 to obtain a service type classification result.

As a further optimization of the service type identification method based on the fusion of PCA and XGBoost, the multidimensional feature F in step S2 is expressed as follows:

F＝[f₁,f₂,f₃…f_d]

F is a vector containing d features, f_i denotes the i-th key feature indicator, 1 ≤ i ≤ d, and each f_i is normalized by its maximum value as follows:

$$f_i^{*} = \frac{f_i}{\max(f_i)}$$

where max(f_i) is the maximum value that the i-th key feature indicator takes, and f_i^* is the i-th key feature indicator after normalization.
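A minimal sketch of the maximum-value normalization follows. It assumes non-negative feature values stored as a list of rows (one row per sample); the function name `max_normalize` and the handling of an all-zero column are illustrative choices, not taken from the patent.

```python
def max_normalize(samples):
    """Divide each feature column by its maximum observed value."""
    if not samples:
        return []
    d = len(samples[0])
    maxima = [max(row[i] for row in samples) for i in range(d)]
    # A zero maximum would cause a division by zero; such a column is
    # left unchanged here (an assumption of this sketch).
    return [[row[i] / maxima[i] if maxima[i] != 0 else row[i] for i in range(d)]
            for row in samples]
```

For non-negative inputs every normalized value lies in [0, 1], which keeps features such as port numbers and packet lengths on a comparable scale before PCA.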

As a further optimization of the service type identification method based on the fusion of PCA and XGBoost, step S3 reduces the multidimensional features to low-dimensional features by principal component analysis feature dimensionality reduction, where the original multidimensional features are d-dimensional and the reduced low-dimensional features are k-dimensional, with k < d;

Step S3.1, forming the network traffic dataset containing multidimensional features of step S2 into a matrix X of d rows and n columns, wherein the dataset has n samples and each sample has d-dimensional features;

Step S3.2, centering (subtracting the mean from) the matrix X of d rows and n columns of step S3.1 to obtain the centered matrix X';

Step S3.3, computing the covariance matrix Cov of the matrix X' of step S3.2:

$$\text{Cov} = \frac{1}{n} X' X'^{T}$$

wherein the superscript T denotes the transpose;

Step S3.4, computing the eigenvalues and corresponding eigenvectors of the covariance matrix;

Step S3.5, arranging the eigenvectors as the rows of a matrix, from top to bottom in descending order of their eigenvalues, and taking the first k rows to form the matrix P;

Step S3.6, Y = PX, where Y is the low-dimensional feature reduced to k dimensions by principal component analysis feature dimensionality reduction.
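Steps S3.1 to S3.6 can be sketched with NumPy as follows. This is an illustrative sketch rather than the patent's implementation: the function name `pca_reduce` is an assumption, the covariance uses the 1/n convention, and the projection is applied to the centered matrix, which differs from projecting the raw X only by a constant offset per component.

```python
import numpy as np


def pca_reduce(X: np.ndarray, k: int):
    """Reduce a (d, n) data matrix to k dimensions; returns (Y, P)."""
    d, n = X.shape
    if not 1 <= k <= d:
        raise ValueError("k must satisfy 1 <= k <= d")
    # S3.2: subtract the mean of every feature (row).
    X_centered = X - X.mean(axis=1, keepdims=True)
    # S3.3: covariance matrix of shape (d, d).
    cov = X_centered @ X_centered.T / n
    # S3.4: eigh suits symmetric matrices and returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # S3.5: eigenvectors as rows, sorted by descending eigenvalue, first k kept.
    order = np.argsort(eigvals)[::-1]
    P = eigvecs[:, order[:k]].T
    # S3.6: project the centered data onto the k principal directions.
    Y = P @ X_centered
    return Y, P
```

The rows of `P` are orthonormal, so `P @ P.T` is the k x k identity, and the first row of `Y` carries the largest variance.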

As a further optimization of the service type identification method based on the fusion of PCA and XGBoost, the actual unlabeled dataset is captured with the network packet capture tool Wireshark and input into the XGBoost classification model with PCA dimensionality reduction, thereby obtaining a service type classification result.

As a further optimization of the service type identification method based on the fusion of PCA and XGBoost, in step S4 the highly relevant key performance indicators comprise source IP, destination IP, source port number, destination port number, protocol type and data length.

Compared with the prior art, the technical scheme provided by the invention has the following technical effects:

The method solves the problems that traditional service type identification methods cannot identify encrypted traffic and have low identification accuracy; by combining principal component analysis with the XGBoost algorithm, it balances test-set accuracy against algorithm complexity, so that efficient and reliable service type identification can be achieved.

Drawings

Fig. 1 is a flow chart of a service type identification system based on PCA and XGBoost provided by the present invention.

Fig. 2 is a schematic diagram of feature dimension reduction using Principal Component Analysis (PCA) algorithm.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

The invention applies the ideas of feature dimensionality reduction and machine learning to the service type identification scenario. Network traffic data is first captured with the Wireshark packet capture tool, and a service type identification method based on the fusion of PCA and XGBoost is proposed. Following the idea of feature dimensionality reduction, complex multidimensional features are reduced to low-dimensional features with the principal component analysis (PCA) algorithm, which lowers the complexity of model training. An extreme gradient boosting (XGBoost) model is then trained on a large, reliable labeled dataset, and finally service type identification is performed on actually captured data traffic. By combining PCA and XGBoost, the method reduces algorithm complexity and can improve the accuracy of service type identification.

Based on the service type identification application scenario, the invention provides a service type identification method based on the fusion of PCA and XGBoost, as shown in figure 1, the method comprises the following steps:

Step S1, a labeled network traffic dataset is collected with Wireshark, covering service types such as HTTP, NTP, DNS, QQ, WeChat, video and mail. The dataset features are represented as follows:

F＝[f₁,f₂,f₃…f_d]

F is a vector containing d features, and f_i denotes the i-th key feature indicator.

Step S2, data cleaning and feature extraction are performed on the labeled dataset acquired in step S1: repeated samples and invalid samples with incomplete data are removed, and each key feature is normalized by its maximum value. The specific processing is as follows:

$$f_i^{*} = \frac{f_i}{\max(f_i)}$$

where f_i^* is the i-th key feature indicator after normalization, and max(f_i) is the maximum value that the i-th key feature indicator takes.
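The data cleaning described in step S2 (removing repeated samples and samples with incomplete data) could look like the following sketch. It assumes each sample is a list of values and that a missing value is represented by `None`; both the representation and the function name `clean_samples` are illustrative assumptions.

```python
def clean_samples(samples):
    """Drop incomplete rows and exact duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for row in samples:
        # An invalid sample is one with at least one missing value.
        if any(value is None for value in row):
            continue
        key = tuple(row)
        # A repeated sample has already been seen with identical values.
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(list(row))
    return cleaned
```

Cleaning before normalization matters because a duplicated or partially captured packet would otherwise distort the per-feature maximum and the covariance used by PCA.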

Step S3, before the classification model is trained, the dimensionality of the data is reduced while the accuracy of the classification model is preserved: the complex multidimensional features of step S2 are reduced to low-dimensional features by principal component analysis feature dimensionality reduction, that is, the original d-dimensional dataset is converted into a k-dimensional dataset with k < d, so that the most important characteristics of the data are retained. The principal component analysis dimensionality reduction proceeds as follows:

Step S3.3, computing the covariance matrix Cov of the matrix X' of step S3.2:

$$\text{Cov} = \frac{1}{n} X' X'^{T}$$

wherein the superscript T denotes the transpose;

Step S4, from the labeled low-dimensional dataset of step S3, the key performance indicators that are highly relevant to the service type are selected; they form a labeled dataset, which is divided into a training set and a test set. The training set is input into the extreme gradient boosting (XGBoost) model for training, the parameters of the classifier are tuned, and the tuned model is tested. The XGBoost model takes the following form:

Step S4.1, defining the objective function:

$$\text{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where n is the number of samples, K is the number of trees, and l(y_i, ŷ_i) is the loss function measuring the difference between the predicted value ŷ_i and the target value y_i; in this invention, it is the difference between the predicted service type label and the actual service type label.

Ω(f) is the regularization term, defined as:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$

where T is the number of leaf nodes, λ is the regularization parameter, γ is the learning rate, and w_j is the predicted value of the j-th leaf node. Since XGBoost is a forward stagewise algorithm, the result of round t is the result of the previous t−1 rounds plus the current weak classifier. Each iteration therefore looks for the CART tree that most reduces the loss function, so the objective function can be rewritten as:

$$\text{Obj}^{(t)} = \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + \text{const}$$

where const expresses that, at round t, the regularization terms of the previous t−1 iterations can be regarded as constant. Approximating the objective function by a second-order Taylor expansion gives:

$$\text{Obj}^{(t)} \approx \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \Big] + \Omega(f_t) + \text{const}$$

wherein

$$g_i = \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}}, \qquad h_i = \frac{\partial^{2} l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^{2}}$$

Removing the terms that are constant in the t-th iteration gives the objective function:

$$\text{Obj}^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \Big] + \Omega(f_t)$$

The objective function thus depends only on the first and second derivatives of the error function at each data point. After the regularization term is expanded, the objective function can be rewritten as:

$$\text{Obj}^{(t)} = \sum_{j=1}^{T} \Big[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^{2} \Big] + \gamma T$$

wherein:

$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$

I_j is the index set of the samples assigned to leaf node j.

Assuming that the structure of the decision tree has been determined, the predicted value on each leaf node is obtained by setting the derivative of the objective function with respect to w_j to zero, which gives:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$

Thus, the final objective function can be written as:

$$\text{Obj} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda} + \gamma T$$
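The closed-form leaf value of the standard XGBoost derivation, w_j* = -G_j / (H_j + λ), can be checked numerically. The sketch below assumes a squared-error loss l = 1/2 (y - ŷ)^2, chosen only because its derivatives are simple (g_i = ŷ_i - y_i, h_i = 1); the patent's classifier would use a classification loss instead. The function name `leaf_weight` is hypothetical.

```python
def leaf_weight(y_true, y_pred, lam):
    """Optimal leaf value w* = -G / (H + lam) for the samples in one leaf.

    Assumes squared-error loss 1/2 * (y - y_hat)^2, for which the first
    derivative is (y_hat - y) and the second derivative is 1.
    """
    g = [p - y for y, p in zip(y_true, y_pred)]  # first derivatives g_i
    h = [1.0] * len(y_true)                      # second derivatives h_i
    G, H = sum(g), sum(h)
    return -G / (H + lam)
```

With targets [1, 2, 3], current predictions of 0 and λ = 1, G = -6 and H = 3, so the leaf value is 6 / 4 = 1.5. Without regularization (λ = 0) it would be the mean residual, 2.0, which shows how λ shrinks leaf values toward zero.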

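The learning rate γ and the regularization parameter λ are then tuned over the search spaces:

γ＝(γ_start,γ_end,μ_γ)

λ＝(λ_start,λ_end,μ_λ)

from which the two-dimensional search parameter set matrix H_S is generated:

$$H_S = \Big[\big(\gamma_{\text{start}} + p\mu_\gamma,\ \lambda_{\text{start}} + q\mu_\lambda\big)\Big]_{p,q}$$

where p is an integer with 0 ≤ p ≤ (γ_end − γ_start)/μ_γ, and q is an integer with 0 ≤ q ≤ (λ_end − λ_start)/μ_λ.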

To illustrate the effectiveness of the method of the present invention, an example is given. The example data are captured with the network packet capture tool Wireshark in the system shown in Fig. 1; the protocol and service types obtained are mainly the HTTP, NTP and DNS protocols, QQ, WeChat, messenger video, e-mail and the like. In addition, 8 key feature indicators are considered: timestamp, source IP, destination IP, packet length, intermediate protocol, source port number, destination port number, and whether ACK/SYN is contained.
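The 8 key feature indicators must be turned into numbers before normalization and PCA. One possible encoding is sketched below; the dictionary keys (`timestamp`, `src_ip`, `has_ack_or_syn` and so on), the mapping of the protocol to a numeric protocol number, and the conversion of IP addresses to integers are all assumptions made for illustration, since the patent does not state how the fields are encoded.

```python
import ipaddress


def encode_packet(pkt: dict) -> list:
    """Turn one captured packet record into an 8-element numeric vector."""
    return [
        float(pkt["timestamp"]),
        # An IP address is mapped to its integer value (works for IPv4 and IPv6).
        float(int(ipaddress.ip_address(pkt["src_ip"]))),
        float(int(ipaddress.ip_address(pkt["dst_ip"]))),
        float(pkt["length"]),
        float(pkt["protocol"]),  # assumed numeric, e.g. 6 = TCP, 17 = UDP
        float(pkt["src_port"]),
        float(pkt["dst_port"]),
        1.0 if pkt["has_ack_or_syn"] else 0.0,
    ]
```

A record exported from a capture tool would first need to be parsed into such a dictionary; that parsing step is outside the scope of this sketch.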

Step 1, a labeled network traffic dataset is collected with Wireshark, and data cleaning and feature extraction are performed on it to remove repeated samples and invalid samples with incomplete data. The dataset features are represented as follows:

F＝[f₁,f₂,f₃…f_d]

Step 2, to facilitate analysis, the historical data are normalized as follows:

$$f_i^{*} = \frac{f_i}{\max(f_i)}$$

Step 3, the complex multidimensional features are reduced to low-dimensional features by principal component analysis feature dimensionality reduction, that is, the original d-dimensional dataset is converted into a k-dimensional dataset that retains the most important features of the dataset. PCA feature dimensionality reduction proceeds as follows:

(1) The original dataset is formed into a matrix X of d rows and n columns.

(2) The original dataset is centered, i.e.

$$x_i \leftarrow x_i - \frac{1}{n}\sum_{j=1}^{n} x_j$$

where x_i is one sample.

(3) The covariance matrix of the centered dataset is computed:

$$\text{Cov} = \frac{1}{n} X X^{T}$$

(4) The eigenvalues and corresponding eigenvectors of the covariance matrix are computed.

(5) The eigenvectors are arranged as the rows of a matrix, from top to bottom in descending order of their eigenvalues, and the first k rows are taken to form the matrix P.

(6) Y = PX is the data after reduction to k dimensions.

Finally, the principal component analysis algorithm is used to select the 6 feature indicators most relevant to the service type, as shown in Fig. 2; this reduces the complexity of subsequent model training.

Step 4, the labeled dataset is divided into a training set and a test set, the training set is input into the extreme gradient boosting (XGBoost) model for training, the parameters of the classifier are tuned, and the tuned model is tested. The objective function of the XGBoost model is:

$$\text{Obj} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda} + \gamma T$$

wherein:

$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$

An optimal split is then searched for with a greedy algorithm. The basic idea is to start from the root node, split one leaf node at a time, and choose among the candidate splits by evaluating each possible split. XGBoost has its own criterion for selecting the best split. Substituting the optimal leaf values into the loss function gives the minimum value of the loss function.
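The greedy split search can be sketched for a single feature. The source does not write out the split criterion; the sketch uses the standard XGBoost gain, Gain = 1/2 [G_L^2/(H_L+λ) + G_R^2/(H_R+λ) - (G_L+G_R)^2/(H_L+H_R+λ)] - γ, and the function name `best_split` is hypothetical.

```python
def best_split(values, g, h, lam, gamma):
    """Exact greedy split search over one feature.

    Returns (gain, threshold); (0.0, None) means no split improves the objective.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    G_total, H_total = sum(g), sum(h)
    parent = G_total ** 2 / (H_total + lam)
    best_gain, best_threshold = 0.0, None
    G_left = H_left = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        # Move sample i from the right child to the left child.
        G_left += g[i]
        H_left += h[i]
        if values[i] == values[nxt]:
            continue  # no threshold separates equal feature values
        G_right, H_right = G_total - G_left, H_total - H_left
        gain = 0.5 * (G_left ** 2 / (H_left + lam)
                      + G_right ** 2 / (H_right + lam) - parent) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = (values[i] + values[nxt]) / 2
    return best_gain, best_threshold
```

For feature values [1, 2, 3, 4] with g = [-1, -1, 1, 1], h = [1, 1, 1, 1], λ = 1 and γ = 0, the best threshold is 2.5 with a gain of 4/3. Raising γ to 2 makes every candidate gain negative, so no split is made, which shows how γ penalizes additional leaves.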

Step 5, the actually captured unlabeled dataset is input into the XGBoost classification model for testing to obtain the service type identification result.

The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention.

Claims

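1. A service type identification method based on the fusion of PCA and XGBoost, characterized by comprising the following steps:

Step S4.1, the extreme gradient boosting (XGBoost) classification model is:

$$\text{Obj} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda} + \gamma T$$

γ＝(γ_start,γ_end,μ_γ)

λ＝(λ_start,λ_end,μ_λ)

$$H_S = \Big[\big(\gamma_{\text{start}} + p\mu_\gamma,\ \lambda_{\text{start}} + q\mu_\lambda\big)\Big]_{p,q}$$

where p is an integer with 0 ≤ p ≤ (γ_end − γ_start)/μ_γ, and q is an integer with 0 ≤ q ≤ (λ_end − λ_start)/μ_λ.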

2. The method for identifying a service type based on the fusion of PCA and XGBoost as set forth in claim 1, wherein the multidimensional feature F in step S2 is specifically expressed as follows:

F＝[f₁,f₂,f₃…f_d]

3. The method for identifying a service type based on the fusion of PCA and XGBoost as claimed in claim 1, wherein step S3 reduces the multidimensional features to low-dimensional features by principal component analysis feature dimensionality reduction, the original multidimensional features being d-dimensional and the reduced low-dimensional features being k-dimensional, with k < d;

Step S3.3, computing the covariance matrix Cov of the matrix X' of step S3.2:

$$\text{Cov} = \frac{1}{n} X' X'^{T}$$

wherein the superscript T denotes the transpose;

4. The method for identifying a service type based on the fusion of PCA and XGBoost as claimed in claim 1, wherein the actual unlabeled dataset is captured with the network packet capture tool Wireshark and input into the XGBoost classification model with PCA dimensionality reduction, thereby obtaining a service type classification result.

5. The method for identifying a service type based on the fusion of PCA and XGBoost as claimed in claim 1, wherein in step S4 the highly relevant key performance indicators comprise source IP, destination IP, source port number, destination port number, protocol type and data length.