CN114048795B - Service type identification method based on PCA and XGBoost fusion - Google Patents

Service type identification method based on PCA and XGBoost fusion

Info

Publication number
CN114048795B
Authority
CN
China
Prior art keywords
xgboost
dimensional
data set
parameter
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111202293.8A
Other languages
Chinese (zh)
Other versions
CN114048795A (en
Inventor
刘旭
胡俊华
朱晓荣
杨龙祥
朱洪波
江婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202111202293.8A
Publication of CN114048795A
Application granted
Publication of CN114048795B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2134 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a service type identification method based on the fusion of PCA and XGBoost, comprising the following steps: step S1, collecting a labeled network traffic data set whose service types include HTTP, NTP, DNS, QQ, WeChat, video and mail; step S2, performing data cleaning and feature extraction on the network traffic data set of step S1 to obtain a network traffic data set containing multidimensional features; step S3, reducing the multidimensional features of step S2 to low-dimensional features by principal component analysis (PCA) feature dimension reduction to obtain a labeled low-dimensional data set; step S4, obtaining a trained extreme gradient boosting (XGBoost) classification model; and step S5, inputting the network traffic data set to be tested into the XGBoost classification model of step S4 to obtain a service type classification result. The invention both reduces the complexity of the identification method and improves the accuracy of service type identification.

Description

Service type identification method based on PCA and XGBoost fusion
Technical Field
The invention relates to the technical field of communication networks, in particular to a service type identification method based on PCA and XGBoost fusion.
Background
With the continuous development of information technology, Internet traffic grows year by year and new network services emerge endlessly. These new services have greatly promoted social development and brought telecom operators a large number of customers, but networks now carry many kinds of encrypted traffic, which strongly affects both the underlying traffic model and the upper-layer application model of the network. To improve network management and network services and to keep the network environment secure, effectively identifying the encrypted traffic of various application services, and thereby building an operable and manageable network, has become a key research direction.
Conventional traffic type identification methods include port-based traffic identification and traffic identification based on Deep Packet Inspection (DPI). Port-based identification classifies network traffic by the well-known port numbers in the TCP/UDP packet header. It was initially efficient and easy to implement for real-time traffic classification, but many network applications now avoid well-known ports to escape detection, and some use dynamic port numbers. Port-based classification therefore no longer reflects the real traffic, and its classification accuracy is low. DPI-based identification is essentially a packet filtering technology: in addition to parsing the headers of the L2 data link layer, the L3 network layer and the L4 transport layer, DPI also analyses the L7 application layer payload, so that application types and their contents can be identified. However, most traffic now uses encryption, which prevents inspection of packet payloads, so the classification accuracy of DPI is also limited. The current trend is to use machine learning methods for IP traffic classification.
In recent years, artificial intelligence based on machine learning has attracted wide attention in computer vision, natural language processing, speech recognition, medical imaging and other areas, and in many fields it far outperforms conventional solutions. Machine learning has proven sound and effective for classification tasks, and machine learning and data mining are increasingly applied in cyberspace security, so machine learning also makes it possible to solve the encrypted traffic classification problem that conventional methods cannot solve. Conventional service type identification methods cannot identify encrypted traffic and suffer from low identification accuracy.
Disclosure of Invention
The invention aims to solve the technical problem of overcoming the defects of the prior art and providing a service type identification method based on the fusion of PCA and XGBoost, which not only can reduce the complexity of the identification method, but also can improve the accuracy of service type identification.
The invention adopts the following technical scheme for solving the technical problems:
The invention provides a service type identification method based on PCA and XGBoost fusion, which comprises the following steps:
Step S1, collecting a labeled network traffic data set, wherein the service types of the network traffic data set include HTTP, NTP, DNS, QQ, WeChat, video and mail;
Step S2, performing data cleaning and feature extraction on the network traffic data set of step S1 to obtain a network traffic data set containing multidimensional features;
Step S3, reducing the multidimensional features of step S2 to low-dimensional features by principal component analysis feature dimension reduction to obtain a labeled low-dimensional data set;
Step S4, selecting, from the labeled low-dimensional data set of step S3, key performance indicators that are highly relevant for measuring service types, the key performance indicators forming a labeled data set; dividing the data set into a training set and a test set; inputting the training set into an extreme gradient boosting (XGBoost) classification model for training; tuning the learning rate γ and the regularization parameter λ of the XGBoost classification model with an improved parameter tuning method to obtain the learning rate γ and regularization parameter λ best suited to the network traffic data set; and testing the tuned XGBoost classification model to obtain a trained XGBoost classification model;
the method for tuning the learning rate γ and the regularization parameter λ specifically comprises the following steps:
Step S4.1, the XGBoost classification model is:
$$Obj = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j + \lambda} + \gamma T$$
where T is the number of leaf nodes, Obj is the objective function, $G_j$ is the first-derivative term of the Taylor expansion of the objective function at the j-th leaf node, and $H_j$ is the second-derivative term of the Taylor expansion of the objective function at the j-th leaf node;
Step S4.2, tuning the learning rate γ and the regularization parameter λ of step S4.1:
Step S4.2.1, setting the search spaces and search step sizes of the learning rate γ and the regularization parameter λ as follows:
$$\Phi_\gamma = (\gamma_{start}, \gamma_{end}, \mu_\gamma)$$
$$\Theta_\lambda = (\lambda_{start}, \lambda_{end}, \mu_\lambda)$$
where $\gamma_{start}$ and $\gamma_{end}$ are the lower and upper boundaries of the search space $\Phi_\gamma$ of the learning rate γ, and $\mu_\gamma$ is the search step size of the learning rate γ; $\lambda_{start}$ and $\lambda_{end}$ are the lower and upper boundaries of the search space $\Theta_\lambda$ of the regularization parameter λ, and $\mu_\lambda$ is the search step size of the regularization parameter λ;
Step S4.2.2, generating a two-dimensional search parameter set matrix $H_S$ from the search spaces and search step sizes, whose entries are defined as:
$$H_S = \left[\left(\gamma_{start} + p\mu_\gamma,\ \lambda_{start} + q\mu_\lambda\right)\right]$$
where p and q are integers indexing the grid points;
Step S4.2.3, for each parameter set in $H_S$ of step S4.2.2, evaluating the average classification precision of the XGBoost classification model under that parameter set, and selecting the parameter set with the highest average classification precision; if exactly one parameter set attains the highest average classification precision, it is the selected parameter set; if several parameter sets attain it, selecting among them the one with the smallest $\lambda_{start} + q\mu_\lambda$;
Step S4.2.4, the values $\gamma_{start} + p\mu_\gamma$ and $\lambda_{start} + q\mu_\lambda$ in the parameter set selected in step S4.2.3 are, respectively, the optimal learning rate γ and regularization parameter λ of the XGBoost classification model for the labeled low-dimensional data set of step S3;
Step S5, inputting the network traffic data set to be tested into the XGBoost classification model of step S4 to obtain a service type classification result.
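The tuning procedure of steps S4.2.1 to S4.2.4 can be sketched as a plain grid search. This is a minimal illustration, not the patented implementation: the function names `build_grid` and `tune`, the `evaluate` callback (which stands in for training the XGBoost model and measuring its average classification precision) and the assumption that each search space is given as a `(start, end, step)` triple with `start` as the lower bound are all hypothetical.

```python
from typing import Callable, List, Tuple


def build_grid(start: float, end: float, step: float) -> List[float]:
    """Enumerate start, start + step, ... up to end (inclusive)."""
    # Small tolerance so float rounding does not drop the last grid point.
    count = int((end - start) / step + 1e-9)
    return [round(start + i * step, 10) for i in range(count + 1)]


def tune(
    evaluate: Callable[[float, float], float],
    gamma_space: Tuple[float, float, float],
    lambda_space: Tuple[float, float, float],
) -> Tuple[float, float]:
    """Return the (gamma, lambda) pair with the highest score.

    Ties on the score are broken by the smallest lambda, as in step S4.2.3.
    If score and lambda both tie, the first pair encountered is kept; the
    source does not specify this case, so that choice is an assumption.
    """
    gammas = build_grid(*gamma_space)    # index p in H_S
    lambdas = build_grid(*lambda_space)  # index q in H_S
    best = None  # (score, gamma, lambda)
    for g in gammas:
        for lam in lambdas:
            score = evaluate(g, lam)
            better = best is None or score > best[0]
            tie_smaller_lambda = (
                best is not None and score == best[0] and lam < best[2]
            )
            if better or tie_smaller_lambda:
                best = (score, g, lam)
    return best[1], best[2]


if __name__ == "__main__":
    # Toy stand-in for "train the model and measure average precision".
    def toy_score(gamma: float, lam: float) -> float:
        return 0.9 if gamma == 0.2 else 0.8

    print(tune(toy_score, (0.1, 0.3, 0.1), (1.0, 3.0, 1.0)))
```

In the toy run every λ scores the same for γ = 0.2, so the tie-break picks the smallest λ in the grid.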
As a further optimization of the service type identification method based on the fusion of PCA and XGBoost, the multidimensional feature F in step S2 is expressed as follows:
$$F = [f_1, f_2, f_3, \ldots, f_d]$$
F is a vector containing d features, $f_i$ denotes the i-th key feature index, $1 \le i \le d$, and each $f_i$ is normalized by its maximum value as follows:
$$f_i' = \frac{f_i}{\max(f_i)}$$
where $\max(f_i)$ is the maximum value taken by the i-th key feature index and $f_i'$ is the i-th key feature index after normalization.
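The maximum-value normalization can be illustrated with a short sketch. The function name `max_normalize` and the row-per-sample layout are assumptions made for the example, as is the guard that maps a feature whose maximum is zero to 0.0, which the source does not discuss.

```python
from typing import List


def max_normalize(samples: List[List[float]]) -> List[List[float]]:
    """Divide every feature by the maximum value that feature takes.

    `samples` holds one row per sample and one column per feature f_i.
    """
    d = len(samples[0])
    maxima = [max(row[i] for row in samples) for i in range(d)]
    return [
        # Assumption: a feature whose maximum is 0 is left at 0.0.
        [row[i] / maxima[i] if maxima[i] != 0 else 0.0 for i in range(d)]
        for row in samples
    ]


if __name__ == "__main__":
    print(max_normalize([[2.0, 10.0], [4.0, 5.0]]))
```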
As a further optimization of the service type identification method based on the fusion of PCA and XGBoost, step S3 reduces the multidimensional features to low-dimensional features by principal component analysis feature dimension reduction, where the original multidimensional features are d-dimensional and the reduced low-dimensional features are k-dimensional, with k < d:
Step S3.1, forming the network traffic data set containing multidimensional features of step S2 into a matrix X of d rows and n columns, where the data set has n samples and each sample has d-dimensional features;
Step S3.2, decentering the d-row, n-column matrix X of step S3.1 to obtain the decentered matrix X';
Step S3.3, computing the covariance matrix Cov of the matrix X' of step S3.2:
$$Cov = \frac{1}{n} X' (X')^{T}$$
where the superscript T denotes the transpose;
Step S3.4, computing the eigenvalues and corresponding eigenvectors of the covariance matrix;
Step S3.5, arranging the eigenvectors as the rows of a matrix, from top to bottom in descending order of their eigenvalues, and taking the first k rows to form the matrix P;
Step S3.6, Y = PX, where Y is the low-dimensional feature obtained by reducing the dimension to k by principal component analysis feature dimension reduction.
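Steps S3.1 to S3.6 can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions: the function name `pca_reduce` is hypothetical, the covariance uses a 1/n factor, and the projection is applied to the decentered matrix X' (a common convention), whereas the text writes Y = PX.

```python
import numpy as np


def pca_reduce(X: np.ndarray, k: int):
    """Reduce a (d, n) matrix (d features, n samples) to k dimensions."""
    d, n = X.shape
    # S3.2: subtract the mean of each feature (row) to decenter the data.
    X_centered = X - X.mean(axis=1, keepdims=True)
    # S3.3: covariance matrix of shape (d, d).
    cov = (X_centered @ X_centered.T) / n
    # S3.4: eigh suits symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # S3.5: sort eigenvectors by descending eigenvalue, keep k as rows of P.
    order = np.argsort(eigvals)[::-1]
    P = eigvecs[:, order[:k]].T
    # S3.6: project onto the k principal directions.
    Y = P @ X_centered
    return Y, P


if __name__ == "__main__":
    X = np.array([[1.0, 2.0, 3.0, 4.0],
                  [2.0, 4.0, 6.0, 8.0]])
    Y, P = pca_reduce(X, k=1)
    print(Y.shape, P.shape)
```

In the demo the two features are perfectly correlated, so a single principal direction, proportional to (1, 2), carries all of the variance.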
As a further optimization of the service type identification method based on the fusion of PCA and XGBoost, an actual unlabeled data set is captured with the network packet capture tool Wireshark and input into the XGBoost classification model with PCA dimension reduction to obtain a service type classification result.
As a further optimization of the service type identification method based on the fusion of PCA and XGBoost, in step S4 the highly relevant key performance indicators include source IP, destination IP, source port number, destination port number, protocol type and data length.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
The method solves the problems that conventional service type identification methods cannot identify encrypted traffic and have low identification accuracy; by combining principal component analysis with the XGBoost algorithm, it balances test-set accuracy against algorithm complexity, so that efficient and reliable service type identification can be achieved.
Drawings
Fig. 1 is a flow chart of a service type identification system based on PCA and XGBoost provided by the present invention.
Fig. 2 is a schematic diagram of feature dimension reduction using Principal Component Analysis (PCA) algorithm.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
The invention applies the ideas of feature dimension reduction and deep learning to the service type identification scenario. Network traffic data is first captured with the Wireshark packet capture tool, and a service type identification method based on the fusion of PCA and XGBoost is proposed. Using feature dimension reduction, complex multidimensional features are reduced to low-dimensional features with the principal component analysis (PCA) algorithm, which lowers the complexity of model training; an extreme gradient boosting model (XGBoost) is then trained on a large, reliable labeled data set; finally, service type identification is performed on the actually captured traffic. By combining PCA and XGBoost, the method reduces algorithm complexity and can improve the accuracy of service type identification.
Based on the service type identification application scenario, the invention provides a service type identification method based on the fusion of PCA and XGBoost, as shown in figure 1, the method comprises the following steps:
Step S1, a labeled network traffic data set is collected with Wireshark, covering service types such as HTTP, NTP, DNS, QQ, WeChat, video and mail. The features of the data set are represented as follows:
$$F = [f_1, f_2, f_3, \ldots, f_d]$$
F is a vector containing d features, and $f_i$ denotes the i-th key feature index.
Step S2, data cleaning and feature extraction are performed on the labeled data set collected in step S1: repeated samples and invalid samples with incomplete data are removed, and each key feature is normalized by its maximum value, as follows:
$$f_i' = \frac{f_i}{\max(f_i)}$$
where $f_i'$ is the i-th key feature index after normalization and $\max(f_i)$ is the maximum value taken by the i-th key feature index.
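The cleaning part of step S2, removing repeated samples and samples with incomplete data, can be sketched as below. The function name `clean_samples`, the tuple-per-sample layout and the use of `None` to mark a missing field are assumptions made for illustration; the source does not say how incomplete data is represented.

```python
from typing import List, Optional, Tuple

Sample = Tuple[Optional[float], ...]


def clean_samples(samples: List[Sample]) -> List[Sample]:
    """Drop incomplete samples and exact duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for sample in samples:
        # Assumption: a missing field is represented as None.
        if any(value is None for value in sample):
            continue
        if sample in seen:
            continue
        seen.add(sample)
        cleaned.append(sample)
    return cleaned


if __name__ == "__main__":
    raw = [(1.0, 2.0), (1.0, 2.0), (None, 3.0), (4.0, 5.0)]
    print(clean_samples(raw))
```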
Step S3, before the classification model is trained, the data dimension is reduced while the accuracy of the classification model is preserved: the complex multidimensional features of step S2 are reduced to low-dimensional features by principal component analysis feature dimension reduction, that is, the original d-dimensional data set is converted into a k-dimensional data set with k < d, so that the most important characteristics of the data are retained. The principal component analysis dimension reduction proceeds as follows:
Step S3.1, forming the network traffic data set containing multidimensional features of step S2 into a matrix X of d rows and n columns, where the data set has n samples and each sample has d-dimensional features;
Step S3.2, decentering the d-row, n-column matrix X of step S3.1 to obtain the decentered matrix X';
Step S3.3, computing the covariance matrix Cov of the matrix X' of step S3.2:
$$Cov = \frac{1}{n} X' (X')^{T}$$
where the superscript T denotes the transpose;
Step S3.4, computing the eigenvalues and corresponding eigenvectors of the covariance matrix;
Step S3.5, arranging the eigenvectors as the rows of a matrix, from top to bottom in descending order of their eigenvalues, and taking the first k rows to form the matrix P;
Step S4, selecting, from the labeled low-dimensional data set of step S3, key performance indicators that are highly relevant for measuring service types, the key performance indicators forming a labeled data set; the data set is divided into a training set and a test set, the training set is input into the extreme gradient boosting model (XGBoost) for training, the parameters of the classifier are tuned, and the tuned model is tested. The XGBoost model takes the following form:
Step S4.1, defining the objective function:
$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
where $l(y_i, \hat{y}_i)$ is the loss function, which measures the difference between the predicted value $\hat{y}_i$ and the target value $y_i$; in this invention it is the difference between the predicted service type label and the actual service type label, and K is the number of trees.
$\Omega(f_k)$ is the regularization term, defined as:
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
where T is the number of leaf nodes, λ is the regularization parameter, γ is the learning rate, and $w_j$ is the predicted value of the j-th leaf node. Because XGBoost is a forward stagewise algorithm, the prediction at round t is the prediction of the previous t-1 rounds plus the current weak learner. Each iteration therefore finds the CART tree that most reduces the loss function, so the objective function can be rewritten as:
$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + const$$
Here const reflects that, at round t, the regularization terms of the previous t-1 iterations can be treated as constant. Approximating the objective function by a second-order Taylor expansion gives:
$$Obj^{(t)} \approx \sum_{i=1}^{n}\left[l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t) + const$$
where $g_i = \partial_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right)$ and $h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right)$ are the first and second derivatives of the loss function. Removing the terms that are constant in round t
gives the objective function:
$$Obj^{(t)} = \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t)$$
The objective function depends only on the first and second derivatives of the error function at each data point. After the regularization term is expanded, the objective function can be rewritten as a sum over leaf nodes:
$$Obj^{(t)} = \sum_{j=1}^{T}\left[G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^2\right] + \gamma T$$
where:
$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$
$I_j$ is the index set of the samples assigned to leaf node j.
Assuming that the structure of the decision tree has been determined, the predicted value on each leaf node is obtained by setting the derivative of the objective function with respect to $w_j$ to zero, which gives:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$
Thus, the final objective function can be written as:
$$Obj = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j + \lambda} + \gamma T$$
where T is the number of leaf nodes, Obj is the objective function, $G_j$ is the first-derivative term of the Taylor expansion of the objective function at the j-th leaf node, and $H_j$ is the second-derivative term of the Taylor expansion of the objective function at the j-th leaf node;
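The closed-form leaf value and the final objective can be checked numerically with a small sketch. The function names `leaf_weight` and `structure_score`, and the representation of a tree as a list of per-leaf `(G_j, H_j)` pairs, are hypothetical; γ is used exactly as it appears in the formula, as the coefficient of the leaf count T.

```python
from typing import List, Tuple


def leaf_weight(G: float, H: float, lam: float) -> float:
    """Optimal leaf value w_j* = -G_j / (H_j + lambda)."""
    return -G / (H + lam)


def structure_score(leaves: List[Tuple[float, float]], lam: float, gamma: float) -> float:
    """Obj = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T."""
    T = len(leaves)
    return -0.5 * sum(G * G / (H + lam) for G, H in leaves) + gamma * T


if __name__ == "__main__":
    leaves = [(4.0, 3.0), (-2.0, 1.0)]  # (G_j, H_j) for each leaf
    print([leaf_weight(G, H, lam=1.0) for G, H in leaves])
    print(structure_score(leaves, lam=1.0, gamma=0.5))
```

A lower score means a better tree structure, which is why the score is the quantity compared when candidate trees are evaluated.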
Step S4.2, tuning the learning rate γ and the regularization parameter λ of step S4.1:
Step S4.2.1, setting the search spaces and search step sizes of the learning rate γ and the regularization parameter λ as follows:
$$\Phi_\gamma = (\gamma_{start}, \gamma_{end}, \mu_\gamma)$$
$$\Theta_\lambda = (\lambda_{start}, \lambda_{end}, \mu_\lambda)$$
where $\gamma_{start}$ and $\gamma_{end}$ are the lower and upper boundaries of the search space $\Phi_\gamma$ of the learning rate γ, and $\mu_\gamma$ is the search step size of the learning rate γ; $\lambda_{start}$ and $\lambda_{end}$ are the lower and upper boundaries of the search space $\Theta_\lambda$ of the regularization parameter λ, and $\mu_\lambda$ is the search step size of the regularization parameter λ;
Step S4.2.2, generating a two-dimensional search parameter set matrix $H_S$ from the search spaces and search step sizes, whose entries are defined as:
$$H_S = \left[\left(\gamma_{start} + p\mu_\gamma,\ \lambda_{start} + q\mu_\lambda\right)\right]$$
where p and q are integers indexing the grid points;
Step S4.2.3, for each parameter set in $H_S$ of step S4.2.2, evaluating the average classification precision of the XGBoost classification model under that parameter set, and selecting the parameter set with the highest average classification precision; if exactly one parameter set attains the highest average classification precision, it is the selected parameter set; if several parameter sets attain it, selecting among them the one with the smallest $\lambda_{start} + q\mu_\lambda$;
Step S4.2.4, the values $\gamma_{start} + p\mu_\gamma$ and $\lambda_{start} + q\mu_\lambda$ in the parameter set selected in step S4.2.3 are, respectively, the optimal learning rate γ and regularization parameter λ of the XGBoost classification model for the labeled low-dimensional data set of step S3;
Step S5, inputting the network traffic data set to be tested into the XGBoost classification model of step S4 to obtain a service type classification result.
To illustrate the effectiveness of the method of the invention, an example is given. The example data is captured with the network packet capture tool Wireshark in the system shown in figure 1; the protocol and service types obtained are mainly the HTTP, NTP and DNS protocols, QQ, WeChat, Messenger video, e-mail and the like. Eight key feature indexes are considered: timestamp, source IP, destination IP, data packet length, intermediate protocol, source port number, destination port number, and whether ACK/SYN is contained.
Step 1, a labeled network traffic data set is collected with Wireshark, and data cleaning and feature extraction are performed on it to remove repeated samples and invalid samples with incomplete data. The features of the data set are represented as follows:
$$F = [f_1, f_2, f_3, \ldots, f_d]$$
Step 2, to facilitate analysis, the historical data is normalized as follows:
$$f_i' = \frac{f_i}{\max(f_i)}$$
where $f_i'$ is the i-th key feature index after normalization and $\max(f_i)$ is the maximum value taken by the i-th key feature index.
Step 3, the complex multidimensional features are reduced to low-dimensional features by principal component analysis feature dimension reduction, that is, the original d-dimensional data set is converted into a k-dimensional data set that retains the most important features of the data. The PCA feature dimension reduction proceeds as follows:
(1) The original data set is formed into a matrix X of d rows and n columns.
(2) The original data set is decentered, i.e.
$$x_i \leftarrow x_i - \frac{1}{n}\sum_{j=1}^{n} x_j$$
where $x_i$ is one sample.
(3) The covariance matrix of the decentered data set is computed:
$$Cov = \frac{1}{n} X X^{T}$$
(4) The eigenvalues and corresponding eigenvectors of the covariance matrix are computed.
(5) The eigenvectors are arranged as the rows of a matrix, from top to bottom in descending order of their eigenvalues, and the first k rows form the matrix P.
(6) Y = PX is the data after the dimension is reduced to k.
Finally, the 6 feature indexes most relevant for measuring service types are selected with the principal component analysis algorithm, as shown in fig. 2, which reduces the complexity of subsequent model training.
Step 4, the labeled data set is divided into a training set and a test set, the training set is input into the extreme gradient boosting model (XGBoost) for training, the parameters of the classifier are tuned, and the tuned model is tested on the test set. The objective function of the XGBoost model is:
$$Obj = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j + \lambda} + \gamma T$$
where:
$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$
A greedy algorithm is then used to search for the optimal split. The basic idea is to start from the root node and split one leaf node at a time, evaluating every candidate split and choosing among them; XGBoost has a specific criterion for selecting the best split. Substituting the predicted values into the loss function gives the minimum value of the loss function.
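The greedy split search can be illustrated with a short sketch. The source does not state the split criterion, so the sketch assumes the standard XGBoost gain, which follows from the objective function above: the score of the two children minus the score of the parent, minus the penalty γ for the extra leaf. The function names `split_gain` and `best_split` are hypothetical, and the samples are assumed to be already sorted by the value of the feature being split.

```python
from typing import List, Optional, Tuple


def split_gain(G_L: float, H_L: float, G_R: float, H_R: float,
               lam: float, gamma: float) -> float:
    """Reduction of the objective obtained by splitting one leaf in two."""
    def score(G: float, H: float) -> float:
        return G * G / (H + lam)

    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma


def best_split(g: List[float], h: List[float],
               lam: float, gamma: float) -> Tuple[float, Optional[int]]:
    """Scan every split position and return (best gain, split index).

    g and h hold the per-sample first and second derivatives. A split at
    index i sends samples [0, i) to the left child and [i, n) to the right.
    """
    G_total, H_total = sum(g), sum(h)
    best_gain, best_index = float("-inf"), None
    G_L = H_L = 0.0
    for i in range(1, len(g)):
        # Prefix sums give the left-child statistics in one pass.
        G_L += g[i - 1]
        H_L += h[i - 1]
        gain = split_gain(G_L, H_L, G_total - G_L, H_total - H_L, lam, gamma)
        if gain > best_gain:
            best_gain, best_index = gain, i
    return best_gain, best_index


if __name__ == "__main__":
    print(best_split([4.0, -2.0, -1.0], [3.0, 1.0, 1.0], lam=1.0, gamma=0.5))
```

A split is worth making only when its gain is positive, which is how the coefficient γ discourages trees with many leaves.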
Step 5, the actually captured unlabeled data set is input into the XGBoost classification model for testing to obtain the service type identification result.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention.

Claims (5)

1. The service type identification method based on the fusion of PCA and XGBoost is characterized by comprising the following steps:
step S1, collecting a network flow data set with a label, wherein the service types of the network flow data set comprise HTTP, NTP, DNS, QQ, weChat, video and mail;
Step S2, carrying out data cleaning and feature extraction on the network flow data set in the step S1, thereby obtaining a network flow data set containing multidimensional features;
S3, simplifying the multi-dimensional features in the step S2 into low-dimensional features by using a method for reducing the dimension of the main component analysis features to obtain a low-dimensional data set with labels;
Step S4, selecting key performance indexes with high relevance for measuring service types from the low-dimensional dataset with the labels in step S3, wherein the key performance indexes form a dataset with labels, the dataset is divided into a training set and a testing set, the training set is input into a limit gradient lifting XGBoost classification model for training, an improved parameter tuning method is adopted for tuning the learning rate gamma and regularization parameter lambda of the XGBoost classification model, the learning rate gamma and regularization parameter lambda which are most suitable for the network flow dataset are obtained, and the XGBoost classification model with the parameters being tuned is tested, so that a trained limit gradient lifting XGBoost classification model is obtained;
the method for optimizing the learning rate gamma and the regularization parameter lambda specifically comprises the following steps:
step S4.1, limit gradient lifting XGBoost classification model is:
Wherein T refers to the number of leaf nodes, obj refers to an objective function, G j refers to a first derivative of the objective function in the Taylor expansion of the jth leaf node, and H j refers to a second derivative of the objective function in the Taylor expansion of the jth leaf node;
Step S4.2, parameter tuning is performed on the learning rate gamma and the regularization parameter lambda in step S4.1:
Step S4.2.1, setting a search space Φ and a search step size μ of the learning rate γ and the regularization parameter λ, where the settings are as follows:
γ=(γstartendγ)
λ=(λstartendλ)
wherein, gamma start and gamma end are the upper boundary and the lower boundary of the search space phi γ of the learning rate gamma, and mu γ is the search step length of the learning rate gamma; lambda start and lambda end are the upper and lower boundaries of the search space Θ λ of the regularization parameter lambda, respectively, and mu λ is the search step size of the regularization parameter lambda;
Step S4.2.2, generating a two-dimensional search parameter set matrix H S according to the set search space and search step length, wherein the definition is as follows:
Where p is an integer, q is an integer,
Step S4.2.3, for each parameter set in $H_S$ of step S4.2.2, evaluating the average classification precision of the XGBoost classification model under that parameter set, and selecting the parameter set with the highest average classification precision; if exactly one parameter set attains the highest average classification precision, that parameter set is selected, and if several parameter sets attain it, the one with the smallest $\lambda_{start} + q\mu_\lambda$ among them is selected;
Step S4.2.4, the values $\gamma_{start} + p\mu_\gamma$ and $\lambda_{start} + q\mu_\lambda$ of the parameter set selected in step S4.2.3 are respectively the optimal learning rate γ and the optimal regularization parameter λ of the XGBoost classification model for the labeled low-dimensional data set of step S3;
Step S5, inputting the network flow data set to be tested into the extreme gradient boosting (XGBoost) classification model of step S4 to obtain a service type classification result.
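The parameter search of steps S4.2.1 to S4.2.4 can be sketched in Python as below. This is only an illustration under stated assumptions: the names `structure_score`, `build_grid`, `select_parameters` and `evaluate` are hypothetical and do not come from the patent; `structure_score` uses the standard XGBoost structure score, which is assumed here to be the objective of step S4.1; and `evaluate` stands in for whatever procedure trains the model with a given (γ, λ) and returns its average classification precision. The claim only specifies that ties on the score are broken by the smallest λ, so ties that also share the same λ are resolved here by keeping the first candidate found.

```python
from typing import Callable, List, Sequence, Tuple


def structure_score(G: Sequence[float], H: Sequence[float],
                    lam: float, gamma: float) -> float:
    """Standard XGBoost structure score for a tree whose leaf j has
    gradient sum G[j] and Hessian sum H[j] (assumed form of step S4.1)."""
    leaf_terms = sum(g * g / (h + lam) for g, h in zip(G, H))
    return -0.5 * leaf_terms + gamma * len(G)


def build_grid(start: float, end: float, step: float) -> List[float]:
    """Candidate values start, start + step, ... that do not exceed end."""
    # Small tolerance so floating-point rounding does not drop the last point.
    steps = int((end - start) / step + 1e-9)
    return [start + i * step for i in range(steps + 1)]


def select_parameters(gammas: Sequence[float], lambdas: Sequence[float],
                      evaluate: Callable[[float, float], float]
                      ) -> Tuple[float, float]:
    """Return the (gamma, lambda) pair with the highest score;
    among equal scores, prefer the smallest lambda (step S4.2.3)."""
    best = None  # (score, gamma, lambda)
    for g in gammas:
        for lam in lambdas:
            score = evaluate(g, lam)  # hypothetical evaluation callback
            if (best is None or score > best[0]
                    or (score == best[0] and lam < best[2])):
                best = (score, g, lam)
    if best is None:
        raise ValueError("empty search grid")
    return best[1], best[2]
```

In practice `evaluate` would wrap model training and validation; how the patent's γ and λ map onto the parameters of a particular XGBoost implementation is not stated in the source and is left open here.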
2. The method for identifying a service type based on the fusion of PCA and XGBoost as set forth in claim 1, wherein the multidimensional feature F in step S2 is specifically expressed as follows:
$$F = [f_1, f_2, f_3, \ldots, f_d]$$
wherein F is a vector containing d features, $f_i$ denotes the i-th key feature index, $1 \le i \le d$, and each $f_i$ is normalized by its maximum value as follows:
$$\tilde{f}_i = \frac{f_i}{\max(f_i)}$$
wherein $\max(f_i)$ is the maximum value taken by the i-th key feature index, and $\tilde{f}_i$ is the i-th key feature index after normalization.
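The maximum-value normalization of claim 2 can be illustrated with a short Python sketch. The function name `max_normalize` and the row-per-sample layout are assumptions made for this example; the sketch also assumes non-negative feature values (as with packet lengths or port numbers) and maps an all-zero feature column to zeros, a case the claim does not address.

```python
from typing import List


def max_normalize(samples: List[List[float]]) -> List[List[float]]:
    """Divide each feature (column) by the maximum value of that feature.

    `samples` holds one row per sample and one column per feature.
    """
    col_max = [max(col) for col in zip(*samples)]
    return [
        # Guard against division by zero for an all-zero feature column.
        [value / m if m != 0 else 0.0 for value, m in zip(row, col_max)]
        for row in samples
    ]


if __name__ == "__main__":
    print(max_normalize([[2.0, 10.0], [4.0, 5.0]]))
```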
3. The method for identifying a service type based on the fusion of PCA and XGBoost as claimed in claim 1, wherein the feature dimension reduction by principal component analysis in step S3 reduces the multi-dimensional features to low-dimensional features, the original multi-dimensional features being d-dimensional and the reduced low-dimensional features being k-dimensional with k < d, and comprises the following steps:
Step S3.1, forming a matrix X of d rows and n columns from the network flow data set containing the multidimensional features of step S2, wherein the data set has n samples and each sample has d-dimensional features;
Step S3.2, decentering the matrix X of d rows and n columns of step S3.1 to obtain a decentered matrix X';
Step S3.3, computing the covariance matrix Cov of the matrix X' of step S3.2:
$$Cov = \frac{1}{n} X' (X')^{T}$$
wherein the superscript T denotes the transpose;
Step S3.4, computing the eigenvalues of the covariance matrix and their corresponding eigenvectors;
Step S3.5, arranging the eigenvectors as the rows of a matrix, ordered from top to bottom by descending eigenvalue, and taking the first k rows to form a matrix P;
Step S3.6, computing Y = PX, wherein Y is the low-dimensional feature obtained by reducing the dimension to k by principal component analysis.
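Steps S3.1 to S3.6 can be sketched with NumPy as follows. The function name `pca_reduce` is hypothetical, and two choices here are assumptions rather than statements from the claim: the covariance is scaled by 1/n, and the projection in the last step is applied to the decentered matrix X' (the usual PCA convention), whereas the claim writes Y = PX.

```python
import numpy as np


def pca_reduce(X: np.ndarray, k: int) -> np.ndarray:
    """Reduce a d x n matrix (features x samples) to k x n."""
    d, n = X.shape
    if not 0 < k <= d:
        raise ValueError("k must satisfy 0 < k <= d")
    # S3.2: subtract each feature's mean so every row is zero-centered.
    X_centered = X - X.mean(axis=1, keepdims=True)
    # S3.3: covariance matrix of the centered data (d x d).
    cov = X_centered @ X_centered.T / n
    # S3.4: eigen-decomposition; eigh suits symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # S3.5: eigenvectors as rows, sorted by descending eigenvalue; keep k.
    order = np.argsort(eigvals)[::-1]
    P = eigvecs[:, order[:k]].T
    # S3.6: project onto the top-k principal directions.
    return P @ X_centered


if __name__ == "__main__":
    X = np.array([[1.0, 2.0, 3.0, 4.0],
                  [2.0, 4.0, 6.0, 8.0]])
    print(pca_reduce(X, 1).shape)
```

The sign of each principal direction is arbitrary, so the projected values may differ by a factor of -1 between runs or libraries without affecting the downstream classifier's separability.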
4. The method for identifying a service type based on the fusion of PCA and XGBoost as claimed in claim 1, wherein an unlabeled actual data set is captured with the network packet capture tool Wireshark and input into the extreme gradient boosting (XGBoost) classification model following PCA dimension reduction, to obtain a service type classification result.
5. The method for identifying a service type based on the fusion of PCA and XGBoost as defined in claim 1, wherein in step S4, the key performance indicators with high correlation include source IP, destination IP, source port number, destination port number, protocol type and data length.
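As an illustration of how the six indicators named in claim 5 might be turned into a numeric feature vector before normalization, a minimal Python sketch follows. Everything about the encoding is an assumption made for this example and not taken from the patent: the field names (`src_ip`, `dst_ip`, `src_port`, `dst_port`, `protocol`, `length`), the conversion of IP addresses to integers, and the use of IANA protocol numbers for the protocol type.

```python
import ipaddress
from typing import Any, Dict, List

# IANA-assigned protocol numbers for a few common protocols.
PROTOCOL_CODES = {"ICMP": 1, "TCP": 6, "UDP": 17}


def encode_packet(packet: Dict[str, Any]) -> List[float]:
    """Map one captured packet record to a six-element numeric vector.

    The dictionary keys are hypothetical field names for this sketch.
    """
    return [
        float(int(ipaddress.ip_address(packet["src_ip"]))),  # source IP
        float(int(ipaddress.ip_address(packet["dst_ip"]))),  # destination IP
        float(packet["src_port"]),                           # source port
        float(packet["dst_port"]),                           # destination port
        float(PROTOCOL_CODES[packet["protocol"]]),           # protocol type
        float(packet["length"]),                             # data length
    ]


if __name__ == "__main__":
    print(encode_packet({
        "src_ip": "10.0.0.1", "dst_ip": "192.168.0.1",
        "src_port": 443, "dst_port": 51514,
        "protocol": "TCP", "length": 1500,
    }))
```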
CN202111202293.8A 2021-10-15 2021-10-15 Service type identification method based on PCA and XGBoost fusion Active CN114048795B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111202293.8A CN114048795B (en) 2021-10-15 2021-10-15 Service type identification method based on PCA and XGBoost fusion

Publications (2)

Publication Number Publication Date
CN114048795A CN114048795A (en) 2022-02-15
CN114048795B CN114048795B (en) 2024-11-08

Family

ID=80205105

Country Status (1)

Country Link
CN (1) CN114048795B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114978585B (en) * 2022-04-12 2024-02-27 国家计算机网络与信息安全管理中心 Deep learning symmetric encryption protocol identification method based on flow characteristics
CN115277585B (en) * 2022-07-08 2023-07-28 南京邮电大学 Multi-granularity business flow identification method based on machine learning
CN116975401A (en) * 2023-09-19 2023-10-31 杭州美创科技股份有限公司 Database field identification method, device, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107682216A (en) * 2017-09-01 2018-02-09 南京南瑞集团公司 A kind of network traffics protocol recognition method based on deep learning
CN111586728A (en) * 2020-04-30 2020-08-25 南京邮电大学 Small sample characteristic-oriented heterogeneous wireless network fault detection and diagnosis method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109309630B (en) * 2018-09-25 2021-09-21 深圳先进技术研究院 Network traffic classification method and system and electronic equipment
CN109639481B (en) * 2018-12-11 2020-10-27 深圳先进技术研究院 Deep learning-based network traffic classification method and system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant