CN117650935A

CN117650935A - Interference flow identification method based on service application classification model

Info

Publication number: CN117650935A
Application number: CN202311677029.9A
Authority: CN
Inventors: 张斌; 蒋伟; 朱启勋; 汪文勇
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2023-12-08
Filing date: 2023-12-08
Publication date: 2024-03-05

Abstract

The invention discloses an interference flow identification method based on a service application classification model, which belongs to the technical field of broadband networks and is characterized by comprising the following steps: a. collecting a clean data set 1, splitting it by flow, and computing feature statistics to form feature vectors of length 54, then using all samples as training data to obtain an initialized model A; b. setting up a test acquisition environment; c. training an initialized model B on training samples, inputting test samples into the trained model B to obtain output test labels, and evaluating the test results using machine learning evaluation metrics; d. converting raw traffic data, session flow by session flow, into model input data of length 54, and inputting it into a model C to obtain a major-category application label or an interference flow label. The invention can distinguish interference flows within encrypted traffic, accurately identify normal service applications and interference flows, and improve classification accuracy and model generalization.

Description

Interference flow identification method based on service application classification model

Technical Field

The invention relates to the technical field of broadband networks, in particular to an interference flow identification method based on a service application classification model.

Background

As analysis and research on internet service application identification become increasingly widespread in academia and industry, accurate identification of internet service applications is becoming more and more important to researchers and enterprises. By identifying application labels from massive, easily acquired raw network traffic byte streams, enterprises can learn which applications users access at low cost and formulate marketing and user care strategies accordingly.

Because internet security has improved greatly, most network traffic is now encrypted, and AI models are commonly used to identify application labels from the side channel information that traffic leaks. The raw traffic produced when a user accesses a specific application usually contains many traffic sessions that do not belong to that application. These sessions are interference traffic for the application: general application data or generic session connection procedures, such as TCP handshake data, that do not reflect the characteristics of the application. Interference traffic can greatly reduce the accuracy and generalization of an AI model for application identification, so it is important to identify and reject interference traffic while identifying service applications.

Chinese patent document CN116827875A, published on September 29, 2023, discloses an APP traffic identification and denoising method based on multi-model decision, which comprises the following steps:

step 1: collecting network traffic, removing data corresponding to non-main IP subnets, and generating a label set Y for the network traffic according to the APP name;

step 2: extracting network flow features, which are statistical features extracted from the network flow: statistical features of packet length and time-related features of packet length are extracted for forward and reverse packets and preprocessed to form a feature vector set X of the flows;

step 3: after the flow features in X are matched one-to-one with the labels in Y, inputting them into a plurality of classifiers for training, and adjusting the parameters of each model according to its classification precision on a test set to generate a multi-model decision group { M1, M2 };

step 4: processing the given network flow as in step 2, inputting it into the multi-model decision group trained in step 3, and generating a classification result { A1, A2, ... };

step 5: making a decision according to the classification result obtained in step 4, either outputting the APP result or treating the result as noise and discarding it.

Compared with traditional port matching and deep packet inspection, the APP traffic identification and denoising method based on multi-model decision disclosed in that patent document is broadly applicable and simple to implement, supports identification of most current APPs, and can remove interference caused by noise in network flows. However, its accuracy in identifying and distinguishing the interference traffic contained in encrypted HTTPS service applications is low, and its classification accuracy and model generalization are poor.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides an interference flow identification method based on a service application classification model, which can distinguish interference flow from encrypted flow, accurately identify normal service application and interference flow, and improve classification accuracy and model generalization performance.

The invention is realized by the following technical scheme:

the interference flow identification method based on the service application classification model is characterized by comprising the following steps:

a. collecting a clean data set 1 and splitting it by flow, extracting the first 8 packets of each flow for instant messaging, network transmission, network storage, network games, network video, web browsing and mail service, and computing feature statistics to form feature vectors of length 54, each feature vector corresponding to one sample, then using all samples as training data to obtain an initialized model A;

b. constructing a test acquisition environment, and using a packet capture tool to capture traffic under the instant messaging, network transmission, network storage, network games, network video, web browsing and mail service application labels to form a verification sample data set 2, which is input into the initialized model A as its test samples for verification;

c. randomly splitting the samples into training samples and test samples, training an initialized model B on the training samples, inputting the test samples into the trained model B to obtain output test labels, and evaluating the test results using machine learning evaluation metrics;

d. training an initialized model C on the training samples, converting raw traffic data, session flow by session flow, into model input data of length 54, and inputting it into model C to obtain a major-category application label or an interference flow label.

The method further comprises step e: providing the obtained major-category application labels or interference flow labels as quasi-real-time flow identification result data through a unified real-time interface.

In step a, extracting the first 8 packets means extracting them according to the behavioral characteristics of the service application.

In step a, the clean data set 1 refers to a data set that contains no interference flows.

In step a, the number of samples equals the number of clean flows.

In step b, inputting data set 2 into the initialized model A as its test samples for verification specifically means that the labels of correctly classified flows are kept unchanged, and the labels of incorrectly classified flows are replaced with the interference flow label.

In the step c, the machine learning evaluation index comprises accuracy, recall, F1 score and confusion matrix.

In step c, evaluating the test results specifically means analyzing the interference flow identification performance: if the performance is good, the method proceeds to step d; otherwise, it returns to step a, and the sample size is increased or the hyperparameters of the initialized model A are changed.

The initialized model A, model B and model C are random forest models.

The F1 score refers to a harmonic mean of the precision and recall.
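The evaluation metrics named above can be illustrated with a minimal sketch. It assumes string class labels and a hypothetical helper named per_class_metrics; the source does not say how the metrics are computed in practice, so this is only one plain way to derive the confusion matrix, overall accuracy, and per-class precision, recall and F1.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return (accuracy, {label: (precision, recall, f1)}).

    Hypothetical helper for illustration; not taken from the source.
    """
    labels = sorted(set(y_true) | set(y_pred))
    # confusion[(t, p)] counts samples whose true class is t and predicted class is p
    confusion = Counter(zip(y_true, y_pred))
    metrics = {}
    for c in labels:
        tp = confusion[(c, c)]
        predicted_c = sum(confusion[(t, c)] for t in labels)  # column sum
        actual_c = sum(confusion[(c, p)] for p in labels)     # row sum
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        denom = precision + recall
        # F1 is the harmonic mean of precision and recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[c] = (precision, recall, f1)
    accuracy = sum(confusion[(c, c)] for c in labels) / len(y_true)
    return accuracy, metrics
```

Reporting the per-class values rather than accuracy alone matters here because the interference class is created by relabeling and may be much smaller than the application classes.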

The basic principle of the invention is as follows:

Interference flow identification is essentially the conversion of a semi-supervised learning problem into a supervised one. The trained model A classifies a second batch of application-labeled data that contains unlabeled interference flows, and the flows it misclassifies are given the interference flow label; with these labels added, the semi-supervised problem becomes a supervised one. Once the interference flow labels have been obtained, the data merged with the live-network sample data can be used to train an interference flow identification model, and can also be provided separately as an interference flow data set.
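The relabeling rule at the heart of this principle can be sketched in a few lines. The label string and the function name below are hypothetical, and the predictions are assumed to come from an already trained model A; the sketch only shows the rule "keep the label when model A agrees with the capture label, otherwise mark the flow as interference".

```python
INTERFERENCE = "interference"  # hypothetical name for the interference flow label


def relabel_with_interference(capture_labels, predicted_labels):
    """Keep a flow's capture label when model A predicts the same label;
    otherwise replace it with the interference label."""
    if len(capture_labels) != len(predicted_labels):
        raise ValueError("one prediction is required per captured flow")
    return [
        captured if captured == predicted else INTERFERENCE
        for captured, predicted in zip(capture_labels, predicted_labels)
    ]
```

Because the rule depends only on agreement between two label sequences, it is independent of the classifier that produced the predictions.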

Because only the side channel information of the original traffic is used, and a domain name field related to the privacy information of the user is not needed, the interference traffic can be distinguished from the encrypted traffic, and the security is better.

The beneficial effects of the invention are mainly shown in the following aspects:

1. The method comprises: a. collecting a clean data set 1 and splitting it by flow, extracting the first 8 packets of each flow for instant messaging, network transmission, network storage, network games, network video, web browsing and mail service, and computing feature statistics to form feature vectors of length 54, each feature vector corresponding to one sample, then using all samples as training data to obtain an initialized model A; b. constructing a test acquisition environment, and using a packet capture tool to capture traffic under the instant messaging, network transmission, network storage, network games, network video, web browsing and mail service application labels to form a verification sample data set 2, which is input into the initialized model A as its test samples for verification; c. randomly splitting the samples into training samples and test samples, training an initialized model B on the training samples, inputting the test samples into the trained model B to obtain output test labels, and evaluating the test results using machine learning evaluation metrics; d. training an initialized model C on the training samples, converting raw traffic data, session flow by session flow, into model input data of length 54, and inputting it into model C to obtain a major-category application label or an interference flow label. Compared with the prior art, the invention can distinguish interference flows within encrypted traffic, accurately identify normal service applications and interference flows, and improve classification accuracy and model generalization.

2. The invention can eliminate the interference flow so as to purify the network flow, and is convenient for establishing a more accurate service application scene flow data set.

3. The invention collects only the first 8 packets of each session, so it has essentially no impact on user services.

4. The models are random forest models rather than deep learning models, so they have few parameters, train and predict quickly, and are easy to deploy on a platform.

5. The invention uses only the side channel information of the raw traffic, such as packet length and packet inter-arrival time, and does not need the domain name field, which relates to user privacy. It can therefore distinguish interference flows within encrypted traffic while offering better security.
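The source states that each flow is summarized as a feature vector of length 54 built from side channel information such as packet length and inter-arrival time, but it does not list the 54 features. The sketch below is therefore an assumed layout, not the patented feature set: 3 packet groups (all, forward, backward) times 2 quantities (length, inter-arrival gap) times 9 summary statistics gives 54 values. The packet tuple format and all function names are hypothetical.

```python
import statistics

MAX_PACKETS = 8  # the source uses the first 8 packets of each flow


def _stats(values):
    """Nine summary statistics (assumed set); all zeros for an empty group."""
    if not values:
        return [0.0] * 9
    ordered = sorted(values)
    n = len(ordered)
    return [
        float(n),                           # count
        float(sum(ordered)),                # total
        float(ordered[0]),                  # minimum
        float(ordered[-1]),                 # maximum
        statistics.fmean(ordered),          # mean
        statistics.pstdev(ordered),         # population standard deviation
        float(statistics.median(ordered)),  # median
        float(ordered[n // 4]),             # crude lower quartile
        float(ordered[(3 * n) // 4]),       # crude upper quartile
    ]


def flow_features(packets):
    """packets: list of (timestamp_seconds, length_bytes, direction),
    with direction +1 for forward and -1 for backward (assumed format)."""
    head = sorted(packets, key=lambda p: p[0])[:MAX_PACKETS]
    groups = [
        head,                            # both directions
        [p for p in head if p[2] > 0],   # forward only
        [p for p in head if p[2] < 0],   # backward only
    ]
    features = []
    for group in groups:
        lengths = [p[1] for p in group]
        gaps = [b[0] - a[0] for a, b in zip(group, group[1:])]
        features += _stats(lengths) + _stats(gaps)
    return features  # 3 groups x 2 quantities x 9 statistics = 54 values
```

Nothing in this layout reads packet payloads or domain names, which matches the privacy argument above; the particular statistics could be swapped for others as long as the vector length stays at 54.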

Drawings

The invention will be further specifically described with reference to the drawings and detailed description below:

FIG. 1 is a flow chart of the present invention.

Detailed Description

Example 1

Referring to fig. 1, an interference flow identification method based on a service application classification model includes the following steps:

a. collecting a clean data set 1 and splitting it by flow, extracting the first 8 packets of each flow for instant messaging, network transmission, network storage, network games, network video, web browsing and mail service, and computing feature statistics to form feature vectors of length 54, each feature vector corresponding to one sample, then using all samples as training data to obtain an initialized model A;

b. constructing a test acquisition environment, and using a packet capture tool to capture traffic under the instant messaging, network transmission, network storage, network games, network video, web browsing and mail service application labels to form a verification sample data set 2, which is input into the initialized model A as its test samples for verification;

c. randomly splitting the samples into training samples and test samples, training an initialized model B on the training samples, inputting the test samples into the trained model B to obtain output test labels, and evaluating the test results using machine learning evaluation metrics;

d. training an initialized model C on the training samples, converting raw traffic data, session flow by session flow, into model input data of length 54, and inputting it into model C to obtain a major-category application label or an interference flow label.

This embodiment is the most basic implementation. Compared with the prior art, it can distinguish interference flows within encrypted traffic, accurately identify normal service applications and interference flows, and improve classification accuracy and model generalization.

Example 2

Preferably, the method further comprises step e: providing the obtained major-category application labels or interference flow labels as quasi-real-time flow identification result data through a unified real-time interface.

The embodiment is a preferred implementation manner, and can remove interference flow so as to purify network flow, thereby facilitating establishment of a more accurate service application scene flow data set.

Example 3

Further preferably, in step a, extracting the first 8 packets means extracting them according to the behavioral characteristics of the service application.

In step a, the number of samples equals the number of clean flows.

This embodiment is a further preferred embodiment. Because only the first 8 packets of each session are collected, there is essentially no impact on user services.
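Splitting captured traffic by flow and keeping only the first 8 packets of each session can be sketched as follows. The source does not define how a flow is keyed; the sketch assumes the common bidirectional 5-tuple convention, and the packet dictionary fields and function names are hypothetical.

```python
from collections import defaultdict


def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Direction-independent key, so both directions of a session share one flow."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (min(a, b), max(a, b), proto)


def first_packets_per_flow(packets, limit=8):
    """Group packets by flow and keep the earliest `limit` packets of each flow.

    packets: iterable of dicts with keys ts, src_ip, src_port, dst_ip,
    dst_port, proto and length (assumed record format).
    """
    flows = defaultdict(list)
    for pkt in sorted(packets, key=lambda p: p["ts"]):
        key = flow_key(pkt["src_ip"], pkt["src_port"],
                       pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
        if len(flows[key]) < limit:
            flows[key].append(pkt)  # later packets of the flow are ignored
    return dict(flows)
```

Stopping at 8 packets per flow bounds the work per session regardless of how long the session lasts, which is consistent with the low impact on user services described above.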

Example 4

In step a, the number of samples equals the number of clean flows.

Still more preferably, in step b, inputting data set 2 into the initialized model A as its test samples for verification specifically means that the labels of correctly classified flows are kept unchanged, and the labels of incorrectly classified flows are replaced with the interference flow label.

This embodiment is a preferred implementation. The models are random forest models rather than deep learning models, so they have few parameters, train and predict quickly, and are easy to deploy on a platform.

Example 5

In step a, the number of samples equals the number of clean flows.

This embodiment is the best mode. Only the side channel information of the raw traffic, such as packet length and packet inter-arrival time, is used, and the domain name field, which relates to user privacy, is not needed. Interference flows can therefore be distinguished within encrypted traffic, with better security.

The invention is experimentally verified as follows:

The clean data set 1 for model A comes from data in 7 major categories collected by a broadband access server on the live network: instant messaging, network transmission, network storage, network games, network video, web browsing and mail service. After sample balancing, 17.4 GB of data yields 84292 sessions, corresponding to 84292 samples. The traffic data that is input into model A to receive interference flow labels comes from 5.9 GB of data in the same 7 major categories, captured by starting a packet capture tool while the applications are launched directly on a PC. Each major category contains 600-1000 MB of stored packets; 41563 sessions are generated, corresponding to 41563 samples.

The models are random forest classification models, obtained by directly calling a machine learning package with the default random forest parameters.

On the data captured in step b, model A achieves a classification accuracy of 88.89%; 3682 samples in total are misclassified, and these are uniformly relabeled as interference flow. The samples are then merged with the live-network samples of step a, giving a combined total of 125855 samples, which are randomly split into training samples and test samples. After model B is trained on the training samples, its classification accuracy on the test samples is 90.61%. The sample size of each application category is shown in Table 1.
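The merge and random split described here can be sketched as below. The source gives neither the split ratio nor a random seed, so the 20% test fraction and the seed are assumptions, and merge_and_split is a hypothetical helper name. The only figure checked against the source is the combined sample count, 84292 + 41563 = 125855.

```python
import random


def merge_and_split(clean_samples, relabeled_samples, test_fraction=0.2, seed=0):
    """Merge the two sample sets and split them at random.

    test_fraction and seed are assumed values, not taken from the source.
    Returns (training_samples, test_samples).
    """
    merged = list(clean_samples) + list(relabeled_samples)
    rng = random.Random(seed)  # local generator keeps the split reproducible
    rng.shuffle(merged)
    n_test = int(len(merged) * test_fraction)
    return merged[n_test:], merged[:n_test]


# Sample counts reported in the source for the two data sets
CLEAN_COUNT = 84292
CAPTURED_COUNT = 41563
TOTAL_COUNT = CLEAN_COUNT + CAPTURED_COUNT
```

Shuffling before the split mixes live-network samples and PC-captured samples in both partitions, so model B is evaluated on both sources.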

The sample distribution, the confusion matrix of the model A predictions, the confusion matrix of the model B predictions, and the precision P, recall R and F1 score F1 of the model B predictions for each category are shown in Table 2.

TABLE 1

TABLE 2

Application category	P	R	F1
Instant messaging	0.928	0.8845	0.9057
Network transmission	0.9143	0.9249	0.9195
Network storage	0.9782	0.9874	0.9828
Network game	0.8361	0.7612	0.7969
Network video	0.9291	0.8429	0.8839
Web browsing	0.7719	0.898	0.8302
Mail service	0.9845	0.9922	0.9883
Interference flow	0.8019	0.9043	0.85

Table 2 shows that the precision, recall and F1 score of the interference flow class are all above 80%, which demonstrates that the invention is effective at identifying interference flows.
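As a consistency check on Table 2, the F1 column can be recomputed from the P and R columns using the harmonic mean definition. The sketch below copies the reported values from the table; the names TABLE_2, f1_from and max_gap are hypothetical.

```python
# (P, R, F1) per category, copied from Table 2
TABLE_2 = {
    "Instant messaging":    (0.928,  0.8845, 0.9057),
    "Network transmission": (0.9143, 0.9249, 0.9195),
    "Network storage":      (0.9782, 0.9874, 0.9828),
    "Network game":         (0.8361, 0.7612, 0.7969),
    "Network video":        (0.9291, 0.8429, 0.8839),
    "Web browsing":         (0.7719, 0.898,  0.8302),
    "Mail service":         (0.9845, 0.9922, 0.9883),
    "Interference flow":    (0.8019, 0.9043, 0.85),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Largest difference between a recomputed and a reported F1 value
max_gap = max(abs(f1_from(p, r) - f1) for p, r, f1 in TABLE_2.values())

# The three interference flow metrics, for the "above 80%" statement
interference_min = min(TABLE_2["Interference flow"])
```

The recomputed F1 values agree with the reported ones to within 0.001 for all eight rows, and the smallest interference flow metric is the precision of 0.8019.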

Claims

1. The interference flow identification method based on the service application classification model is characterized by comprising the following steps:

2. The interference flow identification method based on the service application classification model according to claim 1, wherein: the method further comprises step e, providing the obtained major-category application labels or interference flow labels as quasi-real-time flow identification result data through a unified real-time interface.

3. The interference flow identification method based on the service application classification model according to claim 1, wherein: in step a, extracting the first 8 packets means extracting them according to the behavioral characteristics of the service application.

4. The interference flow identification method based on the service application classification model according to claim 1, wherein: in step a, the clean data set 1 refers to a data set that contains no interference flows.

5. The interference flow identification method based on the service application classification model according to claim 1, wherein: in step a, the number of samples equals the number of clean flows.

6. The interference flow identification method based on the service application classification model according to claim 1, wherein: in step b, inputting data set 2 into the initialized model A as its test samples for verification specifically means that the labels of correctly classified flows are kept unchanged, and the labels of incorrectly classified flows are replaced with the interference flow label.

7. The interference flow identification method based on the service application classification model according to claim 1, wherein: in the step c, the machine learning evaluation index comprises accuracy, recall, F1 score and confusion matrix.

8. The interference flow identification method based on the service application classification model according to claim 1, wherein: in step c, evaluating the test results specifically means analyzing the interference flow identification performance: if the performance is good, the method proceeds to step d; otherwise, it returns to step a, and the sample size is increased or the hyperparameters of the initialized model A are changed.