CN104468262B

CN104468262B - A semantic-sensitive network protocol identification method and system

Info

Publication number: CN104468262B
Application number: CN201410652834.0A
Authority: CN
Inventors: 云晓春; 张永铮; 王鹏; 王一鹏; 周宇
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2014-11-17
Filing date: 2014-11-17
Publication date: 2017-12-15
Anticipated expiration: 2034-11-17
Also published as: CN104468262A

Abstract

The invention relates to a semantic-sensitive network protocol identification method and system. In the modeling stage, a set of network data packets of a specific application protocol is taken as input, and a keyword model of the analyzed protocol is constructed using the Latent Dirichlet Allocation method. In the training stage, classification features are extracted from the data packets according to the protocol keyword model, and the resulting keyword feature vectors are used as input to a supervised machine learning method, which is trained on an offline training data set to obtain a classification model of the analyzed protocol. In the classification stage, classification features are extracted from each data packet according to the protocol keyword model, and the protocol classification model is used to decide whether the network data packet under test belongs to the target protocol. The invention fully exploits the latent protocol semantic information in network messages and effectively identifies a variety of network protocols.

Description

A semantic-sensitive network protocol identification method and system

Technical field

The invention belongs to the field of network technology, and in particular relates to a semantic-sensitive network protocol identification method and system.

Background

Protocol identification is the process of mapping network data flows to specific application protocols. It has important applications in many areas of networking and security, such as network measurement, intrusion detection and prevention, and botnet behavior detection. Take intrusion detection and prevention systems as an example: a traditional system treats each packet payload as a sequence of bytes and matches that byte sequence against malware signatures (usually expressed as a set of regular expressions). Because this coarse-grained signature check ignores the application protocol structure of the packet payload, its reliability is severely limited.

Modern intrusion detection and prevention systems are becoming more semantically sensitive. Specifically, they parse network packets according to the message format of the application protocol in order to obtain the fields of the analyzed protocol. Several application protocol parsing tools, such as FlowSifter, UltraPAC, binpac, and GAPA, have been proposed in earlier research. All of these tools require the specification of the analyzed protocol in order to generate a parser for it. However, many application protocols on the Internet are proprietary, and no protocol fingerprint specification is publicly available for them. According to statistics on backbone traffic from the Internet2 NetFlow organization, more than 40% of network data flows belong to unidentified application protocols. The network communication protocols used by malware and botnets likewise come with no specification from their designers. To parse the network data flows of an unknown application protocol, protocol inference must first be performed to obtain the protocol fingerprint information. Network monitoring tools such as Wireshark, NetDude, SNORT and BRO also need application protocol parsers to do their work.

Network protocol identification methods fall into three categories according to what they examine: methods based on transport-layer ports, methods based on packet payloads, and methods based on the statistical behavior of network data flows. The present invention takes the payload of network data packets as its basic object of study. Existing methods in this area can be roughly divided into two classes: (1) methods based on protocol parsing; (2) methods based on protocol signatures. The present invention belongs to the second class. A signature-based method relies only on analysis of the packet payload and does not depend on the executable code of the application. Earlier work on the automatic construction of protocol signatures did not use the latent semantic information present in data packets, that is, the associations between the syntactic elements of a packet. As a result, such work could not reach the goal of achieving higher accuracy with fewer classification features. In addition, compared with earlier work, the present invention makes fewer assumptions about the analyzed network protocol itself.

Summary of the invention

The purpose of the present invention is to design and implement a semantic-sensitive network protocol identification method and system that fully exploits the latent protocol semantic information in network messages during protocol identification and that, while maintaining high identification precision and recall, is also general and robust in practice.

The invention is motivated by the continuing growth of diverse, unknown network traffic. The novel protocol identification method and system are designed to need minimal human effort, so that the identification of a specific application protocol is fully automated.

Specifically, the present invention adopts the following technical solution:

A semantic-sensitive network protocol identification method, comprising a modeling stage, a training stage and a classification stage;

In the modeling stage, a set of network data packets of a specific application protocol is taken as input, and a keyword model of the analyzed protocol is constructed using the Latent Dirichlet Allocation method;

In the training stage, classification features are extracted from the data packets according to the protocol keyword model obtained in the modeling stage, and the resulting keyword feature vectors are used as input to a supervised machine learning method, which is trained on an offline training data set to obtain a classification model of the analyzed protocol;

In the classification stage, classification features are extracted from each data packet according to the protocol keyword model obtained in the modeling stage, and the protocol classification model output by the training stage is used to decide whether the network data packet under test belongs to the target protocol.

Further, the modeling stage comprises the following steps:

1) Collect network data packets belonging to a specific application protocol, so that the network data packets are divided into two classes: the set of data packets that belong to the application protocol to be analyzed, and the set of data packets that do not;

2) Use the n-gram model to convert each network data packet into a sequence of n-gram elements, which serve as its basic units; an n-gram is a subsequence of n consecutive elements of a given sequence;

3) Construct the protocol keyword model of the protocol to be analyzed using the Latent Dirichlet Allocation method.

Further, constructing the protocol keyword model with the Latent Dirichlet Allocation method comprises the following steps:

1) Assign a random keyword index number z_(m,i) to every n-gram w_(m,i) in the set D containing M data packets; here w_(m,i) denotes the i-th n-gram in data packet m, z_(m,i) is the keyword index number of that n-gram, and N_m is the number of n-gram elements in data packet m;

2) Let z_¬(m,i) denote all keyword index numbers other than z_(m,i). Holding z_¬(m,i) fixed, generate a new keyword index number z_(m,i) for n-gram w_(m,i) by sampling from the posterior probability distribution p(z_(m,i) = k | z_¬(m,i), w); here α and β are given hyperparameters, n_k^(t) is the number of times element t of the n-gram dictionary is assigned to keyword k, n_m^(k) is the number of occurrences of keyword k in message m, and W is the number of n-gram elements in the n-gram dictionary;

3) Using the value of z_(m,i) obtained by Gibbs sampling, update the stale values in the posterior probability distribution;

4) Repeat the sampling operation above for every tuple (m, i) in the data set. If the Gibbs sampling convergence condition L is reached, the algorithm terminates and returns the final keyword index numbers; otherwise repeat steps 1) to 3);

5) Construct the protocol keyword model from the keyword index numbers obtained through steps 1) to 4), where K denotes the number of protocol keywords.
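The sampling distribution in step 2) is not reproduced in the text. As a hedged sketch only: the quantities defined there (α, β, the two counts and W) match the standard collapsed Gibbs sampler for Latent Dirichlet Allocation with symmetric hyperparameters, whose conditional is written below. The notation n_k^(t), n_m^(k) and the estimate φ are assumptions taken from that standard formulation, not from the patent; the subscript ¬(m,i) means the count excludes the n-gram currently being resampled.

```latex
p\bigl(z_{m,i}=k \mid \mathbf{z}_{\neg(m,i)}, \mathbf{w}\bigr) \;\propto\;
\frac{n_{k,\neg(m,i)}^{(t)} + \beta}
     {\sum_{t'=1}^{W} n_{k,\neg(m,i)}^{(t')} + W\beta}
\,\bigl(n_{m,\neg(m,i)}^{(k)} + \alpha\bigr),
\qquad t = w_{m,i}

% A common estimate of the per-keyword n-gram distribution after sampling:
\varphi_{k,t} = \frac{n_{k}^{(t)} + \beta}{\sum_{t'=1}^{W} n_{k}^{(t')} + W\beta},
\qquad k = 1,\dots,K
```

Under this reading, a keyword k is favored for an n-gram when that n-gram has often been assigned to k elsewhere (first factor) and when k is already frequent in the same packet (second factor).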

A semantic-sensitive network protocol identification system using the above method, comprising:

A modeling unit, which takes a set of network data packets of a specific application protocol as input and constructs a keyword model of the analyzed protocol using the Latent Dirichlet Allocation model;

A training unit, which extracts classification features from the data packets according to the protocol keyword model obtained by the modeling unit, takes the resulting keyword feature vectors as input, and trains a supervised machine learning method on an offline training data set to obtain a classification model of the analyzed protocol;

A classification unit, which extracts classification features from each data packet according to the protocol keyword model obtained by the modeling unit and uses the protocol classification model output by the training unit to decide whether the network data packet under test belongs to the target protocol.

The key technical points of the present invention are:

1) It makes full use of the latent semantic information in protocol messages. The invention can distinguish the different meanings that the same n-gram carries in different messages. Such occurrences may have different semantics and should therefore be assigned to different protocol keywords. Earlier methods that infer the protocol message format from network data flows do not handle this well, because most of them rely on counting how often each string occurs and thus ignore the context in which it appears.

2) In addition, the invention can discover associations between different n-grams. In a protocol message, several n-grams together can form one element of the protocol message format. For example, in an SMTP message, the 3-grams “250” and “OK” together characterize a protocol element that confirms a mail session. Through protocol keyword identification, the invention aggregates related n-grams into a single protocol keyword.

The method of the present invention can effectively identify a variety of network protocols. Compared with the published related art, it has the following advantages:

1. The method handles application protocol identification for both connection-oriented protocols (such as TCP) and connectionless protocols (such as UDP);

2. The method is based on payload statistics of data packets and assumes no prior knowledge of the protocol specification. It is therefore applicable to text, binary and encrypted protocols;

3. As a packet-based solution, the method does not need to reassemble IP packets into application-layer messages. It is therefore suitable for both per-packet and per-flow protocol classification schemes.

4. The method works in real network environments for both large, long flows (such as SMTP) and small, short flows (such as FTP).

Description of the drawings

Figure 1 is a flowchart of the modeling stage of the semantic-sensitive network protocol identification method.

Figure 2 is a flowchart of the training stage of the semantic-sensitive network protocol identification method.

Figure 3 is a flowchart of the classification stage of the semantic-sensitive network protocol identification method.

Figure 4 is a flowchart of protocol keyword model construction based on the Latent Dirichlet Allocation method.

Figure 5 is an architecture diagram of the semantic-sensitive network protocol identification system.

Detailed description

To make the above objects, features and advantages of the present invention easier to understand, the invention is further described below through specific embodiments and the accompanying drawings.

The semantic-sensitive network protocol identification method of the present invention takes network data flows as input and automatically and accurately identifies the flows of the analyzed protocol within mixed network traffic. The method analyzes only the payload of IP packets; it requires no reverse engineering of the program's executable code and no prior knowledge from the protocol specification. The method also handles both connection-oriented protocols (such as TCP) and connectionless protocols (such as UDP). It consists of three main stages: a modeling stage, a training stage, and a classification stage.

The modeling stage consists of three modules: data collection, packet n-gram generation, and keyword model construction (keyword identification). Its flowchart is shown in Figure 1, and it is described as follows:

1. Data collection: The data collection module collects network data packets belonging to a specific application protocol. The network data packets are thereby divided into two classes: the set of data packets that belong to the application protocol to be analyzed, and the set of data packets that do not.

2. Packet n-gram generation: This operation uses the n-gram model to convert each network data packet into a sequence of n-gram elements, which serve as its basic units. An n-gram, as used in the present invention, is a subsequence of n consecutive elements of a given sequence (a sequence of at least n elements). Given a set of network data packets, the n-gram model decomposes a packet of m bytes, b_1, b_2, ..., b_m, into the sequence of n-grams (n ≤ m): (b_1, b_2, ..., b_n), (b_2, b_3, ..., b_(n+1)), ..., (b_(m-n+1), b_(m-n+2), ..., b_m). In practice, usually only the W most frequent n-gram elements are selected to form the n-gram dictionary.

3. Keyword identification: The keyword identification module constructs the protocol keyword model of the protocol to be analyzed using the Latent Dirichlet Allocation (LDA) method. The output of the modeling stage is the protocol keyword model of the analyzed network protocol.
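The n-gram generation step can be illustrated with a minimal Python sketch. The function names `byte_ngrams`, `build_dictionary` and `encode` are hypothetical, and dropping n-grams that fall outside the top-W dictionary is an assumption; the text only states that the W most frequent n-grams form the dictionary.

```python
from collections import Counter


def byte_ngrams(payload: bytes, n: int) -> list[bytes]:
    """Return the m - n + 1 overlapping n-grams of an m-byte payload."""
    return [payload[i:i + n] for i in range(len(payload) - n + 1)]


def build_dictionary(payloads: list[bytes], n: int, w: int) -> dict[bytes, int]:
    """Map the w most frequent n-grams of the corpus to integer ids."""
    counts = Counter(g for p in payloads for g in byte_ngrams(p, n))
    return {gram: idx for idx, (gram, _) in enumerate(counts.most_common(w))}


def encode(payload: bytes, n: int, dictionary: dict[bytes, int]) -> list[int]:
    """Rewrite a payload as dictionary ids (assumption: unknown n-grams are dropped)."""
    return [dictionary[g] for g in byte_ngrams(payload, n) if g in dictionary]
```

For instance, the 6-byte payload `b"250 OK"` yields four 3-grams, `b"250"`, `b"50 "`, `b"0 O"` and `b" OK"`, which matches the count m - n + 1 = 4.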

The training stage consists of four modules: data collection, packet n-gram generation, feature extraction, and classifier learning. Its flowchart is shown in Figure 2, and it is described as follows:

1. Data collection: Same as step 1 of the modeling stage.

2. Packet n-gram generation: Same as step 2 of the modeling stage.

3. Feature extraction: Classification features are extracted from each network data packet according to the protocol keyword model obtained in step 3 of the modeling stage. This step computes the probability with which each keyword occurs in the data packet and uses these probabilities to form the packet's K-dimensional feature vector.

4. Classifier learning: Using a supervised learning method and the classification features produced by the feature extraction module in step 3, a binary classifier is built for the analyzed application protocol.
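The text does not say how the per-packet keyword probabilities are computed. The sketch below shows one simple possibility, stated as an assumption: each n-gram id is softly assigned to keywords in proportion to that keyword's probability of emitting it, and the assignments are averaged over the packet. The name `keyword_features` and the K x W matrix `phi` (keyword-to-n-gram probabilities) are hypothetical.

```python
def keyword_features(token_ids: list[int], phi: list[list[float]]) -> list[float]:
    """Return a K-dimensional vector of keyword occurrence probabilities.

    phi[k][t] is the probability of n-gram id t under keyword k.
    Assumption: keywords are weighted equally a priori.
    """
    k = len(phi)
    if not token_ids:
        # No usable n-grams: fall back to a uniform vector.
        return [1.0 / k] * k
    features = [0.0] * k
    for t in token_ids:
        column = [phi[j][t] for j in range(k)]
        total = sum(column)
        for j in range(k):
            # Normalize so each n-gram contributes a probability vector.
            features[j] += column[j] / total if total > 0 else 1.0 / k
    return [f / len(token_ids) for f in features]
```

The result sums to 1, so it can be read as the share of each protocol keyword in the packet and passed directly to a supervised learner.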

The classification stage consists of three modules: packet n-gram generation, feature extraction, and the classifier. Its flowchart is shown in Figure 3, and it is described as follows:

1. Packet n-gram generation: Same as step 2 of the training stage.

2. Feature extraction: Same as step 3 of the training stage.

3. Classifier: Using the classification feature vector obtained through steps 1 and 2 and the classification model obtained in step 4 of the training stage, the protocol class of each unlabeled network data packet is determined. There are two possible outputs: the packet belongs to the target protocol, or it does not.
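To make the training and classification steps concrete, here is a minimal sketch of a binary classifier over K-dimensional keyword feature vectors. It uses a decision stump (a one-level decision tree) purely as a stand-in for the supervised learner; it is not C4.5 and not the patent's implementation, and the names `train_stump` and `classify` are hypothetical.

```python
def train_stump(vectors: list[list[float]], labels: list[bool]):
    """Pick the (feature, threshold, polarity) with the fewest training errors."""
    best = None
    for j in range(len(vectors[0])):
        for threshold in sorted({v[j] for v in vectors}):
            for polarity in (True, False):
                # A packet is predicted "target protocol" when the
                # comparison result equals the polarity.
                errors = sum(
                    ((v[j] >= threshold) == polarity) != y
                    for v, y in zip(vectors, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, threshold, polarity)
    return best[1:]


def classify(stump, vector: list[float]) -> bool:
    """Return True if the packet is judged to belong to the target protocol."""
    j, threshold, polarity = stump
    return (vector[j] >= threshold) == polarity
```

In use, `train_stump` would be fitted once on labeled feature vectors from the offline training set, and `classify` would then be applied to the feature vector of each unlabeled packet.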

The novelty of the overall method lies in the construction of the protocol keyword model, which can be divided into the steps below. Figure 4 shows the flowchart of protocol keyword identification and keyword model construction based on the Latent Dirichlet Allocation (LDA) method.

The input to the protocol keyword model construction process is a set of messages belonging to a specific application protocol. L is the termination condition of the construction process. Its output is the protocol keyword model of the analyzed network protocol. The method builds the protocol keyword model with the Latent Dirichlet Allocation (LDA) method, in the following steps:

1. First, assign a random keyword index number z_(m,i) to every n-gram w_(m,i) in the set D containing M data packets. Here w_(m,i) denotes the i-th n-gram in data packet m, z_(m,i) is the keyword index number of that n-gram, and N_m is the number of n-gram elements in data packet m;

2. Next, let z_¬(m,i) denote all keyword index numbers other than z_(m,i). Holding z_¬(m,i) fixed, a new keyword index number z_(m,i) is generated for n-gram w_(m,i) by sampling from the posterior probability distribution p(z_(m,i) = k | z_¬(m,i), w). Here α and β are given hyperparameters, n_k^(t) is the number of times element t of the n-gram dictionary is assigned to keyword k, n_m^(k) is the number of occurrences of keyword k in message m, and W is the number of n-gram elements in the n-gram dictionary.

3. Using the value of z_(m,i) obtained by Gibbs sampling, update the stale values in the posterior probability distribution;

4. Repeat the sampling operation above for every tuple (m, i) in the data set. If the Gibbs sampling convergence condition L is reached, the algorithm terminates and returns the final keyword index numbers; otherwise repeat steps 1-3.

5. Construct the protocol keyword model from the keyword index numbers obtained through steps 1-4.
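The five steps above follow the shape of the standard collapsed Gibbs sampler for LDA, which can be sketched in plain Python as follows. Several points are assumptions rather than details from the patent: the function name `gibbs_lda`, a fixed number of sweeps standing in for the convergence condition L, a single random initialization (as in the usual sampler), and the smoothed estimate `phi` used as the keyword model.

```python
import random


def gibbs_lda(docs, num_keywords, vocab_size, alpha, beta, sweeps, seed=0):
    """docs: list of packets, each a list of n-gram ids in [0, vocab_size)."""
    rng = random.Random(seed)
    n_kt = [[0] * vocab_size for _ in range(num_keywords)]  # n-gram t assigned to keyword k
    n_mk = [[0] * num_keywords for _ in docs]               # keyword k within packet m
    n_k = [0] * num_keywords                                # all assignments to keyword k
    z = []
    # Step 1: random keyword index for every n-gram.
    for m, doc in enumerate(docs):
        z_m = []
        for t in doc:
            k = rng.randrange(num_keywords)
            z_m.append(k)
            n_kt[k][t] += 1
            n_mk[m][k] += 1
            n_k[k] += 1
        z.append(z_m)
    # Step 4: sweep over all (m, i); fixed sweep count replaces condition L.
    for _ in range(sweeps):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[m][i]
                # Exclude the current assignment from the counts.
                n_kt[k][t] -= 1
                n_mk[m][k] -= 1
                n_k[k] -= 1
                # Step 2: sample from the conditional posterior.
                weights = [
                    (n_kt[j][t] + beta) / (n_k[j] + vocab_size * beta)
                    * (n_mk[m][j] + alpha)
                    for j in range(num_keywords)
                ]
                k = rng.choices(range(num_keywords), weights=weights)[0]
                # Step 3: refresh the counts with the new assignment.
                z[m][i] = k
                n_kt[k][t] += 1
                n_mk[m][k] += 1
                n_k[k] += 1
    # Step 5 (assumed form): smoothed n-gram distribution per keyword.
    phi = [
        [(n_kt[k][t] + beta) / (n_k[k] + vocab_size * beta) for t in range(vocab_size)]
        for k in range(num_keywords)
    ]
    return z, phi
```

Each row of `phi` sums to 1, so a row can be read as the distribution over dictionary n-grams that characterizes one protocol keyword.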

Together with the semantic-sensitive network protocol identification method above, the present invention also discloses a semantic-sensitive network protocol identification system. The system consists of three parts, a modeling unit, a training unit and a classification unit, which correspond to the modeling stage, the training stage and the classification stage respectively. The system architecture is shown in Figure 5.

1. Modeling unit: Takes a set of network data packets of a specific application protocol as input and constructs a keyword model of the analyzed protocol using the Latent Dirichlet Allocation model. The output of this unit is the protocol keyword model of the analyzed protocol.

2. Training unit: Extracts classification features from the data packets according to the protocol keyword model obtained by the modeling unit. Taking the keyword feature vectors produced by the feature extraction module as input, it trains a supervised machine learning method on an offline training data set to obtain a classification model of the analyzed protocol.

3. Classification unit: Extracts classification features from each data packet according to the protocol keyword model obtained by the modeling unit and uses the protocol detection model output by the training unit (that is, the classification model above) to determine the protocol of the network data packet under test. There are two possible outputs: the packet belongs to the target protocol, or it does not.

In the validation experiments, the DNS and FTP protocols were each tested with different values of W, the total number of n-gram elements, and precision, recall and F-measure were compared across different supervised learning algorithms. Given an application protocol to be analyzed by the system, the following four data sets are first defined:

●True Positives (TP): The set of network packets that the system identifies as belonging to the protocol and that do belong to it.

●False Positives (FP): The set of network packets that the system identifies as belonging to the protocol but that do not belong to it.

●False Negatives (FN): The set of network packets that the system identifies as not belonging to the protocol but that actually belong to it.

●True Negatives (TN): The set of network packets that the system identifies as not belonging to the protocol and that indeed do not belong to it.

Based on the four data sets above, the invention uses three evaluation metrics common in machine learning, precision, recall and F-measure, to evaluate the effectiveness and reliability of the system. The three metrics are defined as follows:
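The definitions themselves are not reproduced in the text. The conventional forms, written in terms of the sizes of the sets above, are sketched below; taking the F-measure to be the balanced (F1) harmonic mean is an assumption.

```latex
\text{precision} = \frac{|TP|}{|TP| + |FP|}, \qquad
\text{recall} = \frac{|TP|}{|TP| + |FN|}, \qquad
F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```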

Because precision and recall each describe a different aspect of system performance, using either one alone as the evaluation metric has limitations. The F-measure is therefore used to consider both together and select the best configuration. The experimental results of the semantic-sensitive network protocol identification method on the DNS and FTP protocols are shown in the tables below.

Table 1: Experimental results for the DNS protocol

Table 2: Experimental results for the FTP protocol

Table 1 shows the experimental results for the DNS protocol. Across the parameter settings, precision for DNS ranges from 94.16% to 99.74%, and recall ranges from 98.21% to 99.85%. The best result for DNS is obtained with the C4.5 decision tree at W = 1000.

Table 2 shows the experimental results for the FTP protocol. Across the parameter settings, precision for FTP ranges from 97.20% to 99.56%, and recall ranges from 87.16% to 97.28%. The best result for FTP is obtained with the C4.5 decision tree at W = 1500.

The above embodiments only illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution or replace it with equivalents without departing from the spirit and scope of the invention. The scope of protection of the invention is defined by the claims.

Claims

1. A semantically sensitive network protocol recognition method, characterized in that, comprises a modeling stage, a training stage and a classification stage;

In the modeling stage, the network data packet collection of a specific application protocol is used as input, and the keyword model of the analyzed protocol is constructed using the Latent DirichletAllocation method;

In the training stage, the classification feature information of the data message is extracted according to the protocol keyword model obtained in the modeling stage, the resulting keyword feature vector is used as input, and a supervised machine learning method is used to learn from and train on the offline training data set, so as to obtain the classification model of the analyzed protocol;

In the classification stage, the classification feature information of the data packet is extracted according to the protocol keyword model obtained in the modeling stage, and the protocol classification model output by the training stage is used to judge the protocol attribute of the network data packet under test and determine whether it is a network data packet of the target protocol;

The modeling phase includes the following steps:

a) Collect network data packets belonging to a specific application protocol, thereby dividing the network data packets into two categories: one is the collection of data packets that belong to the application protocol to be analyzed; the other is the collection of data packets that do not belong to the application protocol to be analyzed;

b) Use the n-gram model to convert each network data message into a network data message whose basic unit is the n-gram element; an n-gram is a subsequence of n consecutive elements of a given sequence;

c) Construct the protocol keyword model of the protocol to be analyzed by using the Latent Dirichlet Allocation method;

The steps to construct a protocol keyword model using the Latent Dirichlet Allocation method include:

1) For every n-gram $w_{(m,i)}$, $m \in [1, M]$, $i \in [1, N_m]$, in the set D containing M data messages, assign a random keyword index number $z_{(m,i)}$. Here $w_{(m,i)}$ denotes the i-th n-gram in data message m, $z_{(m,i)}$ is the keyword index number of that n-gram, and $N_m$ is the number of n-gram elements in data message m;

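2) Let $z_{\neg(m,i)}$ denote the keyword index numbers of all n-grams other than $z_{(m,i)}$. With $z_{\neg(m,i)}$ held constant, generate a new keyword index value $z_{(m,i)}$ for n-gram $w_{(m,i)}$ by sampling from the posterior probability distribution

$$p\left(z_{(m,i)} = k \mid z_{\neg(m,i)}, w\right) \propto \frac{n_{k,\neg(m,i)}^{(t)} + \beta}{\sum_{t'=1}^{W} n_{k,\neg(m,i)}^{(t')} + W\beta} \left(n_{m,\neg(m,i)}^{(k)} + \alpha\right), \quad t = w_{(m,i)},$$

where α and β are given hyperparameters, $n_k^{(t)}$ is the number of times element t of the n-gram dictionary is assigned to keyword k, $n_m^{(k)}$ is the number of occurrences of keyword k in data message m, the subscript $\neg(m,i)$ indicates that the current position (m, i) is excluded from the count, and W is the number of n-gram elements in the n-gram dictionary;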

3) Using the value of $z_{(m,i)}$ obtained by Gibbs sampling, update the out-of-date values in the posterior probability distribution;

4) Repeat the above sampling operation for all tuples (m, i) in the data set. If the Gibbs sampling convergence condition L is reached, the algorithm stops and returns the final keyword index numbers $z_{(m,i)}$; otherwise steps 1) to 3) are repeated;

5) Construct the protocol keyword model from the keyword index numbers obtained through steps 1) to 4), where K represents the number of protocol keywords.
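The modeling steps of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it uses the standard collapsed Gibbs update for Latent Dirichlet Allocation with symmetric hyperparameters, a fixed iteration budget `iters` stands in for the convergence condition L, and the final keyword-to-n-gram distribution `phi` is the usual LDA point estimate, assumed here rather than taken from the claim. The names `ngrams` and `gibbs_lda` are hypothetical.

```python
import random


def ngrams(payload: bytes, n: int = 3) -> list[bytes]:
    """Step b): all subsequences of n consecutive bytes of the payload."""
    return [payload[i:i + n] for i in range(len(payload) - n + 1)]


def gibbs_lda(docs, K, W, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of messages, each a list of n-gram ids in [0, W).
    Returns (z, phi): keyword index per n-gram, and a K x W matrix where
    phi[k][t] estimates the probability of n-gram t under keyword k.
    """
    rng = random.Random(seed)
    n_kt = [[0] * W for _ in range(K)]  # times n-gram t is assigned to keyword k
    n_mk = [[0] * K for _ in docs]      # occurrences of keyword k in message m
    n_k = [0] * K                       # total assignments to keyword k
    z = []

    # Step 1): assign a random keyword index to every n-gram.
    for m, doc in enumerate(docs):
        z_m = []
        for t in doc:
            k = rng.randrange(K)
            z_m.append(k)
            n_kt[k][t] += 1
            n_mk[m][k] += 1
            n_k[k] += 1
        z.append(z_m)

    # Steps 2) to 4): resample each z[m][i] with all other assignments fixed.
    for _ in range(iters):  # assumption: fixed budget replaces condition L
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[m][i]
                # Exclude the current position from the counts.
                n_kt[k][t] -= 1
                n_mk[m][k] -= 1
                n_k[k] -= 1
                weights = [
                    (n_kt[j][t] + beta) / (n_k[j] + W * beta) * (n_mk[m][j] + alpha)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                # Step 3): update the counts with the new assignment.
                z[m][i] = k
                n_kt[k][t] += 1
                n_mk[m][k] += 1
                n_k[k] += 1

    # Step 5): keyword model as per-keyword n-gram distributions (assumed form).
    phi = [[(n_kt[k][t] + beta) / (n_k[k] + W * beta) for t in range(W)]
           for k in range(K)]
    return z, phi
```

Each row of `phi` sums to 1, so it can be read as a probability distribution over the n-gram dictionary for one protocol keyword.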

2. The method according to claim 1, characterized in that, when generating the data message n-grams, only the W n-gram elements with the highest statistical frequency are selected to form the n-gram dictionary.
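The dictionary selection of claim 2 can be sketched as follows, assuming byte-level n-grams and frequency counted over the whole packet collection. The function names `build_dictionary` and `encode` are hypothetical, and dropping out-of-dictionary n-grams during encoding is an assumption of this sketch.

```python
from collections import Counter


def build_dictionary(packets, n=3, W=1000):
    """Map the W most frequent n-grams in the packets to ids 0..W-1."""
    counts = Counter()
    for payload in packets:
        counts.update(payload[i:i + n] for i in range(len(payload) - n + 1))
    return {gram: idx for idx, (gram, _) in enumerate(counts.most_common(W))}


def encode(payload, vocab, n=3):
    """Convert a packet to dictionary ids, skipping n-grams not in the dictionary."""
    grams = (payload[i:i + n] for i in range(len(payload) - n + 1))
    return [vocab[g] for g in grams if g in vocab]


# Illustrative payloads only (assumption, not data from the experiments).
vocab = build_dictionary([b"USER a", b"USER b", b"PASS c"], n=4, W=2)
print(sorted(vocab))
```

Limiting the dictionary to W entries bounds the size of the count tables used during Gibbs sampling, which is why W appears as a tunable parameter in the experiments.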

3. The method according to claim 1, wherein the specific steps of the training phase include:

1) Data collection, the same as the operation of step a) of the modeling stage;

2) Generation of data message n-grams, the same as the operation of step b) of the modeling stage;

3) Classification feature extraction from the network data message according to the protocol keyword model obtained in step c) of the modeling stage;

4) Using a supervised learning method to construct a binary classifier for the analyzed application protocol according to the extracted classification features.
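The training steps of claim 3 can be sketched as follows. Two assumptions are made and should be read as such: the keyword feature vector is taken to be the normalized count of each n-gram's most probable keyword under a keyword model `phi` (a K x W matrix), and a one-level decision stump stands in for the supervised learner (the experiments report a C4.5 decision tree, which is not reproduced here). The names `keyword_features`, `train_stump` and `predict` are hypothetical.

```python
def keyword_features(term_ids, phi):
    """Step 3): share of the message's n-grams attributed to each keyword."""
    K = len(phi)
    counts = [0] * K
    for t in term_ids:
        best = max(range(K), key=lambda k: phi[k][t])  # most probable keyword
        counts[best] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def train_stump(X, y):
    """Step 4): fit a binary classifier (feature, threshold, polarity).

    X: list of feature vectors; y: list of bools (True = target protocol).
    """
    best = None
    for f in range(len(X[0])):
        values = sorted({x[f] for x in X})
        # Candidate thresholds are midpoints between distinct feature values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            for polarity in (True, False):
                errors = sum(((x[f] > thr) == polarity) != label
                             for x, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, f, thr, polarity)
    if best is None:
        raise ValueError("features are constant; no split is possible")
    return best[1:]


def predict(stump, x):
    """Return True if feature vector x is classified as the target protocol."""
    f, thr, polarity = stump
    return (x[f] > thr) == polarity
```

A full decision tree would apply the same split search recursively to each side of the chosen threshold.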

4. The method according to claim 3, wherein the specific steps of the classification phase include:

1) Generation of data message n-grams, the same as step 2) of the training phase;

2) Feature extraction, the same as step 3) of the training phase;

3) According to the classification feature vector obtained in steps 1) and 2) and the classification model obtained in step 4) of the training phase, determine the protocol type of the unlabeled network data message.
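The classification steps of claim 4 can be sketched end to end as follows. The sketch assumes a dictionary `vocab` mapping n-grams to ids, a keyword model `phi` (K x W), and a trained binary `classifier` passed in as a callable; the feature definition (normalized count of each n-gram's most probable keyword) is likewise an assumption, and the name `classify_packet` is hypothetical.

```python
def classify_packet(payload, vocab, phi, classifier, n=3):
    """Return True if the payload is judged to belong to the target protocol."""
    # Step 1): n-gram generation, restricted to the n-gram dictionary.
    grams = (payload[i:i + n] for i in range(len(payload) - n + 1))
    term_ids = [vocab[g] for g in grams if g in vocab]

    # Step 2): keyword feature vector from the protocol keyword model.
    K = len(phi)
    counts = [0] * K
    for t in term_ids:
        counts[max(range(K), key=lambda k: phi[k][t])] += 1
    total = sum(counts) or 1
    features = [c / total for c in counts]

    # Step 3): apply the classification model from the training phase.
    return classifier(features)


# Illustrative inputs only (assumptions, not values from the experiments).
vocab = {b"GET": 0, b"ET ": 1}
phi = [[0.9, 0.1], [0.1, 0.9]]
print(classify_packet(b"GET /", vocab, phi, lambda f: f[0] >= 0.5))
```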

5. The method according to claim 1, wherein the network protocol to be tested is a connection-oriented protocol and/or a connectionless protocol.

6. A semantic-sensitive network protocol identification system based on the method according to claim 1, characterized in that it comprises:

A modeling unit, which takes the collection of network data packets of a specific application protocol as input and uses the Latent Dirichlet Allocation model to construct the keyword model of the analyzed protocol;

A training unit, which extracts the classification feature information of the data message according to the protocol keyword model obtained by the modeling unit, takes the resulting keyword feature vector as input, and uses a supervised machine learning method to learn from and train on the offline training data set, so as to obtain the classification model of the analyzed protocol;

A classification unit, which extracts the classification feature information of the data message according to the protocol keyword model obtained by the modeling unit, uses the protocol classification model output by the training unit to judge the protocol attribute of the network data message under test, and determines whether it is a network data packet of the target protocol.