CN104376261B

CN104376261B - Method for automatically detecting malicious processes in a forensic scenario

Info

Publication number: CN104376261B
Application number: CN201410705875.1A
Authority: CN
Inventors: 伏晓; 端恒; 端一恒; 骆斌
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2014-11-27
Filing date: 2014-11-27
Publication date: 2017-04-05
Anticipated expiration: 2034-11-27
Also published as: CN104376261A

Abstract

The invention provides a method for automatically detecting malicious processes in a forensic scenario based on the dynamic link library (DLL) data of processes. The method includes the following steps: 1) establishing a mapping from DLL data to N-tuples; 2) computing a relatively optimal set of key DLLs with a greedy algorithm; 3) building a recognition model with the hidden naive Bayes method and performing detection. Compared with existing malware detection methods, the invention automatically identifies malware processes among many unknown processes without a malware signature library or related experience, and can use a validation set for correction when the training set and the set to be identified come from different sources. In addition, the invention can process raw DLL data from different sources. The invention is particularly suitable for large-scale automated malicious process detection without prior knowledge.

Description

A Method for Automatic Detection of Malicious Processes in Forensic Scenarios

Technical Field

The invention relates to the fields of malicious process identification and computer forensics, and in particular to a method for automatically detecting malicious processes in a forensic scenario based on the dynamic link library (DLL) data of processes.

Background

With the rapid development of the national economy and society, the level of informatization across all industries in China keeps rising. Against this background, malicious computer programs are growing in number and appearing more frequently, which makes efficient automated detection of such programs particularly important. At present the field still relies mainly on signatures of malicious processes and on human experience; few methods address automatic identification without a signature library.

Summary of the Invention

The purpose of the invention is to provide a method for automatically detecting malicious processes in a forensic scenario based on the DLL data of processes. The method automatically identifies malware processes among many unknown processes without a malware signature library or related experience, and can process raw DLL data from different sources. The invention is particularly suitable for large-scale automated malicious process detection without prior knowledge.

To achieve the above purpose, the invention proposes a method for automatically detecting malicious processes in a forensic scenario based on the DLL data of processes. The method includes the following steps:

1) Establish a mapping from DLL data to N-tuples;

Definition 1: An N-tuple is a sequence of length N consisting of 0s and 1s, where N is a non-negative integer;

Definition 2: The flag bit is a special bit appended to the end of an N-tuple. It indicates whether the process represented by the tuple is malicious, and it is used in the malware identification process;

To let the recognition algorithm analyze and process DLL data quickly, the DLL data of each process must be mapped to a data structure, namely the N-tuple of Definition 1. The DLL set serves as the reference for the mapping: for a DLL set containing N DLLs, the corresponding data structure is an (N+1)-tuple that includes the flag bit of Definition 2. The data structure for a process in the set to be identified has no flag bit and is therefore an N-tuple;

The mapping is as follows:

a. Set every bit of the tuple to 0;

b. Traverse the reference DLL set. For each DLL, search the DLL data of the process; if the DLL is present, set the corresponding position in the record of that process to 1;

c. If the process belongs to the training set or the validation set, it is known whether it is malicious: if it is, set the flag bit to 1. If the process belongs to the set to be identified (the test set), remove the flag bit;
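Mapping steps a to c can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `map_process_to_tuple`, its parameters, and the case-insensitive comparison of DLL names are assumptions introduced here.

```python
from typing import Iterable, Optional, Sequence, Tuple


def map_process_to_tuple(
    loaded_dlls: Iterable[str],
    key_dlls: Sequence[str],
    is_malicious: Optional[bool] = None,
) -> Tuple[int, ...]:
    """Map the DLL list of one process to an N-tuple, or to an (N+1)-tuple with a flag bit."""
    # Assumption of this sketch: DLL names are compared case-insensitively.
    loaded = {name.lower() for name in loaded_dlls}
    # Steps a and b: every bit starts at 0 and becomes 1 if the key DLL is loaded.
    bits = [1 if dll.lower() in loaded else 0 for dll in key_dlls]
    # Step c: labelled processes (training or validation set) get a trailing flag bit;
    # processes to be identified (is_malicious=None) get none.
    if is_malicious is not None:
        bits.append(1 if is_malicious else 0)
    return tuple(bits)
```

Passing `is_malicious=None` models a process from the set to be identified, so the result has length N instead of N+1.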

2) Compute a relatively optimal set of key DLLs with a greedy algorithm;

Once the mapping from the data of N DLLs to N-tuples is established, DLLs must be selected to form the DLL set; the selected DLLs are called key DLLs. The choice of this set affects the resulting detection model and therefore the recognition accuracy;

Definition 3: The theoretically optimal DLL set is a set such that no other set, modeled with the same algorithm on the same training set, performs better on the same test set. It depends on the choice of training set, test set and algorithm;

Definition 4: The relatively optimal DLL set is often not equal to the theoretically optimal DLL set, but it performs relatively well on similar training and test sets. In addition, it can be obtained in a bounded number of steps with controllable computational complexity;

Key DLLs describe properties common to the processes in each class, so DLLs specific to a single process cannot be selected as key DLLs. Without prior knowledge, the first candidates to examine are the Windows system DLLs, but they are not used directly as key DLLs, because a "dimension explosion" would otherwise occur when building the model. Since DLLs do not occur with equal probability in the samples, the key DLL set is initialized from sample statistics: the DLLs that occur in both the regular software processes and the malware processes of the training set are collected, and together they form the set I;
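The candidate set I can be computed as in the following sketch. It assumes each training process is represented as a set of DLL names; the function name `candidate_set_I` and the sorted-list return value are illustrative choices, not part of the patent.

```python
from typing import Iterable, List, Set


def candidate_set_I(benign: Iterable[Set[str]], malicious: Iterable[Set[str]]) -> List[str]:
    """Return the DLLs seen in at least one regular process AND at least one malware process."""
    seen_in_benign = set().union(*benign)
    seen_in_malicious = set().union(*malicious)
    # Sorting only makes the result deterministic; the patent treats I as a set.
    return sorted(seen_in_benign & seen_in_malicious)
```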

Higher recognition accuracy usually requires more training data. If the DLL data of the processes to be identified comes from a different source than the DLL data used for training, the method first enumerates combinations of lower and upper bounds on the DLL occurrence ratio and uses feedback from a validation set to obtain, as a starting point, the DLL set that performs best among the tested combinations. The validation set must come from the same source as the data to be identified. In this way an excessively large space of possible combinations is mapped onto a finite, computable range;

The specific operations are as follows:

a. Enumerate with a set step (0.5%) and a set scheme (100 lower bounds from 0 to 49.5% and 101 upper bounds from 50% to 100%), giving 100 × 101 = 10100 combinations of lower and upper bounds;

b. Each combination corresponds to one key DLL set, computed as follows: if the occurrence probability of a DLL in both the regular software processes and the malware processes of the training set lies between the lower and upper bounds of the combination, the DLL is put into that key DLL set;

c. Test the key DLL sets corresponding to these combinations one by one on the validation set to find the relatively optimal key DLL set S;
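Operations a and b can be sketched as follows. The sketch assumes that the occurrence ratio is measured separately in the regular and the malware processes and that both ratios must fall within the bounds, which is one reading of the text. All function names are hypothetical. Operation c additionally needs a model and a validation set and is not shown.

```python
from typing import Iterator, List, Sequence, Set, Tuple


def bound_combinations() -> Iterator[Tuple[float, float]]:
    """Yield (lower, upper) bounds in percent, using a step of 0.5%."""
    for lo in range(0, 100):        # 100 lower bounds: 0.0, 0.5, ..., 49.5
        for hi in range(100, 201):  # 101 upper bounds: 50.0, 50.5, ..., 100.0
            yield lo / 2, hi / 2


def occurrence_ratio(dll: str, processes: Sequence[Set[str]]) -> float:
    """Percentage of the given processes that load the DLL."""
    if not processes:
        return 0.0
    return 100.0 * sum(dll in p for p in processes) / len(processes)


def key_set_for_bounds(
    candidates: Sequence[str],
    benign: Sequence[Set[str]],
    malicious: Sequence[Set[str]],
    lo: float,
    hi: float,
) -> List[str]:
    """Keep the candidates whose ratio lies within [lo, hi] in both classes (assumed reading)."""
    return [
        dll for dll in candidates
        if lo <= occurrence_ratio(dll, benign) <= hi
        and lo <= occurrence_ratio(dll, malicious) <= hi
    ]
```

Iterating over integer half-percent steps avoids accumulating floating-point error across the 10100 combinations.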

The key DLL set is then optimized by combining a greedy algorithm with validation data: DLLs are added step by step to the candidate key DLL set, a model is built, and the validation set is used for feedback, so that a relatively optimal key DLL set is found progressively;

The specific operations are as follows:

a. Let the relatively optimal key DLL set S contain n elements. Loop over the elements of I that are not in S: add one element to S, build a model and record its result on the validation set, remove the element, then add the next one, and repeat until the loop ends;

b. From these results, take the (n+1)-element set T with the best result on the validation set, replace S with it, and repeat a;

c. The process ends when the result on the validation set no longer improves; the set S at that point is the final set;
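The greedy loop of operations a to c amounts to forward selection, sketched below. The `evaluate` callback is hypothetical: it stands for building a model on the training set with the given key DLL set and returning its score on the validation set. Breaking ties by DLL name is also an assumption of this sketch.

```python
from typing import Callable, FrozenSet, Iterable


def greedy_select(
    initial: Iterable[str],
    candidates: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> FrozenSet[str]:
    """Grow the key DLL set one element at a time while the validation score improves."""
    current = frozenset(initial)
    best_score = evaluate(current)
    pool = sorted(set(candidates))
    while True:
        # Operation a: try every element of I that is not yet in S, one at a time.
        trials = [(evaluate(current | {dll}), dll) for dll in pool if dll not in current]
        if not trials:
            break
        # Operation b: T is the (n+1)-element set with the best validation result.
        score, dll = max(trials)
        # Operation c: stop as soon as the validation result no longer improves.
        if score <= best_score:
            break
        current, best_score = current | {dll}, score
    return current
```

Each round costs one model build per remaining candidate, so the number of model builds grows at most quadratically with the size of I, instead of exponentially as an exhaustive search over subsets would.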

3) Build a recognition model with the hidden naive Bayes method and perform detection:

Once the relatively optimal key DLL set is obtained, a modeling algorithm must be chosen. DLLs are highly coupled and strongly interdependent, and the hidden naive Bayes method is well suited to such coupling. In hidden naive Bayes, each attribute has a hidden parent node that expresses the influence of the other attributes on that attribute;

With hidden naive Bayes chosen, a model is built on the training set, and the processes in the set to be identified are converted to tuples as in step 1). The method determines the flag bit of each process in the set to be identified by computing conditional probabilities: a flag bit of 1 denotes a malicious process and 0 a regular process.
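The text does not spell out the conditional probabilities involved. Assuming the formulation of hidden naive Bayes commonly given in the machine-learning literature, the flag bit of a tuple x = (a_1, ..., a_N) would be chosen as follows, where a_{hp_i} denotes the hidden parent of attribute A_i, C is the flag bit, and the weights W_{ij} are normalised conditional mutual information:

```latex
\[
\begin{aligned}
c(x) &= \arg\max_{c \in \{0,1\}} P(c) \prod_{i=1}^{N} P(a_i \mid a_{hp_i}, c), \\
P(a_i \mid a_{hp_i}, c) &= \sum_{j=1,\, j \neq i}^{N} W_{ij}\, P(a_i \mid a_j, c), \\
W_{ij} &= \frac{I_P(A_i; A_j \mid C)}{\sum_{k=1,\, k \neq i}^{N} I_P(A_i; A_k \mid C)}, \\
I_P(A_i; A_j \mid C) &= \sum_{a_i, a_j, c} P(a_i, a_j, c) \log \frac{P(a_i, a_j \mid c)}{P(a_i \mid c)\, P(a_j \mid c)}.
\end{aligned}
\]
```

Under this reading, the hidden parent is a weighted mixture of one-dependence estimates, so a DLL that is strongly coupled with A_i contributes more to its conditional probability than an unrelated one.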

Further, the specific steps of step 1) above are as follows:

Step 1)-1: Initial state;

Step 1)-2: Set every bit of the tuple to 0;

Step 1)-3: Traverse the DLL set. For each DLL, search the DLL data of the target process; if the DLL is present, set the corresponding position in the record of the target process to 1;

Step 1)-4: Traverse each process and determine whether it belongs to the training set or the validation set. If so, go to 1)-5; otherwise it belongs to the set to be identified, go to 1)-6;

Step 1)-5: If the process is malicious, set the flag bit to 1. If the traversal is complete, go to 1)-7; otherwise continue with 1)-4;

Step 1)-6: Remove the flag bit. If the traversal is complete, go to 1)-7; otherwise continue with 1)-4;

Step 1)-7: The mapping from DLL data to N-tuples is established.

Further, the specific steps of step 2) above are as follows:

Step 2)-1: Initial state;

Step 2)-2: Collect the DLLs that occur in both the regular software processes and the malware processes of the training set; together they form the set I;

Step 2)-3: Enumerate all combinations of lower and upper bounds with a set step (for example 0.5%) and a set scheme (for example 100 lower bounds from 0 to 49.5% and 101 upper bounds from 50% to 100%);

Step 2)-4: Traverse all combinations from 2)-3 and, for each combination, enumerate all DLLs in I;

Step 2)-5: If the occurrence probability of the DLL in both the regular software processes and the malware processes of the training set lies between the lower and upper bounds of the combination, put the DLL into the set for that combination. If the enumeration is complete, go to 2)-6; otherwise return to 2)-5;

Step 2)-6: For each combination of bounds, test the corresponding set on the validation set and record the result;

Step 2)-7: If the enumeration is complete, find the best set S; otherwise return to 2)-6;

Step 2)-8: Let S contain n elements. Loop over the elements of I that are not in S, adding one at a time to S to obtain the set X;

Step 2)-9: Build a model on X, record its result on the validation set, remove the added element and restore S;

Step 2)-10: If the loop has ended, go to 2)-11; otherwise return to 2)-8;

Step 2)-11: From the recorded results, take the (n+1)-element set T with the best result on the validation set and replace S with it;

Step 2)-12: If the result on the validation set no longer improves in 2)-11, terminate with the relatively optimal DLL set; otherwise return to 2)-8;

Step 2)-13: The computation of the relatively optimal key DLL set with the greedy algorithm is complete.

Further, the specific steps of step 3) above are as follows:

Step 3)-1: Initial state;

Step 3)-2: Convert the processes in the set to be identified and in the selected training set to tuples as in step 1);

Step 3)-3: Build a recognition model on the training set with the hidden naive Bayes method;

Step 3)-4: The hidden naive Bayes method determines the flag bit of each process in the set to be identified by computing conditional probabilities;

Step 3)-5: If the flag bit is 1, go to 3)-6; otherwise go to 3)-7;

Step 3)-6: Classify the process corresponding to the tuple as a malicious process;

Step 3)-7: Classify the process corresponding to the tuple as a regular process;

Step 3)-8: Building the recognition model with the hidden naive Bayes method and performing detection is complete.
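Steps 3)-3 to 3)-7 can be sketched for binary tuples as follows. The patent gives no formulas, so this sketch assumes the hidden naive Bayes formulation common in the literature: each attribute is conditioned on a mixture of one-dependence estimates P(a_i | a_j, c), weighted by normalised conditional mutual information. The class name `HiddenNaiveBayes`, the Laplace smoothing and the toy data are illustrative assumptions, and at least two attributes are required.

```python
import math
from typing import Sequence


class HiddenNaiveBayes:
    """Minimal HNB sketch for 0/1 attributes and a 0/1 flag bit (needs >= 2 attributes)."""

    def fit(self, X: Sequence[Sequence[int]], y: Sequence[int]) -> "HiddenNaiveBayes":
        n, total = len(X[0]), len(X)
        self.n = n
        rows = {c: [x for x, label in zip(X, y) if label == c] for c in (0, 1)}
        self.prior = {c: (len(rows[c]) + 1) / (total + 2) for c in (0, 1)}
        # joint[c][i][j][a][b]: Laplace-smoothed estimate of P(A_i=a, A_j=b | C=c)
        self.joint = {
            c: [[[[(sum(x[i] == a and x[j] == b for x in rows[c]) + 1) / (len(rows[c]) + 4)
                   for b in (0, 1)] for a in (0, 1)]
                 for j in range(n)] for i in range(n)]
            for c in (0, 1)
        }
        # W[i][j]: conditional mutual information I(A_i; A_j | C), normalised over j != i
        self.W = []
        for i in range(n):
            mi = [0.0 if j == i else self._cmi(i, j) for j in range(n)]
            s = sum(mi)
            self.W.append([m / s if s > 0 else (0.0 if j == i else 1 / (n - 1))
                           for j, m in enumerate(mi)])
        return self

    def _cmi(self, i: int, j: int) -> float:
        value = 0.0
        for c in (0, 1):
            q = self.joint[c][i][j]
            for a in (0, 1):
                for b in (0, 1):
                    pa = q[a][0] + q[a][1]  # P(A_i=a | c)
                    pb = q[0][b] + q[1][b]  # P(A_j=b | c)
                    value += self.prior[c] * q[a][b] * math.log(q[a][b] / (pa * pb))
        return max(value, 0.0)  # guard against tiny negative rounding errors

    def predict(self, x: Sequence[int]) -> int:
        """Return the flag bit: 1 for a malicious process, 0 for a regular process."""
        scores = {}
        for c in (0, 1):
            log_p = math.log(self.prior[c])
            for i in range(self.n):
                mix = 0.0
                for j in range(self.n):
                    if j != i:
                        q = self.joint[c][i][j]
                        b = x[j]
                        # W_ij * P(A_i = x_i | A_j = x_j, c)
                        mix += self.W[i][j] * q[x[i]][b] / (q[0][b] + q[1][b])
                log_p += math.log(mix)
            scores[c] = log_p
        return 1 if scores[1] > scores[0] else 0


# Toy usage: columns are key DLLs, the label is the flag bit of the training tuples.
train_X = [(1, 1, 0), (1, 1, 1), (1, 1, 0), (1, 1, 1),
           (0, 0, 0), (0, 0, 1), (0, 0, 0), (0, 0, 1)]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]
model = HiddenNaiveBayes().fit(train_X, train_y)
flag = model.predict((1, 1, 0))
```

The sketch recomputes pairwise counts naively, which is adequate for a few dozen key DLLs; a larger key DLL set would call for precomputed count tables.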

Beneficial effects: the invention provides a method for automatically detecting malicious processes in a forensic scenario based on the DLL data of processes. Compared with existing malware detection methods, the invention automatically identifies malware processes among many unknown processes without a malware signature library or related experience, and can use a validation set for correction when the training set and the set to be identified come from different sources. In addition, the invention can process raw DLL data from different sources. The invention is particularly suitable for large-scale automated malicious process detection without prior knowledge. Practice has shown that in ordinary application scenarios the method achieves an accuracy above 90% while taking only a few seconds.

Description of Drawings

FIG. 1 is a flowchart of a method for automatically detecting malicious processes in a forensic scenario based on the DLL data of processes, according to an embodiment of the invention.

FIG. 2 is a flowchart of establishing the mapping from DLL data to N-tuples in FIG. 1.

FIG. 3 is a flowchart of computing the relatively optimal key DLL set with the greedy algorithm in FIG. 1.

FIG. 4 is a flowchart of building the recognition model with the hidden naive Bayes method and performing detection in FIG. 1.

Detailed Description

To make the technical content of the invention easier to understand, specific embodiments are described below with reference to the accompanying drawings.

FIG. 1 is a flowchart of a method for automatically detecting malicious processes in a forensic scenario based on the DLL data of processes, according to an embodiment of the invention.

A method for automatically detecting malicious processes in a forensic scenario based on the DLL data of processes, characterized in that it comprises the following steps:

S101: Establish a mapping from DLL data to N-tuples.

The mapping is as follows:

a. Set every bit of the tuple to 0;

S103: Compute a relatively optimal key DLL set with a greedy algorithm.

The specific operations are as described for step 2) above.

S105: Build a recognition model with the hidden naive Bayes method and perform detection.

With hidden naive Bayes chosen, a model is built on the training set, and the processes in the set to be identified are converted to tuples as in S101. The method determines the flag bit of each process in the set to be identified by computing conditional probabilities: a flag bit of 1 denotes a malicious process and 0 a regular process.

FIG. 2 is a flowchart of establishing the mapping from DLL data to N-tuples.

The mapping is as follows:

a. Set every bit of the tuple to 0;

The specific steps are as follows:

Step 1: Initial state;

Step 2: Set every bit of the tuple to 0;

Step 3: Traverse the DLL set. For each DLL, search the DLL data of the target process; if the DLL is present, set the corresponding position in the record of the target process to 1;

Step 4: Traverse each process and determine whether it belongs to the training set or the validation set. If so, go to 5; otherwise it belongs to the set to be identified, go to 6;

Step 5: If the process is malicious, set the flag bit to 1. If the traversal is complete, go to 7; otherwise continue with 4;

Step 6: Remove the flag bit. If the traversal is complete, go to 7; otherwise continue with 4;

Step 7: The mapping from DLL data to N-tuples is established.

FIG. 3 is a flowchart of computing the relatively optimal key DLL set with the greedy algorithm. Once the mapping from the data of N DLLs to N-tuples is established, DLLs must be selected to form the DLL set; the selected DLLs are called key DLLs. The choice of this set affects the resulting detection model and therefore the recognition accuracy;

The specific steps are as follows:

Step 1: Initial state;

Step 2: Collect the DLLs that occur in both the regular software processes and the malware processes of the training set; together they form the set I;

Step 3: Enumerate all combinations of lower and upper bounds with a set step (for example 0.5%) and a set scheme (for example 100 lower bounds from 0 to 49.5% and 101 upper bounds from 50% to 100%);

Step 4: Traverse all combinations from 3 and, for each combination, enumerate all DLLs in I;

Step 5: If the occurrence probability of the DLL in both the regular software processes and the malware processes of the training set lies between the lower and upper bounds of the combination, put the DLL into the set for that combination. If the enumeration is complete, go to 6; otherwise return to 5;

Step 6: For each combination of bounds, test the corresponding set on the validation set and record the result;

Step 7: If the enumeration is complete, find the best set S; otherwise return to 6;

Step 8: Let S contain n elements. Loop over the elements of I that are not in S, adding one at a time to S to obtain the set X;

Step 9: Build a model on X, record its result on the validation set, remove the added element and restore S;

Step 10: If the loop has ended, go to 11; otherwise return to 8;

Step 11: From the recorded results, take the (n+1)-element set T with the best result on the validation set and replace S with it;

Step 12: If the result on the validation set no longer improves in 11, terminate with the relatively optimal DLL set; otherwise return to 8;

Step 13: The computation of the relatively optimal key DLL set with the greedy algorithm is complete.

FIG. 4 is a flowchart of building the recognition model with the hidden naive Bayes method and performing detection. Once the relatively optimal key DLL set is obtained, a modeling algorithm must be chosen. DLLs are highly coupled and strongly interdependent, and the hidden naive Bayes method is well suited to such coupling. In hidden naive Bayes, each attribute has a hidden parent node that expresses the influence of the other attributes on that attribute;

With hidden naive Bayes chosen, a model is built on the training set, and the processes in the set to be identified are converted to tuples as shown in FIG. 2. The method determines the flag bit of each process in the set to be identified by computing conditional probabilities: a flag bit of 1 denotes a malicious process and 0 a regular process.

The specific steps are as follows:

Step 1: Initial state;

Step 2: Convert the processes in the set to be identified and in the selected training set to tuples as shown in FIG. 2;

Step 3: Build a recognition model on the training set with the hidden naive Bayes method;

Step 4: The hidden naive Bayes method determines the flag bit of each process in the set to be identified by computing conditional probabilities;

Step 5: If the flag bit is 1, go to 6; otherwise go to 7;

Step 6: Classify the process corresponding to the tuple as a malicious process;

Step 7: Classify the process corresponding to the tuple as a regular process;

Step 8: Building the recognition model with the hidden naive Bayes method and performing detection is complete.

Although the invention has been disclosed above by way of preferred embodiments, these are not intended to limit the invention. Those of ordinary skill in the art to which the invention belongs may make various changes and modifications without departing from its spirit and scope. The scope of protection of the invention is therefore defined by the claims.

Claims

1. A method for automatically detecting malicious processes in a forensic scenario, characterized in that it comprises the following steps:

1) Establish a mapping from dynamic link library (DLL) data to N-tuples;

Definition 1: An N-tuple is a sequence of length N consisting of 0s and 1s, where N is a non-negative integer;

Definition 2: The flag bit is a special bit appended to the end of an N-tuple. It indicates whether the process represented by the tuple is malicious, and it is used in the malware identification process;

To let the recognition algorithm analyze and process DLL data quickly, the DLL data of each process must be mapped to a data structure, namely the N-tuple of Definition 1. The DLL set serves as the reference for the mapping: for a DLL set containing N DLLs, the corresponding data structure is an (N+1)-tuple that includes the flag bit of Definition 2. The data structure for a process in the set to be identified has no flag bit and is therefore an N-tuple;

The mapping is as follows:

a. Set every bit of the tuple to 0;

b. Traverse the reference DLL set. For each DLL, search the DLL data of the process; if the DLL is present, set the corresponding position in the record of that process to 1;

c. If the process belongs to the training set or the validation set, it is known whether it is malicious: if it is, set the flag bit to 1. If the process belongs to the set to be identified (the test set), remove the flag bit;

2) Calculate the relatively optimal key dynamic link library set according to the greedy algorithm;

After the mapping from N dynamic link library data to N-tuples is established, dynamic link libraries need to be selected to form a dynamic link library set, and these selected dynamic link libraries are called key dynamic link libraries; The selection of the set will affect the established detection model and then affect the recognition accuracy;

Definition 3: The theoretical optimal dynamic link library set is such a set that no other set performs better than this set in the same test set after using the same algorithm to model on the same training set. This theoretical optimal dynamic link library set depends on In the selection of training set, test set and algorithm;

Definition 4: The relatively optimal dynamic link library set is not equal to the theoretical optimal dynamic link library set, but it has a relatively good performance in the case of similar training sets and test sets. In addition, this set can pass controllable operations The complexity is obtained within certain steps;

When selecting key dynamic link libraries, since key dynamic link libraries are used to describe the common attributes of processes in each category, process-related dynamic link libraries cannot be selected as key dynamic link libraries; without prior knowledge, The first thing to examine is the system dynamic link library of windows, which is not directly used as the key dynamic link library, otherwise a "dimension explosion" will occur when building the model; considering that the probability of each dynamic link library in the sample is not even, the The mode that sample carries out statistics initializes key dynamic link library set, counts out the dynamic link library that all occurs in the routine software process of training set and malicious software process, and the set of their combination is I;

In order to improve the accuracy of recognition, more training data is often needed; if the dynamic link library data of the malicious process to be identified is different from the source of the dynamic link library used for training, it is necessary to first enumerate the upper and lower limits of the occurrence ratio of the dynamic link library dll Combination, and in the way of verification set and verification feedback to obtain an optimal dynamic link library set in the combination test as a start, the verification set must come from the same source as the data to be identified; thus mapping a number of possible combinations that are too large into a finitely computable interval;

The specific operations are as follows:

a. With a fixed step size of 0.5%, enumerate the lower limit from 0 to 49.5% (100 values in total) and the upper limit from 50% to 100% (101 values in total), giving 10100 combinations of upper and lower limits in total;

b. Each combination corresponds to a key dynamic link library set, computed as follows: if the occurrence ratio of a dynamic link library lies between the lower and upper limits of the combination in the normal software processes and the malicious software processes of the training set, that dynamic link library is put into the key dynamic link library set;

c. Test the key dynamic link library sets corresponding to these combinations one by one on the verification set to find the relatively optimal key dynamic link library set S;
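Operations a to c above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the representation of each process as a set of DLL names, and the `evaluate` callback (standing in for "build a model and score it on the verification set") are all assumptions introduced here. The sketch also assumes that the limits are inclusive and that a dynamic link library must fall within the limits in both classes, which the text does not state explicitly.

```python
from itertools import product

def occurrence_ratio(dll, processes):
    """Fraction of processes (each a set of DLL names) in which `dll` occurs."""
    return sum(dll in p for p in processes) / len(processes) if processes else 0.0

def threshold_combinations():
    """All (lower, upper) pairs, expressed in integer units of 0.5%."""
    lowers = range(0, 100)     # 0% .. 49.5%, 100 values
    uppers = range(100, 201)   # 50% .. 100%, 101 values
    return list(product(lowers, uppers))

def best_initial_set(candidates, normal, malicious, evaluate):
    """Return the key DLL set (and its score) that scores best under `evaluate`.

    `evaluate` is a hypothetical callback: it takes a frozenset of DLL names
    and returns a score measured on the verification set.
    """
    ratios = {d: (occurrence_ratio(d, normal), occurrence_ratio(d, malicious))
              for d in candidates}
    cache = {}  # many threshold pairs produce the same set; score each set once
    best_set, best_score = frozenset(), float("-inf")
    for lo, hi in threshold_combinations():
        # Assumption: inclusive limits, applied to both classes.
        key_set = frozenset(d for d, (rn, rm) in ratios.items()
                            if lo <= rn * 200 <= hi and lo <= rm * 200 <= hi)
        if key_set not in cache:
            cache[key_set] = evaluate(key_set)
        if cache[key_set] > best_score:
            best_set, best_score = key_set, cache[key_set]
    return best_set, best_score
```

Working in integer units of 0.5% avoids floating-point drift when stepping the limits, and caching by set keeps the number of model builds well below 10100 whenever different limit pairs select the same dynamic link libraries.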

A greedy algorithm combined with verification data is used to optimize the key dynamic link library set; by adding dynamic link libraries one at a time to the candidate key dynamic link library set, modeling, and using the results on the verification set as feedback, a relatively optimal key dynamic link library set is gradually found;

The specific operations are as follows:

a. Assuming that the relatively optimal key dynamic link library set S has n elements, loop over the elements of set I that are not in S: add one element to S at a time, build a model and obtain the result on the verification set, remove the element, and add the next one, repeating until the loop ends;

b. From these results, obtain the set T with n+1 elements that performs best on the verification set, replace S with it, and repeat a;

c. The process ends when the results on the verification set no longer improve, at which point the final set S is obtained;
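The greedy refinement in operations a to c can be illustrated with a short forward-selection loop. This is a sketch under stated assumptions: `greedy_refine` and its `evaluate` callback (a stand-in for modeling on a candidate set and scoring on the verification set) are hypothetical names, and "no longer improved" is interpreted here as "no single added element yields a strictly higher score".

```python
def greedy_refine(start, candidates, evaluate):
    """Grow `start` one DLL at a time while the verification score improves.

    `start` plays the role of S, `candidates` the role of I, and `evaluate`
    is a hypothetical callback returning a verification-set score.
    """
    current = frozenset(start)
    current_score = evaluate(current)
    while True:
        best_trial, best_score = None, current_score
        # Operation a: try each element of I that is not yet in S.
        for dll in sorted(set(candidates) - current):
            trial = current | {dll}          # S itself is left untouched
            score = evaluate(trial)
            if score > best_score:
                best_trial, best_score = trial, score
        # Operation c: stop when no n+1 element set beats S.
        if best_trial is None:
            return current, current_score
        # Operation b: replace S with the best n+1 element set T.
        current, current_score = best_trial, best_score
```

Each round costs at most |I| - n model builds, so the total work stays polynomial in the size of I rather than exploring all subsets.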

3) Use the hidden naive Bayesian method to establish a recognition model and perform detection:

After obtaining the relatively optimal set of key dynamic link libraries, the modeling algorithm must be determined; because dynamic link libraries are highly coupled and strongly interdependent, the hidden naive Bayes method, which is suited to highly coupled attributes, is used; in the hidden naive Bayes method, each attribute has a hidden parent node that expresses the influence of the other attributes on that attribute;

Once the hidden naive Bayes method has been chosen, the training set is used to build a model, and the processes in the recognition set are converted into tuples according to the method in 1); the hidden naive Bayes method then determines the flag bit of each process in the recognition set by calculating conditional probabilities; a flag bit of 1 indicates a malicious process, and 0 indicates a normal process.
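For reference, the hidden naive Bayes classifier is usually formulated in the machine-learning literature as shown below. The claims do not spell out these formulas, so this is offered as an assumption about what the method refers to. Here $a_i$ is the value of the $i$-th tuple bit, $c$ is the class (the flag bit), $a_{hp_i}$ is the hidden parent of attribute $A_i$, and $I_P(A_i; A_j \mid C)$ is the conditional mutual information estimated from the training set.

```latex
c(E) = \arg\max_{c \in \{0, 1\}} \; P(c) \prod_{i=1}^{n} P(a_i \mid a_{hp_i}, c)

P(a_i \mid a_{hp_i}, c) = \sum_{j=1,\, j \neq i}^{n} W_{ij} \, P(a_i \mid a_j, c)

W_{ij} = \frac{I_P(A_i; A_j \mid C)}{\sum_{k=1,\, k \neq i}^{n} I_P(A_i; A_k \mid C)}
```

Under this formulation the hidden parent is a weighted mixture of one-dependence estimators, which is how the influence of the other dynamic link libraries on each attribute enters the conditional probability.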

2. The method for automatically detecting malicious processes in a forensic scenario according to claim 1, wherein the specific steps of step 1) are as follows:

Step 1)-1: initial state;

Step 1)-2: Set each bit of the tuple to 0;

Step 1)-3: Traverse the dynamic link library set; for each dynamic link library, search for it in the dynamic link library data of the target process, and if it is present, set the position corresponding to that dynamic link library in the record of the target process to 1;

Step 1)-4: Traverse each process and determine whether the process belongs to the training set or the verification set; if so, go to 1)-5; otherwise it belongs to the set to be identified, so go to 1)-6;

Step 1)-5: If the process is a malicious process, set the flag bit to 1; go to 1)-7 once the traversal is complete, otherwise return to 1)-4;

Step 1)-6: Remove the flag bit; go to 1)-7 once the traversal is complete, otherwise return to 1)-4;

Step 1)-7: The mapping from dynamic link library data to N-tuples is established.
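The mapping in steps 1)-2 through 1)-6 can be sketched as a single function. The name `to_tuple`, its parameters, and the case-insensitive comparison of DLL names are assumptions made for illustration; the claim itself only specifies the 0/1 positions and the flag bit.

```python
def to_tuple(process_dlls, key_dlls, label=None):
    """Map one process to an N-tuple of 0/1 bits, plus an optional flag bit.

    process_dlls: DLL names loaded by the process.
    key_dlls:     ordered list of key DLLs; one tuple position per entry.
    label:        True (malicious) or False (normal) for training and
                  verification data; None for processes to be identified,
                  in which case no flag bit is appended.
    """
    # Assumption: names are compared case-insensitively.
    loaded = {name.lower() for name in process_dlls}
    # Positions default to 0 and are set to 1 when the key DLL is present.
    bits = [1 if name.lower() in loaded else 0 for name in key_dlls]
    if label is not None:
        bits.append(1 if label else 0)
    return tuple(bits)

keys = ["kernel32.dll", "user32.dll", "ws2_32.dll"]
print(to_tuple(["KERNEL32.dll", "ws2_32.dll"], keys, label=True))  # (1, 0, 1, 1)
print(to_tuple(["user32.dll"], keys))                              # (0, 1, 0)
```

Keeping `key_dlls` as an ordered list ensures that every process is mapped to the same positions, so tuples from the training set, the verification set, and the set to be identified remain comparable.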

3. The method for automatically detecting malicious processes in a forensic scenario according to claim 1, wherein the specific steps of step 2) are as follows:

Step 2)-1: initial state;

Step 2)-2: Count all dynamic link libraries that occur in the normal software processes and the malicious software processes of the training set; their union is the set I;

Step 2)-3: With a fixed step size of 0.5%, enumerate the lower limit from 0 to 49.5% (100 values in total) and the upper limit from 50% to 100% (101 values in total) to obtain all combinations of upper and lower limits;

Step 2)-4: Traverse all combinations from 2)-3; for each combination, enumerate all dynamic link libraries in I;

Step 2)-5: If the occurrence probability of the dynamic link library lies between the lower and upper limits of the combination in the normal software processes and the malicious software processes of the training set, put the dynamic link library into the set corresponding to that combination; if the enumeration is complete, go to 2)-6, otherwise return to 2)-5;

Step 2)-6: Enumerate the set corresponding to each combination of upper and lower limits, test on the verification set, and record the results;

Step 2)-7: If the enumeration is complete, find the optimal set S, otherwise return to 2)-6;

Step 2)-8: Assuming that the set S has n elements, loop over the elements of set I that are not in S, adding one element to S at a time to obtain the set X;

Step 2)-9: Build a model on X, obtain the result on the verification set, then remove the added element to restore S;

Step 2)-10: If the loop ends, go to 2)-11, otherwise go back to 2)-8;

Step 2)-11: Obtain the set T with n+1 elements that has the best result on the verification set, and replace S with it;

Step 2)-12: If the result on the verification set no longer improves in 2)-11, stop and take S as the relatively optimal dynamic link library set; otherwise return to 2)-8;

Step 2)-13: The calculation of the relatively optimal key dynamic link library set according to the greedy algorithm is complete.

4. The method for automatically detecting malicious processes in a forensic scenario according to claim 1, wherein the specific steps of step 3) are as follows:

Step 3)-1: initial state;

Step 3)-2: Convert the processes in the set to be identified and in the selected training set into tuples in the manner described in 1);

Step 3)-3: Use the hidden naive Bayes method to build a recognition model on the training set;

Step 3)-4: The hidden naive Bayes method calculates conditional probabilities to determine the flag bit of each process in the recognition set;

Step 3)-5: If the flag is 1, go to 3)-6, otherwise go to 3)-7;

Step 3)-6: determine the process corresponding to the tuple as a malicious process;

Step 3)-7: determine the process corresponding to the tuple as a normal process;

Step 3)-8: The recognition model has been established with the hidden naive Bayes method and the detection is complete.