CN116644422A

CN116644422A - Malicious code detection method based on malicious block labeling and image processing

Info

Publication number: CN116644422A
Application number: CN202310606050.3A
Authority: CN
Inventors: 张乾坤; 张伯瑜
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2023-05-23
Filing date: 2023-05-23
Publication date: 2023-08-25

Abstract

The invention discloses a malicious code detection method based on malicious block annotation and image processing, belonging to the field of malicious code detection. The method comprises: (S1) dividing the binary file of the malicious code to be detected into a plurality of basic blocks, detecting whether each basic block is a malicious block, and annotating the locations of the malicious blocks in the binary file, a malicious block being a basic block related to a malicious function; (S2) converting the binary file into a grayscale image and increasing the local contrast of the image regions corresponding to the malicious blocks to obtain a target grayscale image; (S3) inputting the target grayscale image into a trained malicious code classification model to predict the probability that the malicious code belongs to each family category, and determining the family category with the highest probability as the family category to which the malicious code belongs. The invention increases the influence of malicious-function-related content on the classification result, thereby improving the accuracy of malicious code classification.

Description

A Malicious Code Detection Method Based on Malicious Block Annotation and Image Processing

Technical Field

The invention belongs to the field of malicious code detection and, more specifically, relates to a malicious code detection method based on malicious block annotation and image processing.

Background

The network security industry has long worked to prevent malicious code attacks. Attackers can use malicious code to infect victim devices and thereby compromise the confidentiality and integrity of user and enterprise data. Accurately detecting malicious code and taking appropriate countermeasures is therefore of great importance for network security.

Traditionally, malware detection or classification has been performed with signature-based or heuristic methods. Signature-based methods deploy signatures for different malware families and variants as prototypes, so that newly discovered malware files can be classified accordingly; once the family category is determined, countermeasures can be chosen according to the characteristics of that family. In the past few years, Nataraj et al. introduced a static malware analysis technique called malware visualization, which represents the content of a malware binary as an image. Specifically, the raw bytes of the malware binary are read as 8-bit unsigned integers and stored in a vector, the vector is reshaped into a matrix, and the matrix is visualized as a grayscale image.

The malware visualization approach described above addresses the malware classification problem effectively. However, the malicious functions of a piece of malware are embedded among other, non-malicious functions; that is, a considerable part of the malware's content is unrelated to its malicious functions. When the grayscale image converted from the entire binary is classified directly, this unrelated content affects the classification result, and the final classification accuracy cannot be guaranteed.

Summary of the Invention

In view of the defects of the prior art and the need for improvement, the invention provides a malicious code detection method based on malicious block annotation and image processing. Its purpose is to increase the influence of malicious-function-related content on the classification result, so as to improve the accuracy of malicious code classification and facilitate the correct recognition and analysis of unknown malicious code.

To achieve the above object, according to one aspect of the invention, a malicious code detection method based on malicious block annotation and image processing is provided, comprising the following steps:

(S1) dividing the binary file of the malicious code to be detected into a plurality of basic blocks, detecting whether each basic block is a malicious block, and annotating the locations of the malicious blocks in the binary file, a malicious block being a basic block related to a malicious function;

(S2) converting the binary file into a grayscale image, and increasing the local contrast of the image regions corresponding to the malicious blocks to obtain a target grayscale image;

(S3) inputting the target grayscale image into a trained malicious code classification model to predict the probability that the malicious code belongs to each family category, and determining the family category with the highest probability as the family category to which the malicious code belongs;

wherein the malicious code classification model is a neural network model used to predict the probability that the malicious code corresponding to an input grayscale image belongs to each family category.

Further, in step (S1), whether any given basic block is a malicious block is detected as follows:

extracting the code features of the basic block and converting them into a feature vector, the code features including structural features, arithmetic instruction features, transfer instruction features and API call features;

inputting the feature vector into a trained malicious block detection model, which performs feature extraction on the feature vector and reconstructs it to obtain a reconstructed feature;

if the difference between the reconstructed feature output by the malicious block detection model and the feature vector is greater than a preset threshold, determining that the basic block is a malicious block; otherwise, determining that the basic block is not a malicious block;

wherein the malicious block detection model is a neural network model that performs feature extraction on an input basic block and reconstructs it, and is trained as follows:

collecting binary files unrelated to malicious functions, dividing them into basic blocks, and extracting the code features of the basic blocks as benign samples to obtain a benign sample set;

initializing the malicious block detection model and training it on the benign sample set with the objective of minimizing the reconstruction loss; when training ends, the trained malicious block detection model is obtained.
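The decision rule above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the mean-squared-error distance, the function names, and the `reconstruct` callable standing in for the trained model are all assumptions, since the text only requires some measure of difference and a preset threshold.

```python
from typing import Callable, Sequence


def reconstruction_error(x: Sequence[float], x_rec: Sequence[float]) -> float:
    """Mean squared difference between a feature vector and its reconstruction."""
    if len(x) != len(x_rec):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)


def is_malicious_block(
    features: Sequence[float],
    reconstruct: Callable[[Sequence[float]], Sequence[float]],  # hypothetical trained model
    threshold: float,
) -> bool:
    """Flag the block when the model reconstructs its features poorly."""
    return reconstruction_error(features, reconstruct(features)) > threshold
```

A model that reproduces its input exactly yields an error of zero and the block is treated as benign; the threshold itself would have to be chosen empirically.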

Further, the structural features include the number of children of the basic block and its intermediate value; the arithmetic instruction features include the numbers of basic arithmetic, shift and logical operation instructions in the basic block; the transfer instruction features include the numbers of stack operations, register operations and port operations in the basic block; and the API call features include the numbers of calls in the basic block to APIs related to dll, process, service and system information.

Further, the malicious block detection model is an autoencoder model.

Further, in step (S2), the local contrast of the image regions corresponding to the malicious blocks is increased using the contrast-limited adaptive histogram equalization algorithm.

Further, the malicious code classification model is a Vision Transformer model.

Further, in step (S2), converting the binary file into a grayscale image comprises:

in code order, converting every 8 bits as a unit into an unsigned integer, and storing the values in a vector of unsigned integers;

reshaping the vector of unsigned integers into a matrix, treating each matrix element as a pixel and its value as the gray value of that pixel, to obtain the grayscale image.
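The conversion can be sketched in a few lines of Python. The image width and the zero-padding of an incomplete last row are assumptions made for illustration; the text does not say how the matrix dimensions are chosen.

```python
from typing import List


def bytes_to_gray_rows(data: bytes, width: int) -> List[List[int]]:
    """Reshape raw bytes into rows of gray values (0..255), one byte per pixel."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Assumption: pad the final row with zero bytes (black pixels).
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


rows = bytes_to_gray_rows(bytes([0, 255, 16, 32, 7]), width=2)
print(rows)  # [[0, 255], [16, 32], [7, 0]]
```

Iterating over a `bytes` object already yields unsigned 8-bit integers, so no explicit bit manipulation is needed.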

According to another aspect of the invention, a computer-readable storage medium is provided, comprising a stored computer program; when executed by a processor, the computer program controls the device on which the computer-readable storage medium resides to execute the malicious code detection method based on malicious block annotation and image processing provided by the invention.

In general, the above technical solutions conceived by the invention achieve the following beneficial effects:

(1) The invention divides the binary file of the malicious code to be detected into basic blocks, detects the malicious blocks among them (the basic blocks related to malicious functions), and annotates their locations in the binary file, thereby locating the malicious blocks. In the subsequent visualization-based classification, the annotated locations are used to apply image processing to the regions of the converted grayscale image that correspond to malicious blocks, increasing the local contrast of those regions. This increases the weight of malicious-block content in the classification result, weakens the influence of content unrelated to malicious functions, and improves the accuracy of malicious code detection.

(2) In a preferred solution, a neural network model serves as the malicious block detection model. The model performs feature extraction on an input basic block and reconstructs it, which enables anomaly detection. Because the model is trained on benign samples unrelated to malicious functions, the difference between its input and output is small for benign basic blocks and larger for malicious blocks. On this basis, the invention can accurately identify and locate malicious blocks in binary code.

(3) In a preferred solution, the code features extracted from a basic block for malicious block detection include structural features, arithmetic instruction features, transfer instruction features and API call features. The structural features include the number of children of the basic block and its intermediate value; the arithmetic instruction features include the numbers of basic arithmetic, shift and logical operation instructions in the basic block; the transfer instruction features include the numbers of stack operations, register operations and port operations in the basic block; and the API call features include the numbers of calls in the basic block to APIs related to dll, process, service and system information. These features reflect the function implemented by a basic block comprehensively and accurately. Using them as the input of the malicious block detection model, the invention can accurately identify whether the function of a basic block is malicious and thus detect malicious blocks accurately.

(4) In a preferred solution, the Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm is used to increase the local contrast of the image regions corresponding to malicious blocks, which limits noise amplification while increasing local contrast.

(5) In a preferred solution, a Vision Transformer model implements the malicious code classification model. The model splits the input image into many patches, arranges the patches into a sequence of linear embeddings, and feeds that sequence to the Transformer in the same way that a sequence of words is fed in NLP. With this model, the invention classifies the grayscale images converted from malicious code binaries well.

Description of the Drawings

Fig. 1 is a flowchart of the malicious code detection method based on malicious block annotation and image processing provided by an embodiment of the invention;

Fig. 2 is a schematic diagram, provided by an embodiment of the invention, of the process from image generation and image processing to model training and model verification;

Fig. 3 is a schematic diagram of the malicious block detection model provided by an embodiment of the invention;

Fig. 4 is a diagram of the process of applying contrast-limited adaptive histogram equalization to a grayscale image, provided by an embodiment of the invention;

Fig. 5 is a schematic diagram of the bilinear interpolation method provided by an embodiment of the invention.

Detailed Description

To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.

In the invention, the terms "first", "second" and the like (if any) in the description and drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence.

To solve the technical problem that the classification results of existing malicious code detection methods are disturbed by content unrelated to malicious functions, resulting in low classification accuracy, the invention provides a malicious code detection method based on malicious block annotation and image processing. The overall idea is to locate and annotate the malicious blocks in the binary file of the malicious code to be detected, and to apply image processing to the regions of the converted grayscale image that correspond to the malicious blocks so as to increase their local contrast. This raises the importance of malicious-function-related content in the classification result, weakens the influence of unrelated content, and improves classification accuracy.

The following are embodiments.

Embodiment 1:

A malicious code detection method based on malicious block annotation and image processing, as shown in Fig. 1 and Fig. 2, comprises the following steps:

(S1) Divide the binary file of the malicious code to be detected into a plurality of basic blocks, detect whether each basic block is a malicious block, and annotate the locations of the malicious blocks in the binary file; a malicious block is a basic block related to a malicious function.

A basic block is a sequence of instructions executed sequentially, with exactly one entry and one exit. By dividing the binary file into basic blocks and judging whether each basic block is related to a malicious function, this embodiment can effectively locate and annotate malicious blocks.
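The text does not say how the binary is partitioned. A common way to obtain single-entry, single-exit blocks is the classic "leader" rule, sketched below on a toy instruction list. The `(address, mnemonic, target)` tuple format and the set of branch mnemonics are hypothetical simplifications, not the output format of any real disassembler.

```python
from typing import List, Optional, Tuple

Instr = Tuple[int, str, Optional[int]]  # (address, mnemonic, branch target or None)
BRANCH_MNEMONICS = {"jmp", "jz", "jnz", "ret"}  # assumed toy instruction set


def split_basic_blocks(instrs: List[Instr]) -> List[List[Instr]]:
    """Split a linear instruction list into basic blocks using the leader rule."""
    if not instrs:
        return []
    leaders = {instrs[0][0]}  # the first instruction starts a block
    for i, (_, mnemonic, target) in enumerate(instrs):
        if mnemonic in BRANCH_MNEMONICS:
            if target is not None:
                leaders.add(target)            # a branch target starts a block
            if i + 1 < len(instrs):
                leaders.add(instrs[i + 1][0])  # so does the instruction after a branch
    blocks: List[List[Instr]] = []
    current: List[Instr] = []
    for ins in instrs:
        if ins[0] in leaders and current:
            blocks.append(current)
            current = []
        current.append(ins)
    blocks.append(current)
    return blocks
```

Recording the start and end address of each resulting block is what later allows a block judged malicious to be mapped back to a byte range in the file.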

Optionally, in step (S1) of this embodiment, whether any given basic block is a malicious block is detected in the manner set out in the summary above: the code features of the basic block are extracted and converted into a feature vector, the trained malicious block detection model reconstructs the feature vector, and the block is judged malicious if the difference between the reconstruction and the feature vector exceeds a preset threshold.

Because the malicious block detection model performs feature extraction on the input basic block and reconstructs it, it can serve as an anomaly detector. This embodiment trains it only on benign samples unrelated to malicious functions, so for benign basic blocks the difference between the model's input and output is small, whereas for malicious blocks it is larger. On this basis, malicious blocks in the binary code can be identified and located accurately. Optionally, in this embodiment a U-Net model is selected as the malicious block detection model. The U-Net model is an autoencoder (AutoEncoder) model; an autoencoder is a type of artificial neural network used in semi-supervised and unsupervised learning that performs representation learning on its input by using the input itself as the learning target.

The structure of the U-Net model is shown in Fig. 3. The model comprises an encoder g and a decoder f. An input x passed through the whole network yields an output x′, that is:

$$f(g(x)) = x'$$

The autoencoder takes the reconstruction error between x′ and x as its loss and learns to make the gap between x and x′ progressively smaller. After training on a large number of benign samples, the gap between x and x′ is therefore small for benign basic blocks and larger for malicious basic blocks, so potentially malicious basic blocks can be detected on this basis.
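The effect described here, small reconstruction error on data resembling the training set and large error elsewhere, can be demonstrated without a neural network. The sketch below deliberately swaps the U-Net for the simplest possible stand-in: a linear autoencoder with a one-dimensional code, fitted to benign samples by power iteration (equivalent to keeping one principal direction). All names are illustrative, and the toy two-dimensional data is invented for the example.

```python
import math
from typing import List, Sequence, Tuple


def fit_rank1_autoencoder(samples: Sequence[Sequence[float]],
                          iters: int = 50) -> Tuple[List[float], List[float]]:
    """Return (mean, direction) minimizing linear reconstruction error on samples."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(dim)]
    centered = [[s[j] - mean[j] for j in range(dim)] for s in samples]
    v = [1.0 / (j + 1) for j in range(dim)]          # arbitrary starting vector
    start_norm = math.sqrt(sum(x * x for x in v))
    v = [x / start_norm for x in v]
    for _ in range(iters):
        # Power iteration on X^T X without forming the matrix explicitly.
        scores = [sum(c[j] * v[j] for j in range(dim)) for c in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v


def reconstruct(x: Sequence[float], mean: Sequence[float],
                v: Sequence[float]) -> List[float]:
    """x' = f(g(x)): encode to one scalar, then decode back to input space."""
    z = sum((xi - m) * vi for xi, m, vi in zip(x, mean, v))  # encoder g
    return [m + z * vi for m, vi in zip(mean, v)]            # decoder f


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


benign = [[float(t), 2.0 * t] for t in range(-5, 6)]  # toy "benign" pattern
mean, direction = fit_rank1_autoencoder(benign)
```

A point that follows the benign pattern, such as `[3.0, 6.0]`, is reconstructed almost exactly, while a point off the pattern, such as `[2.0, -1.0]`, is not; that gap is what the threshold test exploits.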

It will be readily understood that, to ensure the training effect, this embodiment collects a large number of binary files unrelated to malicious functions to produce benign samples for training the malicious block detection model. After training, malicious samples are produced from basic blocks of malicious code that implement malicious functions, and the trained model is tested on them to ensure that its detection accuracy meets requirements, as shown in Fig. 2. In some other embodiments of the invention, the malicious block detection model may also be implemented with other models capable of feature extraction and reconstruction.

To identify accurately whether the function implemented by a basic block is malicious, the code features extracted in this embodiment are as follows. The structural features include the number of children of the basic block and its intermediate value; the arithmetic instruction features include the numbers of basic arithmetic, shift and logical operation instructions in the basic block; the transfer instruction features include the numbers of stack operations, register operations and port operations in the basic block; and the API call features include the numbers of calls in the basic block to APIs related to dll, process, service and system information. These 12 features in four categories reflect the function implemented by a basic block comprehensively and accurately. Using them as the input of the malicious block detection model, this embodiment can accurately identify whether the function of a basic block is malicious and thus detect malicious blocks accurately. In practice, the Binary Ninja tool can be used directly to extract the code features of basic blocks.
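One way to keep the 12 features in a fixed order is a small record type, sketched below. The field names are hypothetical labels for the features listed above, and how each count is actually obtained from a disassembler is left out.

```python
from dataclasses import dataclass, astuple
from typing import List


@dataclass
class BlockFeatures:
    # Structural features (2)
    n_children: int
    intermediate_value: float
    # Arithmetic instruction features (3)
    n_basic_arithmetic: int
    n_shift: int
    n_logical: int
    # Transfer instruction features (3)
    n_stack_ops: int
    n_register_ops: int
    n_port_ops: int
    # API call features (4)
    n_dll_api_calls: int
    n_process_api_calls: int
    n_service_api_calls: int
    n_sysinfo_api_calls: int

    def to_vector(self) -> List[float]:
        """Flatten to the fixed-order vector fed to the detection model."""
        return [float(v) for v in astuple(self)]


example = BlockFeatures(2, 0.5, 3, 1, 0, 4, 6, 0, 1, 0, 0, 2)
print(len(example.to_vector()))  # 12
```

Because the raw counts have very different ranges, some normalization before training would be a reasonable addition, although the text does not mention one.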

Through step (S1), this embodiment locates and annotates the malicious blocks in the binary file. On this basis, this embodiment further includes the step:

(S2) Convert the binary file into a grayscale image, and increase the local contrast of the image regions corresponding to the malicious blocks to obtain a target grayscale image.

In this embodiment, the binary file is converted into a grayscale image in the manner set out in the summary above: in code order, every 8 bits are converted into an unsigned integer and stored in a vector, and the vector is reshaped into a matrix whose elements are the pixel gray values.

It will be readily understood that, in this conversion, each pixel corresponds to an 8-bit unsigned integer whose range, 0 to 255, matches the range of pixel gray values, where 0 corresponds to black and 255 to white.
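Because each byte becomes exactly one pixel, a byte range annotated in step (S1) maps directly to pixel coordinates. The sketch below assumes the vector is reshaped row by row (row-major order) with a known image width; both the function names and that layout are assumptions.

```python
from typing import Tuple


def offset_to_pixel(offset: int, width: int) -> Tuple[int, int]:
    """Map a byte offset in the file to (row, column) in the grayscale image."""
    return divmod(offset, width)


def byte_range_to_rows(start: int, end: int, width: int) -> Tuple[int, int]:
    """Return (first_row, last_row) covering bytes start .. end-1."""
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    return start // width, (end - 1) // width


print(offset_to_pixel(10, 4))        # (2, 2)
print(byte_range_to_rows(6, 13, 4))  # (1, 3)
```

Taking whole rows gives a rectangular region around the block, which is convenient for tile-based processing even though it includes some neighboring bytes.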

Using the annotation result of step (S1), the region of the converted grayscale image that corresponds to each malicious block can be located, and its local contrast can be increased by image processing. As a preferred implementation, this embodiment uses the Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm to increase the local contrast of the regions corresponding to malicious blocks, which limits noise amplification while increasing local contrast. The CLAHE-based process is shown in Fig. 4 and comprises the following steps:

(S21) according to the location of the malicious block in the image, taking a regularly shaped local image, and dividing the local image into non-overlapping sub-blocks of equal size;

(S22) computing the histogram of each sub-block from the image;

(S23) computing clipLimit from the sub-block histograms obtained in sub-step (S22);

(S24) in the gray-level histogram of each sub-block image, clipping off the pixel counts that exceed clipLimit and redistributing the clipped pixels evenly across all gray levels;

(S25) reconstructing the pixel gray values by bilinear interpolation, finally achieving histogram equalization.
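Sub-steps (S22) and (S24) can be sketched as follows. The clip limit is passed in as a parameter because the text does not give the formula used in (S23), and handing the leftover counts to the lowest gray levels is an arbitrary choice made only to keep the total pixel count unchanged.

```python
from typing import List, Sequence


def tile_histogram(tile: Sequence[Sequence[int]], levels: int = 256) -> List[int]:
    """Count how many pixels of a sub-block fall on each gray level (S22)."""
    hist = [0] * levels
    for row in tile:
        for value in row:
            hist[value] += 1
    return hist


def clip_histogram(hist: Sequence[int], clip_limit: int) -> List[int]:
    """Clip bins above clip_limit and spread the excess over all bins (S24)."""
    excess = sum(max(h - clip_limit, 0) for h in hist)
    clipped = [min(h, clip_limit) for h in hist]
    share, remainder = divmod(excess, len(hist))
    clipped = [h + share for h in clipped]
    for i in range(remainder):  # assumption: leftovers go to the first bins
        clipped[i] += 1
    return clipped


print(clip_histogram([10, 0, 2, 0], clip_limit=4))  # [6, 2, 3, 1]
```

After a single redistribution pass some bins can end up slightly above the limit again; implementations differ on whether to repeat the pass, and this sketch does not.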

As shown in Fig. 5, for each pixel in the grayscale image, the abscissa represents the current pixel value and the ordinate represents the transformed pixel value. After the image is divided into blocks as described above, each sub-block uses a different gray-level transformation function for equalization. The interpolation comprises the following steps:

1) First, according to pixel position, the whole image area is divided into three classes of regions, A, B and C, representing the corner regions, the edge regions and the central region respectively.

2) Each pixel in the image is examined in turn to determine which class of region it belongs to. Pixels in different regions are processed differently.

3) If the pixel belongs to a class-A region, no interpolation is performed on it, and the gray-level transformation function is applied directly:

where cdf(x) denotes the cumulative distribution value for pixel value x in the sub-image; cdf_min and cdf_max denote the minimum and maximum values of the sub-image's cumulative pixel distribution, respectively; and L denotes the total number of gray levels, usually 256;

4) If the pixel belongs to a class-B region, the transformation functions of the two class-A regions adjacent to the pixel are each assigned a label, and two points M and N belonging to those two class-A regions are chosen so that the line MN lies on the same horizontal line as the pixel. With the pixel values at M and N denoted x₁ and x₂, a linear interpolation transform is applied to the pixel:

5) If the pixel belongs to a class-C region, see point P in Fig. 4; for point P, a bilinear interpolation transform is applied:
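The formulas for steps 3) to 5) are not reproduced in this text. The sketch below shows one common way such a scheme is realized and rests on two assumptions: that the direct transform has the usual histogram-equalization form, rescaling cdf(x) from [cdf_min, cdf_max] to [0, L−1], and that interpolation blends the outputs of neighboring sub-block transforms with position-dependent weights. The function names and the choice of the smallest non-zero cdf value as cdf_min are illustrative.

```python
from typing import List, Sequence


def equalization_lut(hist: Sequence[int], levels: int = 256) -> List[int]:
    """Lookup table T(x) = round((cdf(x) - cdf_min) / (cdf_max - cdf_min) * (L - 1))."""
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    nonzero = [c for c in cdf if c > 0]
    if not nonzero or nonzero[0] == cdf[-1]:
        return list(range(levels))  # empty or single-level tile: leave unchanged
    cdf_min, cdf_max = nonzero[0], cdf[-1]
    scale = (levels - 1) / (cdf_max - cdf_min)
    return [round(max(c - cdf_min, 0) * scale) for c in cdf]


def bilinear_blend(v_tl: float, v_tr: float, v_bl: float, v_br: float,
                   fx: float, fy: float) -> float:
    """Blend four transformed values; fx, fy in [0, 1] are horizontal/vertical weights."""
    top = (1 - fx) * v_tl + fx * v_tr
    bottom = (1 - fx) * v_bl + fx * v_br
    return (1 - fy) * top + fy * bottom


print(equalization_lut([1, 1, 1, 1], levels=4))    # [0, 1, 2, 3]
print(bilinear_blend(0, 10, 20, 30, 0.5, 0.5))     # 15.0
```

With `fy = 0` the blend reduces to linear interpolation between two values, which corresponds to the edge case in step 4); with both weights at zero it returns the first value unchanged, matching the corner case in step 3).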

After CLAHE processing, the images are brought to a uniform size, for example by equidistant scaling and sampling.

Through step (S2) above, the malicious code binary file has been converted into a grayscale image in which the local contrast of the regions corresponding to malicious blocks has been effectively increased. On this basis, this embodiment further includes:

(S3) Input the target grayscale image into the trained malicious code classification model to predict the probability that the malicious code belongs to each family category, and determine the family category with the highest probability as the family category to which the malicious code belongs; the malicious code classification model is a neural network model used to predict the probability that the malicious code corresponding to an input grayscale image belongs to each family category.
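The final decision in step (S3) is an argmax over per-family probabilities. The sketch below assumes the classifier emits raw scores that are turned into probabilities with a softmax, which the text does not state; the family names are placeholders.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_family(scores: Sequence[float],
                   families: Sequence[str]) -> Tuple[str, List[float]]:
    """Return the family with the highest probability and all probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return families[best], probs


family, probs = predict_family([0.1, 2.0, -1.0], ["family_a", "family_b", "family_c"])
print(family)  # family_b
```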

As a preferred implementation, in this embodiment the malicious code classification model is a Vision Transformer model.

The Transformer is an end-to-end NLP model proposed by a Google team in 2017. It abandons the traditional sequential RNN structure in favor of a self-attention mechanism, which allows the model to be trained in parallel and to capture global information. The Vision Transformer can be regarded as the image version of the Transformer: with as few modifications as possible, the standard Transformer model is carried over directly to the image domain. The Vision Transformer model splits the input image into many patches, arranges the patches into a sequence of linear embeddings, and feeds that sequence to the Transformer in the same way that a sequence of words is fed in NLP. In the application scenario of this embodiment, this model classifies the grayscale images converted from malicious code binaries well.
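The patch-splitting step can be illustrated on a grayscale image stored as a list of rows. This sketch covers only the cutting and flattening of patches; the learned linear projection, position embeddings and the Transformer itself are omitted, and the requirement that the image dimensions be multiples of the patch size is an assumption of the example.

```python
from typing import List, Sequence


def image_to_patches(image: Sequence[Sequence[int]], patch: int) -> List[List[int]]:
    """Cut an H x W image into non-overlapping patch x patch tiles, each flattened."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


image = [[4 * r + c for c in range(4)] for r in range(4)]  # 4x4 toy image, values 0..15
print(image_to_patches(image, 2)[0])  # [0, 1, 4, 5]
```

Each flattened patch plays the role that a word token plays in NLP: the sequence of patches, read left to right and top to bottom, is what the Transformer receives after embedding.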

应当说明的是，在分类准确度满足要求的情况下，也可使用其他的图像分类模型。It should be noted that, if the classification accuracy meets the requirements, other image classification models can also be used.

总的来说，本实施例对恶意代码二进制文件中的恶意块进行定位，然后在图像中对恶意块部分进行局部对比度的提升，能够有效提升分类的准确度。In general, this embodiment locates the malicious block in the malicious code binary file, and then improves the local contrast of the malicious block in the image, which can effectively improve the classification accuracy.

Embodiment 2:

A computer-readable storage medium, comprising a stored computer program; when executed by a processor, the computer program controls the device on which the computer-readable storage medium resides to perform the malicious code detection method based on malicious block annotation and image processing provided in Embodiment 1 above.

Those skilled in the art will readily understand that the above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims

1. A malicious code detection method based on malicious block annotation and image processing, characterized by comprising the steps of:

(S1) dividing the binary file of the malicious code to be detected into a plurality of basic blocks, detecting whether each basic block is a malicious block, and marking the position of each malicious block in the binary file, a malicious block being a basic block related to a malicious function;

(S2) converting the binary file into a grayscale image, and increasing the local contrast of a part of the image corresponding to the malicious block in the grayscale image to obtain a target grayscale image;

(S3) Input the target grayscale image into the trained malicious code classification model to predict the probability that the malicious code belongs to each family category, and determine the family category with the highest probability as the family category to which the malicious code belongs;

Wherein, the malicious code classification model is a neural network model, which is used to predict the probability that the malicious code corresponding to the input grayscale image belongs to each family category.

2. The malicious code detection method based on malicious block labeling and image processing as claimed in claim 1, characterized in that, in step (S1), detecting whether any given basic block is a malicious block comprises:

Extracting the code features of the basic block and converting them into a feature vector; the code features include structural features, arithmetic instruction features, transfer instruction features, and API call features;

Inputting the feature vector into the trained malicious block detection model, which performs feature extraction and reconstruction on the feature vector to obtain a reconstructed feature;

If the difference between the reconstructed feature output by the malicious block detection model and the feature vector is greater than a preset threshold, determining that the basic block is a malicious block; otherwise, determining that the basic block is not a malicious block;

Wherein the malicious block detection model is a neural network model used for feature extraction and reconstruction of the input basic blocks, and its training method comprises:

Collecting binary files unrelated to malicious functions, dividing them into basic blocks, and extracting the code features of the basic blocks as benign samples to obtain a benign sample set;

Initializing the malicious block detection model, training it on the benign sample set with the objective of minimizing the reconstruction loss, and obtaining the trained malicious block detection model when training is complete.

3. The malicious code detection method based on malicious block labeling and image processing as claimed in claim 2, characterized in that the structural features comprise the number of descendants of the basic block and its intermediate value; the arithmetic instruction features comprise the numbers of basic mathematics, shift, and logic operation instructions contained in the basic block; the transfer instruction features comprise the numbers of stack operations, register operations, and port operations in the basic block; and the API call features comprise the numbers of DLL, process, service, and system information API calls in the basic block.

4. The malicious code detection method based on malicious block annotation and image processing as claimed in claim 3, wherein the malicious block detection model is an autoencoder model.

5. The malicious code detection method based on malicious block labeling and image processing according to any one of claims 1 to 4, characterized in that, in step (S2), a contrast-limited adaptive histogram equalization algorithm is used to increase the local contrast of the part of the grayscale image corresponding to the malicious block.

6. The malicious code detection method based on malicious block labeling and image processing according to any one of claims 1 to 4, wherein the malicious code classification model is a Vision Transformer model.

7. The malicious code detection method based on malicious block annotation and image processing according to any one of claims 1 to 4, wherein, in step (S2), converting the binary file into a grayscale image comprises:

Converting every 8 bits, taken as a unit in code order, into an unsigned integer, and storing the values in an unsigned integer vector;

Converting the unsigned integer vector into a matrix, using each element of the matrix as a pixel and the value of the element as the gray value of the corresponding pixel, to obtain the grayscale image.

8. A computer-readable storage medium, comprising a stored computer program; when executed by a processor, the computer program controls the device on which the computer-readable storage medium resides to perform the malicious code detection method based on malicious block annotation and image processing according to any one of claims 1 to 7.
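The byte-to-pixel conversion recited in claim 7 can be sketched as follows. This is a minimal illustration, assuming a fixed image width chosen by the caller and zero padding of the final row; neither the width rule nor the padding is stated in the claims. The function name and the sample bytes are hypothetical.

```python
def bytes_to_gray_image(data, width):
    """Read the file content byte by byte (8 bits per unit, in file order),
    treat each byte as an unsigned integer 0..255, and reshape the vector
    into rows of `width` pixels."""
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = list(data)  # iterating over bytes yields unsigned integers
    remainder = len(pixels) % width
    if remainder:
        pixels.extend([0] * (width - remainder))  # assumed zero padding
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]

# Five sample bytes arranged into rows of two pixels.
gray = bytes_to_gray_image(b"\x4d\x5a\x90\x00\x03", width=2)
```

Each element of `gray` is one pixel whose gray value equals the byte value, so the matrix can be saved or passed on as an 8-bit grayscale image.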