CN116821053B - Data reporting methods, devices, computer equipment and storage media

Info

Publication number: CN116821053B
Application number: CN202311103374.1A
Authority: CN
Inventors: 韩孟玲; 白冰; 张兴明; 申大坤; 孙天宁
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2023-08-30
Filing date: 2023-08-30
Publication date: 2023-11-21
Anticipated expiration: 2043-08-30
Also published as: CN116821053A

Abstract

The application relates to a data reporting method, a device, computer equipment and a storage medium. The data to be reported of a file is obtained and its characteristic value is extracted. According to the characteristic value, the data to be reported is classified into different bucket files and stored. Within each bucket file, the data to be reported is clustered according to similarity to obtain multiple groups of data clusters to be reported. Each data cluster to be reported is scored according to the proportion of normal data samples to malicious data samples in it, and multiple groups of data clusters to be reported are selected for reporting according to the scores. Clustering reduces the reporting of repeated or similar useless data, and scoring filters the reported data. This solves the problem of low file data reporting efficiency in the related art, reduces the space required for storing data, and improves the efficiency of file data reporting.

Description

Data reporting methods, devices, computer equipment and storage media

Technical field

This application relates to the technical field of data reporting, and in particular to a data reporting method, device, computer equipment and storage medium.

Background

In cloud scenarios, many services need to report the file data of cloud servers. However, reporting all of the massive file data on a cloud server is costly, and full reporting occupies more resources of the detection engine. File data therefore needs to be reported selectively according to the application scenario.

In the prior art, file data to be reported in cloud scenarios is usually selected based on historical experience data. This approach has low reporting efficiency, easily overlooks important file data, and has limited applicability.

At present, no effective solution has been proposed for the problem of low file data reporting efficiency in the related art.

Summary of the invention

Based on this, it is necessary to address the above technical problem and provide a data reporting method, device, computer equipment and computer-readable storage medium that can improve the efficiency of file data reporting.

In a first aspect, this application provides a data reporting method. The method includes:

Obtaining the data to be reported of a file;

Extracting the characteristic value of the data to be reported;

Classifying the data to be reported into different bucket files for storage according to the characteristic value;

Performing similarity calculation on the data to be reported under the same bucket file, and clustering the data to be reported according to the similarity to generate multiple groups of data clusters to be reported, wherein each data cluster to be reported contains normal data samples and malicious data samples;

Scoring each group of data clusters to be reported according to the proportion of the normal data samples and the malicious data samples in that group, and selecting multiple groups of data clusters to be reported for reporting according to the scores.

In one embodiment, the data to be reported includes file reporting path data and malicious file path data, and the data fields of the file reporting path data and the malicious file path data include the first identification code of the file, the file name, the file path, the directory and the generation time.

In one embodiment, extracting the characteristic value of the data to be reported includes:

Using the first identification code as the primary key, extracting the directory of the file;

Splitting the directory to obtain multiple byte fragments;

Calculating the second identification codes of the multiple byte fragments, and merging the multiple second identification codes to generate the first characteristic value of the data to be reported.

In one embodiment, extracting the characteristic value of the data to be reported further includes:

Randomly shuffling the rows of the first characteristic value multiple times;

Mapping the first set corresponding to the first characteristic value obtained after each shuffle to a second set, wherein the mapped values in the second set are all distinct;

Searching the mapped values in the second set in ascending order until the first characteristic value corresponding to the found mapped value is the first preset value;

Obtaining the bit-position number corresponding to the found mapped value, and merging the multiple bit-position numbers to obtain the second characteristic value;

Classifying the data to be reported according to the second characteristic value.

In one embodiment, clustering the data to be reported in the bucket file includes:

Selecting a piece of unclustered data to be reported, and calculating the similarity between the selected data to be reported and the already clustered groups of data clusters to be reported in the same bucket file;

When the similarity is greater than a first threshold, merging the selected data to be reported into the similar data cluster to be reported;

When the similarity is less than the first threshold, creating a new data cluster to be reported from the selected data to be reported.

In one embodiment, selecting multiple groups of data clusters to be reported for reporting according to the scores includes:

Selecting the N data clusters to be reported with the highest scores for reporting, or selecting the data clusters to be reported whose scores exceed a second threshold for reporting.

In one embodiment, reporting the data cluster to be reported includes:

Splitting the data to be reported into multiple directory names according to the path of the data to be reported in the data cluster to be reported;

Performing regular-expression substitution on the path according to the directory names, and calculating the degree of merging of the path after substitution;

When the degree of merging is lower than a third threshold, continuing the regular-expression substitution on the path until the degree of merging of the path after substitution is higher than the third threshold;

Extracting the regular expression of the merged path, and calculating the coverage of the regular expression over the corresponding data to be reported, as well as its global coverage over all the data to be reported in the data cluster to be reported;

Selecting the regular expression according to the coverage and the global coverage, and reporting the data cluster to be reported according to the selected regular expression.

In a second aspect, this application also provides a data reporting device. The device includes:

An acquisition module, used to obtain the data to be reported of a file;

An extraction module, used to extract the characteristic value of the data to be reported;

A bucketing module, used to classify the data to be reported into different bucket files for storage according to the characteristic value;

A clustering module, used to perform similarity calculation on the data to be reported under the same bucket file and to cluster the data to be reported according to the similarity to generate multiple groups of data clusters to be reported, wherein each data cluster to be reported contains normal data samples and malicious data samples;

A scoring module, used to score each group of data clusters to be reported according to the proportion of the normal data samples and the malicious data samples in that group, and to select multiple groups of data clusters to be reported for reporting according to the scores.

In a third aspect, this application also provides a computer device. The computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the following steps when executing the computer program:

Obtaining the data to be reported of a file;

In a fourth aspect, this application also provides a computer-readable storage medium. The computer-readable storage medium has a computer program stored thereon, and when the computer program is executed by a processor, the following steps are implemented:

Obtaining the data to be reported of a file;

The above data reporting method, device, computer equipment and storage medium obtain the data to be reported of a file, extract its characteristic value, and classify the data to be reported into different bucket files for storage according to the characteristic value. Within each bucket file, the data to be reported is clustered according to similarity to obtain multiple groups of data clusters to be reported. Each group is scored according to the proportion of normal data samples and malicious data samples in it, and multiple groups of data clusters to be reported are selected for reporting according to the scores. Clustering reduces the reporting of repeated or similar useless data, and scoring filters the reported data. This solves the problem of low file data reporting efficiency in the related art, reduces the space required for storing data, and improves the efficiency of file data reporting.

Brief description of the drawings

Figure 1 is an application environment diagram of the data reporting method in one embodiment;

Figure 2 is a schematic flow chart of the data reporting method in one embodiment;

Figure 3 is a flow chart of the first characteristic value calculation of the data reporting method in one embodiment;

Figure 4 is a flow chart of the second characteristic value calculation of the data reporting method in one embodiment;

Figure 5 is a flow chart of bucketing and clustering in the data reporting method in one embodiment;

Figure 6 is a diagram of regular expression extraction in the data reporting method in one embodiment;

Figure 7 is an overall flow chart of the data reporting method in one embodiment;

Figure 8 is a structural block diagram of the data reporting device in one embodiment;

Figure 9 is an internal structure diagram of the computer device in one embodiment.

Detailed description

To make the purpose, technical solutions and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.

The data reporting method provided by the embodiments of this application can be applied in the application environment shown in Figure 1. The terminal 102 communicates with the server 104 through a network. A data storage system can store the data that the server 104 needs to process. The data storage system can be integrated on the server 104, or placed on the cloud or on another network server. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone or tablet computer. The server 104 can be implemented as an independent server or as a server cluster composed of multiple servers.

In one embodiment, as shown in Figure 2, a data reporting method is provided. Taking the method as applied to the terminal in Figure 1 as an example, it includes the following steps:

Step S202: Obtain the data to be reported of the file.

The data to be reported of the file includes the file name, the file path, whether the file is malicious, the file generation time, and so on. After the data to be reported is obtained, it is also cleaned to remove dirty data. Dirty data includes, but is not limited to, empty data, incomplete path data, garbled text and other kinds of abnormal data.
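The cleaning step above can be sketched as follows. This is only an illustration under stated assumptions: the `FileRecord` type, its field names, and the specific checks (absolute POSIX-style paths, the Unicode replacement character as a marker of garbled text) are hypothetical choices, since the text does not define the cleaning rules.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FileRecord:
    # Hypothetical record layout; field names are illustrative only.
    md5: str
    file_name: str
    file_path: str
    is_malicious: bool
    created_at: str


def is_dirty(rec: FileRecord) -> bool:
    """Return True for empty, incomplete or garbled records."""
    # Empty data: any required text field is missing.
    if not rec.md5 or not rec.file_name or not rec.file_path:
        return True
    # Incomplete path: assumed here to mean "not an absolute path".
    if not rec.file_path.startswith("/"):
        return True
    # Garbled text: replacement characters or non-printable characters.
    if "\ufffd" in rec.file_path or not rec.file_path.isprintable():
        return True
    return False


def clean(records: Iterable[FileRecord]) -> List[FileRecord]:
    """Keep only records that pass every check."""
    return [r for r in records if not is_dirty(r)]
```

In practice the checks would be tuned to the actual data source; the point is only that cleaning happens once, before any characteristic value is computed.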

Step S204: Extract the characteristic value of the data to be reported.

Features are extracted from the obtained data to be reported. The features include the file directory, the number of files generated globally under the directory within a set period, the numbers of malicious and normal files, and their proportions of the total number of files. The period can be adjusted according to the actual situation and includes, but is not limited to, weeks, days and hours.

Step S206: Classify the data to be reported into different bucket files for storage according to the characteristic value.

Taking the cloud scenario as an example, the amount of file data reported every day is on the order of 10 billion records, so reporting all file data would be very inefficient and would waste a large amount of resources. The data to be reported therefore needs to be filtered so that only non-duplicate, suitable data is reported. To this end, this embodiment first buckets the data to be reported according to its characteristic value, placing data that may be similar into the same bucket file.

Step S208: Perform similarity calculation on the data to be reported under the same bucket file, and cluster the data to be reported according to the similarity to generate multiple groups of data clusters to be reported, where each data cluster to be reported contains normal data samples and malicious data samples.

After bucketing, only data within the same bucket file is compared for similarity; data in different bucket files is regarded as dissimilar. In a distributed environment, the data to be reported in the same bucket file is assigned to the same distributed computing node. Within a bucket file, all data is processed one record at a time for clustering, yielding multiple groups of data clusters to be reported. After clustering, the samples under each data cluster to be reported have similar directory structures and include both normal data samples and malicious data samples.

Step S210: Score each group of data clusters to be reported according to the proportion of normal data samples and malicious data samples in that group, and select multiple groups of data clusters to be reported for reporting according to the scores.

After the groups of data clusters to be reported are obtained, the score of each data cluster is calculated from the numbers and proportions of normal and malicious data samples in it and from its number of directories. The score determines the ranking of the data cluster among all data clusters: the higher the score, the higher the ranking, and the more the data cluster should be reported.

For example, in a malicious file detection scenario, the reporting principle is that the more samples a data cluster to be reported represents, and the more malicious data samples it contains, the more it deserves to be reported; otherwise it does not. This is because, in malicious file detection, the back end needs to pay more attention to directories from which malicious files are reported, and reporting of directories without malicious files should be minimized to save resources and improve detection service performance. In this case, the scoring formula for a data cluster to be reported is: number of malicious data samples × M - number of normal data samples × 0.1 / (number of directories × 10). M is a hyperparameter that can be adjusted according to the actual situation; its default value is 100.
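The scoring formula and the two selection rules (top N, or score above a threshold) can be illustrated with a short sketch. The `Cluster` type and function names are hypothetical, and the formula is evaluated with ordinary operator precedence, that is, malicious × M minus (normal × 0.1 / (directories × 10)); the text does not state the grouping explicitly, so this reading is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    # Hypothetical summary of one data cluster to be reported.
    name: str
    malicious: int    # number of malicious data samples
    normal: int       # number of normal data samples
    directories: int  # number of directories (assumed >= 1)


def score(c: Cluster, m: float = 100.0) -> float:
    """Score a cluster; m is the hyperparameter M (default 100)."""
    return c.malicious * m - c.normal * 0.1 / (c.directories * 10)


def select(clusters: List[Cluster],
           top_n: Optional[int] = None,
           min_score: Optional[float] = None) -> List[Cluster]:
    """Rank clusters by score, then keep the top N or those above min_score."""
    ranked = sorted(clusters, key=score, reverse=True)
    if top_n is not None:
        return ranked[:top_n]
    if min_score is not None:
        return [c for c in ranked if score(c) > min_score]
    return ranked
```

With the default M, a single malicious sample outweighs a very large number of normal samples, which matches the stated goal of prioritizing directories where malicious files have been seen.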

In the above data reporting method, clustering is not applied directly to all of the data to be reported. Instead, the characteristic value of the data to be reported is extracted first and used for a preliminary screening: data that may be similar is placed in the same bucket file, and similarity calculation and clustering are performed only on data within the same bucket file, which improves the computational efficiency of clustering. Similarity is calculated for the data to be reported under the same bucket file, the data is clustered according to the similarity to generate multiple groups of data clusters to be reported, the clusters are scored according to the proportion and number of malicious and normal data samples in them, and clusters are selected for reporting according to the scores. This greatly reduces the reporting of repeated or similar useless data, and the scoring filters the reported data. The problem of low data reporting efficiency in the prior art is solved, the efficiency of data reporting is improved, and the space required for storing data is reduced.

In one embodiment, the data to be reported includes file reporting path data and malicious file path data, and the data fields of the file reporting path data and the malicious file path data include the first identification code of the file, the file name, the file path, the directory and the generation time.

The data to be reported includes normal file reporting path data and malicious file path data. The first identification code is the md5 of the normal or malicious file. The md5 of a file is the hash value generated by processing the file with the MD5 hash algorithm and serves as the unique identification code of the file. For two files, if the first identification codes are the same, the files are the same; otherwise they are different. If a file is modified, its first identification code changes accordingly.
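As a minimal illustration of the first identification code, the following sketch computes the MD5 digest of a file's content with Python's standard `hashlib`. The function name is hypothetical, and the content is passed as bytes so that the example stays self-contained.

```python
import hashlib


def first_identification_code(content: bytes) -> str:
    """Return the MD5 hex digest of the file content."""
    return hashlib.md5(content).hexdigest()


original = first_identification_code(b"hello")
same = first_identification_code(b"hello")
modified = first_identification_code(b"hello!")
# Identical content gives the same code; any modification changes it.
print(original == same, original == modified)
```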

In this embodiment, the obtained data to be reported is stored in the data platform as raw data, providing data support for subsequent data processing and reporting.

In one embodiment, extracting the characteristic value of the data to be reported includes: using the first identification code as the primary key, extracting the directory of the file; splitting the directory to obtain multiple byte fragments; and calculating the second identification codes of the multiple byte fragments and merging them to generate the first characteristic value of the data to be reported.

With the first identification code as the primary key, the data to be reported of the file is preprocessed, which includes splitting the file path. In this embodiment, the path is split by path separator and by n-gram. n-gram is an algorithm based on a statistical language model. Its basic idea is to slide a window of size n over the content of the text byte by byte, producing a sequence of byte fragments of length n. The value of n can be adjusted according to the actual situation. The second identification code, that is, the md5 of each fragment of the split file path, is then calculated, and the resulting N second identification codes are merged to generate the simhash, which is the first characteristic value. Simhash is a fingerprint generation algorithm that reduces the dimensionality of text to a single simhash value; by comparing the simhash values of different texts, the similarity between the texts can be determined.

For example, Figure 3 is a flow chart of the first characteristic value calculation of the data reporting method in this embodiment. As shown in Figure 3, taking 2-gram as an example, the first identification code of the file is used as the primary key to obtain the directory of the file, and the directory is split into multiple byte fragments such as /u, us, le and et. The md5 of each fragment is calculated to obtain N second identification codes. The N second identification codes are merged bit by bit to generate the simhash, which is the first characteristic value.
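The 2-gram splitting and bitwise merge described above can be sketched as follows. Two points are assumptions rather than details from the text: the merge is implemented as the usual simhash per-bit majority vote, and only the first 64 bits of each MD5 digest are used. The function names are illustrative.

```python
import hashlib
from typing import List


def ngrams(data: bytes, n: int = 2) -> List[bytes]:
    """Slide a window of n bytes over the data."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def simhash(directory: str, n: int = 2, bits: int = 64) -> int:
    """Merge the md5 of every n-gram into one fingerprint (majority vote per bit)."""
    counts = [0] * bits
    for frag in ngrams(directory.encode("utf-8"), n):
        # Second identification code: md5 of the fragment, truncated to `bits`.
        digest = hashlib.md5(frag).digest()[: bits // 8]
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    value = 0
    for i, c in enumerate(counts):
        if c > 0:  # bit is set when most fragments have it set
            value |= 1 << i
    return value
```

Because each fragment contributes independently to the per-bit counts, two directories that share most of their fragments tend to produce fingerprints that differ in only a few bits, which is what later similarity comparison relies on.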

In this embodiment, splitting the file path and calculating the identification codes completes the acquisition of the second characteristic value of the data, which facilitates the subsequent similarity calculation. The simhash method is less sensitive to the order of the file data and classifies file data more accurately.

In one embodiment, extracting the characteristic value of the data to be reported further includes: randomly shuffling the rows of the first characteristic value multiple times; mapping the first set corresponding to the first characteristic value obtained after each shuffle to a second set, where the mapped values in the second set are all distinct; searching the mapped values in the second set in ascending order until the first characteristic value corresponding to the found mapped value is the first preset value; obtaining the bit-position number corresponding to the found mapped value and merging the multiple bit-position numbers to obtain the second characteristic value; and classifying the data to be reported according to the second characteristic value.

The second characteristic value is a minhash value. The minhash algorithm is a minimum hash function algorithm: the rows of a column vector are randomly permuted, and the row number of the first non-zero element after the permutation is the minimum hash value. In this embodiment, after the first characteristic value is calculated, the rows of its column vector are randomly shuffled. The first set corresponding to the shuffled first characteristic value is mapped to a second set, and the mapping is a perfect hash function. A perfect hash function maps each element of a set S to another set without collisions. For example, if the set {0,1,2,3,4,5,6} is mapped to the new set {3,2,5,1,0,6,4} and no number in the new set is repeated, the function is a perfect hash function. In this application, the mapped values in the second set are all distinct. They are searched in ascending order until the row of the column vector of the first characteristic value that corresponds to the found mapped value equals the first preset value; in this embodiment, the first preset value is 1. Once the first occurrence of the first preset value is found in this ascending search, the binary bit-position number corresponding to its position is taken. Because the shuffling is performed multiple times, the bit-position numbers obtained in each round are merged, and the merged result is the second characteristic value. Each round of shuffling, mapping, searching and obtaining one bit-position number is one minhash computation. The number of minhash computations determines the bucket size. The more computations, the longer the merged second characteristic value, the more bucket files there are, and the less data to be reported each bucket file holds, but the recall of clustering decreases accordingly. Conversely, the fewer computations, the shorter the merged second characteristic value and the higher the recall of clustering. In general, 2 to 3 minhash computations are enough to meet clustering needs on the order of 1 billion records. The data to be reported is classified into different bucket files according to the second characteristic value. The bucket is determined by an ID calculated from the second characteristic value, for example of the form 12_73_51; all data whose ID is 12_73_51 is placed in the same bucket file.

For example, Figure 4 is a flow chart of the second characteristic value calculation of the data reporting method in one embodiment of this application. As shown in Figure 4, column A is the binary representation of the first characteristic value, column B is the bit-position number of the first characteristic value, and column C is the bit-position number after mapping by the perfect hash function. Taking two minhash computations as an example: in the first, searching in ascending order of column C, the first 1 in column A corresponds to bit-position number 5 in column B, so the result is 5. In the second, searching in ascending order of column C, the first 1 in column A corresponds to bit-position number 3 in column B, so the result is 3. The two minhash results are merged, and the final second characteristic value is 5-1.
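The column A/B/C search and the construction of a bucket ID can be sketched as follows. This is a hedged illustration: the function names, the use of a seeded `random.Random` to generate the permutations, the `_` separator (borrowed from the 12_73_51 example) and the value -1 for a fingerprint with no set bit are all assumptions. The same seed must be used for every record so that identical fingerprints always receive the same ID.

```python
import random
from typing import List


def min_position(bits_a: List[int], mapped_c: List[int]) -> int:
    """Visit positions in ascending order of their mapped value (column C)
    and return the first position (column B) whose bit (column A) is 1."""
    for pos in sorted(range(len(bits_a)), key=lambda p: mapped_c[p]):
        if bits_a[pos] == 1:
            return pos
    return -1  # assumption: no set bit at all


def bucket_id(fingerprint: int, bits: int = 64, rounds: int = 3,
              seed: int = 0) -> str:
    """Run `rounds` minhash computations and join the results into an ID."""
    rng = random.Random(seed)  # fixed seed: same permutations for all records
    bits_a = [(fingerprint >> p) & 1 for p in range(bits)]
    parts = []
    for _ in range(rounds):
        mapped_c = list(range(bits))
        rng.shuffle(mapped_c)  # a permutation is a collision-free mapping
        parts.append(str(min_position(bits_a, mapped_c)))
    return "_".join(parts)
```

Records can then be grouped by `bucket_id(...)`, for instance as the key of a dictionary or as the partition key in a distributed job, so that only records sharing an ID are compared with each other.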

In this embodiment, the second feature value is obtained by mapping the first feature value with a perfect hash function and then taking the minhash. For large amounts of data, the minhash algorithm is computationally efficient and imposes no requirement on the order of the data, so it has a wider scope of application. Bucketing the data as a preprocessing step improves the efficiency of the subsequent clustering.
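The bucketing step described above can be illustrated with a minimal sketch. It assumes the first feature value is available as a list of 0/1 bits, and it uses a seeded random permutation as a stand-in for the shuffle plus perfect-hash mapping; the function name `minhash_bucket_id`, its parameters, and the underscore-joined ID layout are illustrative assumptions rather than the patented implementation.

```python
import random


def minhash_bucket_id(bits, num_hashes=3, seed=0):
    """Return a bucket ID such as '12_73_51' for a 0/1 feature vector.

    Assumption: a seeded permutation of bit positions stands in for the
    random shuffle followed by the perfect-hash mapping.
    """
    if 1 not in bits:
        raise ValueError("feature vector has no bit set to 1")
    # The same seed yields the same permutations for every record of the
    # same length, so equal vectors always land in the same bucket.
    rng = random.Random(seed)
    parts = []
    for _ in range(num_hashes):
        order = list(range(len(bits)))
        rng.shuffle(order)
        # Visit positions in permuted order and keep the first one holding a 1.
        first_set = next(pos for pos in order if bits[pos] == 1)
        parts.append(str(first_set))
    return "_".join(parts)


bucket = minhash_bucket_id([1, 0, 1, 1, 0, 0, 1, 0], num_hashes=3)
```

Raising `num_hashes` lengthens the ID and produces more, smaller buckets, which mirrors the trade-off between bucket size and clustering recall described above.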

In one embodiment, clustering the data to be reported in the bucket file includes: selecting a piece of unclustered data to be reported, and calculating the similarity between the selected data and the multiple groups of already clustered data clusters to be reported in the same bucket file. When the similarity is greater than a first threshold, the selected data is merged into the similar data cluster to be reported. When the similarity is less than the first threshold, a new data cluster to be reported is created from the selected data.

After the data to be reported is bucketed according to the second feature value, only data in the same bucket file has the opportunity to be compared for similarity; data in different bucket files is regarded as dissimilar. In a distributed environment, the data in one bucket file is assigned to the same distributed computing node (worker). Within a bucket file, all data to be reported is processed one piece at a time. When a piece of data is ready to be clustered, it is first determined whether any of the already formed data clusters to be reported is similar to it. If so, the piece of data is merged into that similar cluster; if not, it is created as a separate new cluster. Once every piece of data has been processed in turn, the clustering of the data to be reported in the bucket file is complete.

Exemplarily, Figure 5 is a flow chart of bucketing and clustering in the data reporting method according to an embodiment of the present application. As shown in Figure 5, the first feature value is generated by splitting the file directory path, and the second feature value is calculated from the first feature value. The data to be reported is divided into buckets 1 to N according to the second feature value, clustering is performed separately within each bucket file, and the results from all bucket files are finally merged to obtain the final clustering result.

In this embodiment, the data in one bucket is assigned to the same distributed computing node for clustering, and a hierarchical clustering method is used to merge data according to whether it is similar, which improves clustering efficiency.
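The per-bucket procedure above can be sketched as a single pass over the records. The passage does not specify the similarity measure here, so this sketch assumes Jaccard similarity over sets of directory names and compares each record with the first member of each cluster; `jaccard`, `cluster_bucket`, and the default threshold are hypothetical names and values. A similarity exactly equal to the threshold is treated as "not similar", which is also an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure, not from the source)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_bucket(records, first_threshold=0.5):
    """Cluster the records of one bucket file, one record at a time.

    records: iterable of sets of directory names.
    Returns a list of clusters; element 0 of each cluster is its representative.
    """
    clusters = []
    for rec in records:
        best, best_sim = None, first_threshold
        for cluster in clusters:
            sim = jaccard(rec, cluster[0])
            if sim > best_sim:  # strictly greater than the first threshold
                best, best_sim = cluster, sim
        if best is not None:
            best.append(rec)  # merge into the most similar existing cluster
        else:
            clusters.append([rec])  # no similar cluster: create a new one
    return clusters


clusters = cluster_bucket(
    [{"var", "www", "a"}, {"var", "www", "b"}, {"opt", "x"}],
    first_threshold=0.4,
)
```

Because each bucket is processed independently, a function of this shape could run on a separate worker per bucket, with the per-bucket results concatenated afterwards.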

In one embodiment, selecting multiple groups of data clusters to be reported according to the scores includes: selecting the N data clusters to be reported with the highest scores for reporting, or selecting the data clusters to be reported whose scores exceed a second threshold for reporting.

After clustering is complete, each data cluster to be reported represents one class. Once the score of each data cluster has been calculated, either of two mechanisms can be used to report clusters according to their scores. One is Top-N ranking, in which the N data clusters with the highest scores are selected for reporting. The other is to set a second threshold, in which case every data cluster whose score exceeds the second threshold is reported.

In this embodiment, clusters are scored according to the proportion of malicious data to decide whether the data is reported, and two selection mechanisms are available for choosing the data to be reported. This facilitates the screening of malicious file data, saves computing resources, and improves the performance of the detection service.
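The two selection mechanisms can be sketched as follows, assuming the scores are already available as a mapping from a cluster identifier to a number. The function names and the sample scores are illustrative only.

```python
def select_top_n(scores, n):
    """Return the IDs of the n highest-scoring clusters, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


def select_above(scores, second_threshold):
    """Return the IDs of all clusters whose score exceeds the second threshold."""
    return [cid for cid, score in scores.items() if score > second_threshold]


scores = {"a": 0.9, "b": 0.2, "c": 0.6}  # hypothetical cluster scores
top_two = select_top_n(scores, 2)
above_half = select_above(scores, 0.5)
```

Top-N selection caps the reporting volume at a fixed number of clusters, whereas threshold selection lets the volume vary with how many clusters score highly.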

In one embodiment, reporting the data cluster to be reported includes: splitting the data to be reported into multiple directory names according to the path of the data to be reported in the data cluster. According to the directory names, regular-expression substitution is performed on the path, and the degree of merging of the substituted paths is calculated. When the degree of merging is lower than a third threshold, regular-expression substitution continues until the degree of merging of the substituted paths is higher than the third threshold. The regular expression of the merged path is extracted, and its coverage of the corresponding data to be reported, as well as its global coverage of all data to be reported in the data clusters, is calculated. A regular expression is selected according to the coverage and the global coverage, and the data cluster to be reported is reported according to the selected regular expression.

A regular expression is extracted from the paths of all samples in each data cluster that has been selected for reporting. First, the directory is split by the path separator (for example, "/") into N directory names. Regular expressions are then computed from the directory names: numbers in the directory are replaced with \d+, and the degree of merging of the resulting full paths is calculated. If, after this replacement, the degree of merging of the sample directories in the data cluster is lower than the third threshold, further substitution is needed; for example, the part following "_" or "-" in any of the N directory names is replaced with \w+, and the full directories of the samples in the cluster are then merged further. After the regular expression is extracted, its coverage of the samples in the current data cluster and its global coverage are calculated. A regular expression with high in-cluster coverage and low global coverage is selected, and the data to be reported is reported according to the selected regular expression.
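The selection by coverage described above can be sketched as follows. The passage does not give a formula for combining the two coverage values, so this sketch assumes the simplest one, in-cluster coverage minus global coverage; `coverage`, `pick_pattern`, and the sample paths are hypothetical.

```python
import re


def coverage(pattern, paths):
    """Fraction of paths that the regular expression matches in full."""
    if not paths:
        return 0.0
    rx = re.compile(pattern)
    return sum(1 for p in paths if rx.fullmatch(p)) / len(paths)


def pick_pattern(candidates, cluster_paths, all_paths):
    """Prefer high in-cluster coverage and low global coverage.

    Assumption: the two values are combined by simple subtraction.
    """
    return max(
        candidates,
        key=lambda pat: coverage(pat, cluster_paths) - coverage(pat, all_paths),
    )


cluster_paths = ["/a/1", "/a/22"]
all_paths = cluster_paths + ["/b/3", "/c/x"]
chosen = pick_pattern([r"/a/\d+", r"/\w+/\w+"], cluster_paths, all_paths)
```

In this example both candidates cover the whole cluster, but the broader pattern also matches every path outside it, so the narrower pattern is chosen.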

Exemplarily, Figure 6 is a diagram of regular expression extraction in the data reporting method according to an embodiment of the present application. As shown in Figure 6, the directory is first split by the path separator (for example, "/") into N directory names. Taking /var/www/sites/exam.com/up/16/propic_0/1209923724 as an example, the split results are var, www, sites, exam.com, up, 16, propic_0, and 1209923724. Regular expressions are computed from the directory names and the numbers in the directory are replaced with \d+: 16 is replaced with \d+, and 1209923724 is replaced with \d+. The degree of merging of the resulting full paths is then calculated. After the first round of merging, the directory is /var/www/sites/exam.com/up/\d+/propic_0/\d+. If, after this replacement, the degree of merging of the sample directories in the cluster is lower than the third threshold, further substitution is needed; for example, the part following "_" or "-" in any of the N directory names is replaced with \w+, and the full directories of the samples in the cluster are then merged further. The final merged directory regular expression is /var/www/sites/exam.com/up/\d+/propic_\w+/\d+.
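The two substitution rounds in this example can be sketched as follows. The "degree of merging" is not defined in this passage, so the sketch assumes it is the share of paths that collapse onto the most common generalized pattern; `generalize`, `merge_paths`, the threshold value, and the second sample path are hypothetical. The sketch also escapes literal directory names with `re.escape`, so its output contains exam\.com where the text above shows exam.com.

```python
import re
from collections import Counter


def generalize(path, level):
    """Turn a path into a regular expression.

    level 1: wholly numeric directory names become \\d+.
    level 2: additionally, the part after the first '_' or '-' becomes \\w+.
    """
    parts = []
    for name in path.strip("/").split("/"):
        if re.fullmatch(r"\d+", name):
            parts.append(r"\d+")
        elif level >= 2 and re.search(r"[_-]", name):
            head, sep, _tail = re.split(r"([_-])", name, maxsplit=1)
            parts.append(re.escape(head) + sep + r"\w+")
        else:
            parts.append(re.escape(name))
    return "/" + "/".join(parts)


def merge_paths(paths, third_threshold=0.8):
    """Generalize further until the assumed degree of merging is high enough."""
    for level in (1, 2):
        counts = Counter(generalize(p, level) for p in paths)
        pattern, hits = counts.most_common(1)[0]
        degree = hits / len(paths)  # assumed definition of "degree of merging"
        if degree >= third_threshold:
            break
    return pattern, degree


paths = [
    "/var/www/sites/exam.com/up/16/propic_0/1209923724",
    "/var/www/sites/exam.com/up/17/propic_1/88",  # hypothetical second sample
]
pattern, degree = merge_paths(paths)
```

With these two samples the first round leaves propic_0 and propic_1 distinct, so the degree of merging stays below the threshold and the second round is applied, after which both paths share one pattern.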

Figure 7 is an overall flow chart of the data reporting method according to an embodiment of the present application. As shown in Figure 7, the data to be reported is first collected from existing files, after which data cleaning and feature extraction are performed. The data to be reported is clustered according to the extracted features and the file paths, and the clustering results are calculated and analyzed. The file data that needs to be reported is selected according to the clustering results; when a file needs to be reported, the regular expression of its reporting path is extracted, and reporting is performed according to that regular expression. This process greatly reduces the reporting of duplicate or similar useless logs, improves the efficiency of data reporting, and reduces the space required to store the reported data, so it is widely applicable. In addition, the data reporting method of the embodiments of the present application can be used not only for reporting file log data, but also for reporting user and operating system log data such as system processes, which is not limited in this application.

It should be understood that although the steps in the flow charts of the above embodiments are shown in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on the order of execution, and the steps may be executed in other orders. Moreover, at least some of the steps in the flow charts of the above embodiments may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different times; nor are they necessarily executed in sequence, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiments of the present application also provide a data reporting device for implementing the data reporting method described above. The solution provided by the device is similar to the one described for the method, so for the specific limitations in the one or more data reporting device embodiments provided below, reference may be made to the limitations on the data reporting method above, which are not repeated here.

In one embodiment, as shown in Figure 8, a data reporting device is provided, including:

an acquisition module 81, configured to obtain the data to be reported of a file;

an extraction module 82, configured to extract the feature value of the data to be reported;

a bucketing module 83, configured to classify the data to be reported into different bucket files for storage according to the feature value;

a clustering module 84, configured to calculate the similarity of the data to be reported in the same bucket file, and to cluster the data to be reported according to the similarity to generate multiple groups of data clusters to be reported, where each data cluster to be reported contains normal data samples and malicious data samples;

a scoring module 85, configured to score each group of data clusters to be reported according to the proportion of normal data samples and malicious data samples in that group, and to select multiple groups of data clusters to be reported for reporting according to the scores.

Each module in the above data reporting device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, the processor of the computer device in hardware form, or may be stored in the memory of the computer device in software form, so that the processor can call them and execute the corresponding operations.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Figure 9. The computer device includes a processor, a memory, an input/output (I/O) interface, and a communication interface. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores the data to be reported. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used to communicate with external terminals over a network connection. When executed by the processor, the computer program implements a data reporting method.

Those skilled in the art will understand that the structure shown in Figure 9 is only a block diagram of the part of the structure related to the solution of the present application, and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, may combine certain components, or may have a different arrangement of components.

In one embodiment, a computer device is provided, including a memory and a processor. A computer program is stored in the memory, and the processor implements the following steps when executing the computer program:

obtaining the data to be reported of a file; extracting the feature value of the data to be reported; classifying the data to be reported into different bucket files for storage according to the feature value; calculating the similarity of the data to be reported in the same bucket file, and clustering the data to be reported according to the similarity to generate multiple groups of data clusters to be reported, where each data cluster to be reported contains normal data samples and malicious data samples; and scoring each group of data clusters to be reported according to the proportion of normal data samples and malicious data samples in that group, and selecting multiple groups of data clusters to be reported for reporting according to the scores.

In one embodiment, the processor also implements the following steps when executing the computer program:

obtaining the data to be reported of a file, where the data to be reported includes file reporting path data and malicious file path data, and the data fields of the file reporting path data and the malicious file path data include the first identification code, file name, file path, directory, and generation time of the file;

using the first identification code as the primary key, extracting the directory of the file; splitting the directory to obtain multiple byte fragments; and calculating the second identification codes of the multiple byte fragments and merging the multiple second identification codes to generate the first feature value of the data to be reported;

randomly shuffling each row of the first feature value multiple times; mapping the first set corresponding to the first feature value obtained after each shuffle to a second set, where no two mapped values in the second set are the same; searching the mapped values in the second set in ascending order until the first feature value corresponding to the found mapped value is the first preset value; obtaining the digit number corresponding to the found mapped value, and merging the multiple digit numbers to obtain the second feature value; and classifying the data to be reported according to the second feature value;

selecting a piece of unclustered data to be reported, and calculating the similarity between the selected data and the multiple groups of already clustered data clusters to be reported in the same bucket file; when the similarity is greater than the first threshold, merging the selected data into the similar data cluster to be reported; and when the similarity is less than the first threshold, creating a new data cluster to be reported from the selected data;

selecting the N data clusters to be reported with the highest scores for reporting, or selecting the data clusters to be reported whose scores exceed the second threshold for reporting;

splitting the data to be reported into multiple directory names according to the path of the data to be reported in the data cluster to be reported; performing regular-expression substitution on the path according to the directory names, and calculating the degree of merging of the substituted paths; when the degree of merging is lower than the third threshold, continuing the regular-expression substitution until the degree of merging of the substituted paths is higher than the third threshold; extracting the regular expression of the merged path, and calculating its coverage of the corresponding data to be reported and its global coverage of all data to be reported in the data clusters to be reported; and selecting a regular expression according to the coverage and the global coverage, and reporting the data cluster to be reported according to the selected regular expression.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When the computer program is executed by a processor, the following steps are implemented:

In one embodiment, the processor also implements the following steps when executing the computer program:

It should be noted that the user information (including but not limited to user equipment information and user personal information) and data (including but not limited to data used for analysis, stored data, and displayed data) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium, and when executed, it may include the processes of the embodiments of the above methods. Any reference to memory, database, or other media used in the embodiments provided in this application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments provided in this application may include at least one of a relational database and a non-relational database. Non-relational databases may include, without limitation, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, or data processing logic devices based on quantum computing.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.

The above embodiments express only several implementations of the present application. Their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art may make several modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this application shall be determined by the appended claims.

Claims

1. A data reporting method, characterized by comprising:

obtaining data to be reported of a file;

extracting a feature value of the data to be reported;

classifying the data to be reported into different bucket files for storage according to the feature value;

calculating the similarity of the data to be reported in the same bucket file, and clustering the data to be reported according to the similarity to generate multiple groups of data clusters to be reported, wherein the data clusters to be reported contain normal data samples and malicious data samples;

scoring each group of the data clusters to be reported according to the proportion of the normal data samples and the malicious data samples in each group of the data clusters to be reported, and selecting multiple groups of the data clusters to be reported for reporting according to the scores.

2. The data reporting method according to claim 1, characterized in that the data to be reported includes file reporting path data and malicious file path data, and the data fields of the file reporting path data and the malicious file path data include the first identification code, file name, file path, directory, and generation time of the file.

3. The data reporting method according to claim 1, characterized in that extracting the feature value of the data to be reported comprises:

using the first identification code as the primary key, extracting the directory of the file;

splitting the directory to obtain multiple byte fragments;

calculating the second identification codes of the multiple byte fragments, and merging the multiple second identification codes to generate the first feature value of the data to be reported.

4. The data reporting method according to claim 3, characterized in that extracting the feature value of the data to be reported further comprises:

randomly shuffling each row of the first feature value multiple times;

mapping the first set corresponding to the first feature value obtained after each shuffle to a second set, wherein no two mapped values in the second set are the same;

searching the mapped values in the second set in ascending order until the first feature value corresponding to the found mapped value is the first preset value;

obtaining the digit number corresponding to the found mapped value, and merging the multiple digit numbers to obtain the second feature value;

classifying the data to be reported according to the second feature value.

5. The data reporting method according to claim 1, characterized in that clustering the data to be reported in the bucket file comprises:

selecting a piece of unclustered data to be reported, and calculating the similarity between the selected data to be reported and the multiple groups of already clustered data clusters to be reported in the same bucket file;

when the similarity is greater than a first threshold, merging the selected data to be reported into the similar data cluster to be reported;

when the similarity is less than the first threshold, creating a new data cluster to be reported from the selected data to be reported.

6. The data reporting method according to claim 1, characterized in that selecting multiple groups of data clusters to be reported for reporting according to the scores comprises:

selecting the N data clusters to be reported with the highest scores for reporting, or selecting the data clusters to be reported whose scores exceed a second threshold for reporting.

7. The data reporting method according to claim 1, characterized in that reporting the data cluster to be reported comprises:

splitting the data to be reported into multiple directory names according to the path of the data to be reported in the data cluster to be reported;

performing regular-expression substitution on the path according to the directory names, and calculating the degree of merging of the substituted paths;

when the degree of merging is lower than a third threshold, continuing the regular-expression substitution on the path until the degree of merging of the substituted paths is higher than the third threshold;

extracting the regular expression of the merged path, and calculating the coverage of the regular expression for the corresponding data to be reported and its global coverage for all the data to be reported in the data clusters to be reported;

selecting the regular expression according to the coverage and the global coverage, and reporting the data cluster to be reported according to the selected regular expression.

8. A data reporting device, characterized by comprising:

an acquisition module, configured to obtain data to be reported of a file;

an extraction module, configured to extract a feature value of the data to be reported;

a bucketing module, configured to classify the data to be reported into different bucket files for storage according to the feature value;

a clustering module, configured to calculate the similarity of the data to be reported in the same bucket file, and to cluster the data to be reported according to the similarity to generate multiple groups of data clusters to be reported, wherein the data clusters to be reported contain normal data samples and malicious data samples;

a scoring module, configured to score each group of the data clusters to be reported according to the proportion of the normal data samples and the malicious data samples in each group of the data clusters to be reported, and to select multiple groups of the data clusters to be reported for reporting according to the scores.

9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the data reporting method according to any one of claims 1 to 7.

10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the data reporting method according to any one of claims 1 to 7.