CN117473094B - Log classification method and system - Google Patents
- Publication number
- CN117473094B (application number CN202311811547.5A)
- Authority
- CN
- China
- Prior art keywords
- log
- classified
- prime
- congruence
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a log classification method and system. The method comprises the following steps: a log acquisition node generates a prime number set and forms it into a prime number matrix; a congruence operation is performed on the logs to be classified using the prime number matrix; the logs to be classified, the prime matrix and the congruence operation result are transmitted to a storage node for storage; the storage node initiates a classification request to a cloud service node and uploads the logs to be classified, the prime matrix and the congruence operation result to the cloud service node using an encryption algorithm; the cloud service node divides the logs to be classified according to the prime matrix and the congruence operation result; and the division results are clustered using a clustering algorithm. Computing tasks are thereby balanced and shared across the nodes, a simple encryption algorithm replaces a complex one for communication, and clustering is performed on the congruence operation results, which improves the accuracy of classifying new patterns and effectively reduces the computational load. The application thus addresses the technical problems of low accuracy in classifying new patterns and high computational load.
Description
Technical Field
The application relates to the field of network security, in particular to a log classification method and system.
Background
In the field of network security, analyzing and generalizing the logs or alarms generated by various systems, devices or products is one of the most fundamental capabilities. Whether logs can be reasonably classified or clustered determines whether they can be correctly parsed and generalized and whether key information can be reliably provided to the user. Related research and methods therefore continue to be proposed in this field, focusing mainly on the automatic classification of log information.
At present, general automatic log classification mainly applies the TF-IDF method to the text after word segmentation. Although TF-IDF-based methods are simple and fast, they cannot reflect the word-order relationship between segmented words, which may lead to poor results and misclassification. A more accurate approach is to vectorize the word segmentation results and use hidden Markov models, recurrent neural networks or long short-term memory networks to classify logs of different possible patterns.
In addition, in current environments, log classification requires a large amount of computation, while the distributed acquisition devices that directly collect log information have limited computing power. Such devices can only process logs according to specific, already organized generalization models, and classifying a new, historically unrecognized pattern directly on them would impair classification accuracy. Classification is therefore generally performed on other dedicated nodes with greater computing power, which may be local to the user or in the cloud. However, the computing tasks remain overly concentrated, and a high-strength encryption algorithm is used during transmission, so the load on both the cloud and the acquisition devices is heavy and practical needs cannot be met.
No effective solution has yet been proposed for the problems of low accuracy in classifying new patterns and high computational load in the related art.
Disclosure of Invention
The main purpose of the application is to provide a log classification method and system that solve the problems of low accuracy in classifying new patterns and high computational load.
To achieve the above object, according to one aspect of the present application, there is provided a log classification method.
The log classification method according to the application comprises the following steps: the log acquisition node generates a prime number set and forms the prime number set into a prime number matrix; performing congruence operation on the logs to be classified by using the prime number matrix; transmitting the logs to be classified, the prime matrix and the congruence operation result to a storage node for storage; the storage node initiates a classification request to the cloud service node, and uploads a log to be classified, a prime matrix and a congruence operation result to the cloud service node by using an encryption algorithm; the cloud service node divides the logs to be classified according to the prime matrix and the congruence operation result; and clustering the division results by using a clustering algorithm.
Further, obtaining the log to be classified includes: the log acquisition node performs word segmentation on the original log and removes stop words to obtain the log to be classified.
Further, performing congruence operation on the log to be classified by using the prime matrix further comprises: performing finite field number theory inverse operation on the congruence operation result to generate check data; transmitting the logs to be classified, the prime number matrix, the congruence operation result and the check data to a storage node for storage; the storage node initiates a classification request to the cloud service node, and uploads a log to be classified, a prime matrix, a congruence operation result and check data to the cloud service node by using an encryption algorithm; the cloud service node performs the inverse verification of the number theory of the logs to be classified based on the verification data; dividing the logs to be classified which pass verification according to the congruence operation result; and clustering the division results by using a clustering algorithm.
Further, the prime number sets are regenerated at a fixed frequency.
Further, performing the congruence operation on the log to be classified by using the prime matrix comprises: defining a prime matrix $P = \{p_{ij}\}$; defining the log to be classified $W = \{w_i\}$ and converting it into a full string set $S = \{c_i\}$; grouping the character strings in $S$ into 16-byte packets, and padding any packet shorter than 16 bytes with a preset character string; for one packet in $S$, performing the congruence operation with column $\{p_{\cdot,j}\}$, and then, for the next packet, performing the congruence operation with column $\{p_{\cdot,(j+1) \bmod n}\}$.
Further, the clustering algorithm is a DBSCAN algorithm, a cosine similarity algorithm or a Jaccard similarity algorithm.
Further, after the division results are clustered using the clustering algorithm, the method further comprises: the cloud service node transmits the clustering operation result back to the storage node using an encryption algorithm; and the storage node stores data according to the clustering operation result.
Further, the columns of the prime matrix are longitudinal congruence operation factors, used to perform simple congruence calculations with different moduli on the same block of content to be calculated; the rows of the prime matrix are transverse congruence calculation segments, used to perform calculations on different blocks of content to be calculated.
To achieve the above object, according to another aspect of the present application, there is provided a log classification system.
The log classification system according to the present application includes: the log acquisition node is used for generating a prime number set and forming the prime number set into a prime number matrix; performing congruence operation on the logs to be classified by using the prime number matrix; transmitting the logs to be classified, the prime matrix and the congruence operation result to a storage node for storage; the storage node is used for initiating a classification request to the cloud service node, and uploading the log to be classified, the prime matrix and the congruence operation result to the cloud service node by using an encryption algorithm; the cloud service node is used for dividing logs to be classified according to the prime matrix and the congruence operation result; and clustering the division results by using a clustering algorithm.
Further, the log acquisition node is further used for carrying out finite field number theory inverse operation on the congruence operation result to generate check data; transmitting the logs to be classified, the prime number matrix, the congruence operation result and the check data to a storage node for storage; the storage node is also used for initiating a classification request to the cloud service node, and uploading the log to be classified, the prime number matrix, the congruence operation result and the check data to the cloud service node by using an encryption algorithm; the cloud service node is further used for carrying out the inverse verification of the number theory of the logs to be classified based on the verification data; dividing the logs to be classified which pass verification according to the congruence operation result; and clustering the division results by using a clustering algorithm.
In the embodiments of the application, a multi-level log classification architecture in which log acquisition nodes, storage nodes and cloud service nodes cooperate is adopted, so that computing tasks can be balanced and shared. Congruence operations with several different prime moduli are applied to the logs to be classified, so that a simple encryption algorithm can replace a complex one for communication, and clustering is performed on the congruence operation results. This achieves the technical effects of improving the accuracy of classifying new patterns and effectively reducing the computational load, thereby solving the technical problems of low accuracy in classifying new patterns and high computational load.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, are included to provide a further understanding of the application and to make its other features, objects and advantages more apparent. The drawings of the illustrative embodiments and their descriptions serve to explain the application and are not to be construed as unduly limiting it. In the drawings:
FIG. 1 is a flow diagram of a log classification method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a log classification device according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art may better understand the solution of the present application, the technical solutions in the embodiments are described in detail below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the present application described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the present application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal" and the like indicate an azimuth or a positional relationship based on that shown in the drawings. These terms are only used to better describe the present invention and its embodiments and are not intended to limit the scope of the indicated devices, elements or components to the particular orientations or to configure and operate in the particular orientations.
Also, some of the terms described above may be used to indicate other meanings in addition to orientation or positional relationships, for example, the term "upper" may also be used to indicate some sort of attachment or connection in some cases. The specific meaning of these terms in the present invention will be understood by those of ordinary skill in the art according to the specific circumstances.
Furthermore, the terms "mounted," "configured," "provided," "connected," "coupled," and "sleeved" are to be construed broadly. For example, it may be a fixed connection, a removable connection, or a unitary construction; may be a mechanical connection, or an electrical connection; may be directly connected, or indirectly connected through intervening media, or may be in internal communication between two devices, elements, or components. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
According to an embodiment of the present invention, there is provided a log classification method, as shown in fig. 1, including steps S101 to S106 as follows:
Step S101, the log acquisition node generates a prime number set and forms the prime number set into a prime number matrix;
Specifically, the log acquisition node automatically generates a prime number set and regenerates it at a fixed frequency, that is, at intervals such as hourly or daily, depending on the user's requirements for security strength. This can prevent differential attacks to some extent.
The generated prime number set is arranged into an m × n matrix, whose columns are called longitudinal congruence operation factors and whose rows are called transverse congruence calculation segments. To speed up calculation, the numbers of rows and columns should not be too large, for example m = 3 and n = 3. The selected primes should also not be too large; in this patent they are specified to be within $2^{32}$.
Note that primes are chosen as moduli because the finite-field property of the residue system modulo a prime allows the calculation results to be distributed relatively uniformly. Quadratic residues could also be used, but the amount of calculation would be large and the resulting distribution unbalanced.
The acquisition node generates a set of prime moduli (to ensure reasonable calculation efficiency, the primes should not be too large) and organizes it into an m × n matrix. Each column is called a longitudinal congruence operation factor; its function is to perform simple congruence calculations with different moduli on the same block of content to be calculated, which ensures clustering accuracy to a certain extent. Each row is called a transverse congruence calculation segment; its function is mainly to perform calculations on different blocks of content to be calculated, which prevents differential attacks to a certain extent. Preventing differential attacks here means that even if an attacker obtains a decrypted congruence calculation result (the result obtained from the prime number set and the logs to be classified), or even if that information is not encrypted at all, the attacker cannot use known information to deduce the original information by calculation.
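As an illustration of the prime matrix construction described above, the following is a minimal Python sketch and not the patented implementation. The function names, the lower bound of 2**16 for the primes, and the use of Python's `random` module are assumptions made for the example; a deployment concerned with security strength would more plausibly draw candidates from a cryptographically secure source. The primality test is a deterministic Miller-Rabin test with bases 2, 7 and 61, which is sufficient for integers below $2^{32}$.

```python
import random

def is_prime_u32(n: int) -> bool:
    """Deterministic Miller-Rabin test, valid for n < 2**32 (bases 2, 7, 61)."""
    if n < 2:
        return False
    # Trial division by a few small primes also handles n equal to a base.
    for p in (2, 3, 5, 7, 11, 13, 61):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for a in (2, 7, 61):
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True

def random_prime(rng: random.Random, low: int = 2**16, high: int = 2**32) -> int:
    """Draw a random prime in [low, high); the bounds are assumptions."""
    while True:
        candidate = rng.randrange(low, high) | 1  # force the candidate to be odd
        if is_prime_u32(candidate):
            return candidate

def make_prime_matrix(m: int = 3, n: int = 3, seed=None):
    """Return an m x n matrix (list of rows) of primes below 2**32.

    The primes are not required to be pairwise distinct.
    """
    rng = random.Random(seed)
    return [[random_prime(rng) for _ in range(n)] for _ in range(m)]

if __name__ == "__main__":
    for row in make_prime_matrix(3, 3):
        print(row)
```

Regenerating the matrix at a fixed frequency then amounts to calling `make_prime_matrix` again on a timer and attaching the current matrix to every batch of results.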
The embodiments of the application target only the classification of unrecognized patterns, which simplifies the handling of unidentified and new types of logs in the system. Otherwise, the personnel who write generalization scripts face a large amount of unclassified information and can hardly cover newly appearing log types in one pass. It should be noted that this classification is unsupervised.
It should be understood that generalizable logs are further generalized by the deployed log acquisition nodes after collection. Because generalization depends on patterns that have been identified historically, logs of identified patterns are typically processed on a regular basis. Such patterns are based on historical accumulation: the product provider performs a detailed analysis of historically obtained logs and extracts relevant classification information and other content, such as TCP/IP five-tuples, user information and file access information.
It will be appreciated by those skilled in the art that a prime number is an integer greater than 1 that is divisible only by 1 and itself.
It will be appreciated by those skilled in the art that congruence means that, given a positive integer m, if two integers a and b are such that a - b is divisible by m, i.e. (a - b)/m is an integer, then a is said to be congruent to b modulo m, denoted a ≡ b (mod m). Congruence modulo m is an equivalence relation on the integers.
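The definition of congruence can be checked numerically. The short Python sketch below is only an illustration; the function name `is_congruent` is chosen for this example.

```python
def is_congruent(a: int, b: int, m: int) -> bool:
    """Return True when a ≡ b (mod m), i.e. when m divides a - b."""
    if m <= 0:
        raise ValueError("the modulus m must be a positive integer")
    return (a - b) % m == 0

if __name__ == "__main__":
    print(is_congruent(17, 5, 12))   # 17 - 5 = 12 is divisible by 12
    print(is_congruent(17, 6, 12))   # 17 - 6 = 11 is not divisible by 12
```

Two integers are congruent modulo m exactly when they leave the same remainder on division by m, which is why the residue `a % m` can stand in for `a` in the steps that follow.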
Step S102, congruence operation is carried out on the logs to be classified by utilizing a prime matrix;
Specifically, performing the congruence operation on the logs to be classified by using the prime matrix comprises:
defining a prime matrix $P = \{p_{ij}\}$; the number of primes is the product of the number of rows and the number of columns, and in matrix form: $P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m1} & p_{m2} & \cdots & p_{mn} \end{pmatrix}$
these prime numbers are not necessarily all different;
defining the log to be classified $W = \{w_i\}$ and converting it into a full string set $S = \{c_i\}$, where the string length is denoted $\ell$;
Grouping the character strings in the S according to 16 bytes, and filling character strings with less than 16 bytes through preset character strings;
for one packet in $S$, the congruence operation is performed with each prime of column $\{p_{\cdot,j}\}$ and the results are concatenated; that is, the value space of one packet shrinks from $2^{128}$ to a space determined by the $m$ primes of that column (the prime matrix has $m$ rows);
for the next packet, the congruence operation is performed with column $\{p_{\cdot,(j+1) \bmod n}\}$; all packets are traversed cyclically in this way, so that successive packets use different columns of the prime matrix.
The processing of one packet is as follows (the original packet information is denoted $\mathrm{seg}_k$, and $j$ is the column used for that packet):
$\mathrm{seg}_k \equiv a_{1k} \pmod{p_{1j}}$
$\mathrm{seg}_k \equiv a_{2k} \pmod{p_{2j}}$
…
$\mathrm{seg}_k \equiv a_{mk} \pmod{p_{mj}}$
Finally, the packet $\mathrm{seg}_k$ is represented as $a_{1k} a_{2k} \ldots a_{mk}$. Other packets are handled similarly, except that the primes of a different column of the matrix are used;
after overall processing, all packets $\mathrm{seg}_1 \mathrm{seg}_2 \ldots \mathrm{seg}_K$ are converted into $a_{11} a_{21} \ldots a_{m1} \ldots a_{1K} a_{2K} \ldots a_{mK}$, where $K$ is the number of packets:
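$K = \ell / 16$ if $16 \mid \ell$,
$K = [\ell / 16] + 1$ if $16 \nmid \ell$, where $\ell$ is the string length in bytes and $\nmid$ indicates that it is not divisible;
'[ ]' is a rounding operation (the integer part).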
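The packet-wise congruence operation of step S102 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the padding byte (a zero byte), the big-endian interpretation of each 16-byte packet as an integer, the starting column 0, and the function names are all choices made for the example. The text only specifies a preset padding string and a cyclic traversal of the columns.

```python
from typing import List, Sequence

BLOCK = 16  # packet size in bytes, as specified in the text

def split_packets(data: bytes, pad: bytes = b"\x00") -> List[bytes]:
    """Split data into 16-byte packets, padding the last one if needed.

    The single padding byte is an assumption; the text only says that a
    preset character string is used.
    """
    if len(pad) != 1:
        raise ValueError("pad must be exactly one byte in this sketch")
    packets = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    if packets and len(packets[-1]) < BLOCK:
        packets[-1] = packets[-1].ljust(BLOCK, pad)
    return packets

def congruence_encode(data: bytes, P: Sequence[Sequence[int]]) -> List[List[int]]:
    """Map packet k to its residues modulo the primes of column k mod n.

    P is an m x n prime matrix given as a sequence of rows. The result has
    one list [a_1k, ..., a_mk] per packet.
    """
    m, n = len(P), len(P[0])
    encoded = []
    for k, seg in enumerate(split_packets(data)):
        j = k % n                             # cycle through the columns
        value = int.from_bytes(seg, "big")    # packet as a 128-bit integer
        encoded.append([value % P[i][j] for i in range(m)])
    return encoded

if __name__ == "__main__":
    # Small primes are used here only to keep the printed output readable.
    P = [[7, 11, 13], [17, 19, 23], [29, 31, 37]]
    print(congruence_encode(b"failed password for root", P))
```

Because packet k always uses column k mod n, two logs that share a 16-byte packet at the same position produce identical residue groups for it, which is what makes similarity comparison on the encoded form possible.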
Step S103, transmitting the log to be classified, the prime matrix and the congruence operation result to a storage node for storage;
The log acquisition node transmits the congruence operation result, the prime matrix and the original content of the logs to be classified to the storage node, which stores the data.
Step S104, the storage node initiates a classification request to the cloud service node, and the log to be classified, the prime matrix and the congruence operation result are uploaded to the cloud service node by using an encryption algorithm;
The storage node initiates a classification request task to the cloud service node and, at the same time, uploads the logs to be classified, the prime matrix and the congruence operation result to the cloud service node using a simple encryption algorithm. The task uploaded by the storage node may include data from several different log acquisition nodes.
It should be understood that, because congruence operations with several different prime moduli have already been applied to the words to be classified, network transmission of the prime modulus set can use a general commercial-cipher or national-cipher algorithm, which requires far less computation than other encryption algorithms such as 3DES, AES or homomorphic encryption (a common homomorphic encryption method being the Paillier algorithm).
Step S105, the cloud service node divides the logs to be classified according to the prime matrix and the congruence operation result;
Step S106, clustering operation is performed on the division results by using a clustering algorithm.
The data are divided according to the uploaded task and its prime matrix: data to be classified that share the same prime matrix form one classification calculation task. The classification task clusters the data using algorithms such as DBSCAN, cosine similarity or Jaccard similarity, and data whose similarity exceeds a set threshold (for example 85%) are grouped into one class. The prime matrix, rather than an acquisition node identifier, must be uploaded because the prime matrix may change dynamically.
It should be appreciated that a three-level structure is used to classify the original log information, comprising a log acquisition component, an intermediate storage node and a cloud log classification calculation node. The log acquisition component and the intermediate storage node are deployed in the user's internal environment, while the cloud log classification calculation node is deployed outside the user's environment. Each acquisition node generates its own prime moduli, so the prime modulus sets generated by different acquisition nodes differ, and the intermediate storage node forwards them to the cloud log classification calculation node.
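The division by prime matrix and the threshold-based grouping can be sketched as below. This Python example is an assumption-laden illustration: it uses Jaccard similarity over the residue groups of each log together with a simple greedy, representative-based grouping, which is not DBSCAN and is not claimed to be the clustering procedure of the patent. The function names and the data layout (a task is a pair of prime matrix and encoded log) are hypothetical.

```python
from collections import defaultdict

def jaccard(a, b) -> float:
    """Jaccard similarity between two sequences of residue groups."""
    sa, sb = set(map(tuple, a)), set(map(tuple, b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def partition_by_matrix(tasks):
    """Group encoded logs by the prime matrix that produced them.

    tasks is an iterable of (prime_matrix, encoded_log) pairs; logs with the
    same matrix end up in the same classification task.
    """
    groups = defaultdict(list)
    for matrix, encoded in tasks:
        key = tuple(tuple(row) for row in matrix)  # hashable form of the matrix
        groups[key].append(encoded)
    return dict(groups)

def greedy_cluster(items, threshold: float = 0.85):
    """Put each item into the first cluster whose representative (its first
    member) has similarity above the threshold; otherwise open a new cluster."""
    clusters = []
    for item in items:
        for members in clusters:
            if jaccard(item, members[0]) > threshold:
                members.append(item)
                break
        else:
            clusters.append([item])
    return clusters

if __name__ == "__main__":
    log_a = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12), (13, 14)]
    log_b = log_a + [(99, 99)]   # shares 7 of 8 residue groups with log_a
    log_c = [(100, 1)]
    print(greedy_cluster([log_a, log_b, log_c]))
```

Keying the partition on the matrix itself rather than on a node identifier follows the remark above that the prime matrix may change dynamically.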
From the above description, it can be seen that the following technical effects are achieved:
In the embodiments of the application, a multi-level log classification architecture in which log acquisition nodes, storage nodes and cloud service nodes cooperate is adopted, so that computing tasks can be balanced and shared. Congruence operations with several different prime moduli are applied to the logs to be classified, so that a simple encryption algorithm can replace a complex one for communication, and clustering is performed on the congruence operation results. This achieves the technical effects of improving the accuracy of classifying new patterns and effectively reducing the computational load, thereby solving the technical problems of low accuracy in classifying new patterns and high computational load.
According to an embodiment of the present invention, preferably, the obtaining of the log to be classified includes:
The log acquisition node performs word segmentation on the original log and removes stop words to obtain the log to be classified.
The log acquisition node segments the original log using a word segmentation algorithm based on a dictionary (containing Chinese and English) and removes stop words. The stop words are mainly punctuation marks, common and meaningless digit strings, and month information (such as Jan and Feb) in the log; they are removed and not included in the final segmentation result. In general, only dictionary-based words (Chinese included) are kept.
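The preprocessing step can be illustrated with a simplified Python sketch. It is an assumption-based stand-in for the dictionary-based segmentation described above: it handles only ASCII text with a regular expression, treats digit strings and English month abbreviations as stop words, and optionally filters against a caller-supplied dictionary. Chinese text would need a real dictionary-based segmenter, which is not shown. The names `to_classify`, `MONTHS` and the sample log line are hypothetical.

```python
import re

# English month abbreviations treated as stop words (assumed list).
MONTHS = {"jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"}

# Alphabetic runs or digit runs; punctuation is dropped by not matching it.
_TOKEN = re.compile(r"[A-Za-z]+|\d+")

def to_classify(raw: str, dictionary=None):
    """Tokenize a raw log line and remove stop words.

    When a dictionary (a set of lowercase words) is given, only tokens found
    in it are kept.
    """
    tokens = []
    for tok in _TOKEN.findall(raw):
        low = tok.lower()
        if low.isdigit() or low in MONTHS:
            continue  # digit strings and month names are stop words
        if dictionary is not None and low not in dictionary:
            continue
        tokens.append(low)
    return tokens

if __name__ == "__main__":
    line = "Jan 12 10:33:01 sshd[4211]: Failed password for root"
    print(to_classify(line))
```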
According to an embodiment of the present invention, preferably, performing congruence operation on the log to be classified by using the prime matrix further includes:
performing finite field number theory inverse operation on the congruence operation result to generate check data;
transmitting the logs to be classified, the prime number matrix, the congruence operation result and the check data to a storage node for storage;
the storage node initiates a classification request to the cloud service node, and uploads a log to be classified, a prime matrix, a congruence operation result and check data to the cloud service node by using an encryption algorithm;
the cloud service node performs the inverse verification of the number theory of the logs to be classified based on the verification data;
dividing the logs to be classified which pass verification according to the congruence operation result;
and clustering the division results by using a clustering algorithm.
The log acquisition node generates check data with a certain low probability (such as 1%), chosen according to the data scale. The check data are generated by the finite-field number-theoretic inverse operation, as follows:
select a congruence calculation result of the data to be classified and compute the number-theoretic inverse of a packet.
The log acquisition node transmits the congruence operation result, the prime matrix, the check data and the original content of the logs to be classified to the storage node, which stores the data.
The storage node initiates a classification request task to the cloud service node and, at the same time, uploads the logs to be classified, the prime matrix, the check data and the congruence operation result to the cloud service node using a simple encryption algorithm. The task uploaded by the storage node may include data from several different log acquisition nodes.
After receiving the data, the cloud service node verifies the data for which check data are available. Once the data pass the number-theoretic inverse verification, the node divides the data to be classified according to the uploaded task and its prime matrix: data to be classified that share the same prime matrix form one classification calculation task. The classification task clusters the data using algorithms such as DBSCAN, cosine similarity or Jaccard similarity, and data whose similarity exceeds a set threshold (for example 85%) are grouped into one class. The prime matrix, rather than an acquisition node identifier, must be uploaded because the prime matrix may change dynamically.
Using the prime matrix and its inverse operation on the subset of data that carries check data, the cloud service node can verify that the data to be processed have not been tampered with. Because the inverse calculation with the prime matrix is not computationally expensive, the calculation load is further reduced while data security is ensured.
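One way to read the inverse-based check is sketched below in Python. The exact check formula is not recoverable from the text, so this is an assumed interpretation: for a sampled packet, the check data are the modular inverses of its residues modulo the primes of the column used, and verification confirms that each residue multiplied by its inverse is 1 modulo the prime. A zero residue has no inverse and is marked with `None`. The function names are hypothetical, and `pow(a, -1, p)` requires Python 3.8 or later.

```python
from typing import List, Optional, Sequence

def make_check_data(residues: Sequence[int],
                    primes: Sequence[int]) -> List[Optional[int]]:
    """Modular inverse of each residue modulo its prime (assumed check scheme).

    A residue that is 0 modulo its prime has no inverse and is recorded as None.
    """
    return [pow(a, -1, p) if a % p else None
            for a, p in zip(residues, primes)]

def verify_check_data(residues: Sequence[int],
                      primes: Sequence[int],
                      check: Sequence[Optional[int]]) -> bool:
    """Return True when every residue is consistent with its check value."""
    for a, p, inv in zip(residues, primes, check):
        if inv is None:
            if a % p != 0:
                return False
        elif (a * inv) % p != 1:
            return False
    return True

if __name__ == "__main__":
    column = (7, 17, 29)       # primes of the column used for this packet
    residues = [3, 5, 0]
    check = make_check_data(residues, column)
    print(check, verify_check_data(residues, column, check))
    print(verify_check_data([4, 5, 0], column, check))  # altered residue
```

Each verification costs one modular multiplication per residue, which is consistent with the statement that the inverse-based check adds little calculation load. Sampling about 1% of the packets, as the text suggests, would be decided before `make_check_data` is called.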
According to an embodiment of the present invention, preferably, after the division results are clustered using the clustering algorithm, the method further includes:
the cloud service node transmits the clustering result back to the storage node using an encryption algorithm;
and the storage node stores the data according to the clustering result.
If classification nodes are deployed in the cloud, especially on a public cloud, sensitive information may leak: even if a strong encryption algorithm is used during transmission, the complete information is stored in the cloud, and once a cloud host is compromised the sensitive log information may be exposed. To address this, the cloud computing node classifies the information that has undergone congruence calculation according to the classification task and the prime modulus matrix it received, and returns the classification result to the intermediate storage node. The cloud computing node does not persist the classified data on any cloud storage resource, which avoids log leakage caused by a compromised cloud host.
In addition, the classification algorithm of this patent operates only on the common pattern information shared by different original logs, so a leak of the cloud data would have no substantial impact.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in a different order.
According to an embodiment of the present invention, there is also provided a system for implementing the above log classification method, as shown in fig. 2, the system includes:
the log acquisition node is used for generating a prime number set and forming the prime number set into a prime number matrix; performing congruence operation on the logs to be classified by using the prime number matrix; transmitting the logs to be classified, the prime matrix and the congruence operation result to a storage node for storage;
Specifically, the log acquisition node can generate the prime number set automatically and regenerate it at a fixed frequency, that is, at intervals such as hourly or daily, depending on the user's security requirements. This helps prevent differential attacks to some extent.
The generated prime number set is arranged into an m × n matrix, whose columns are called longitudinal congruence operation factors and whose rows are called transverse congruence calculation segments. To speed up the calculation, the numbers of rows and columns should not be too large, for example m = 3 and n = 3. The primes themselves should also not be too large; this patent specifies primes below $2^{32}$.
Primes are chosen as moduli because the residue system modulo a prime is a finite field, so the calculation results are distributed relatively uniformly. Quadratic residues could also be used, but they require more computation and give an unbalanced distribution.
The acquisition node generates a set of prime moduli (kept small to preserve calculation efficiency) and organizes it into an m × n matrix. Each column is called a longitudinal congruence operation factor: it applies simple congruence calculations with different moduli to the same block of content, which helps preserve clustering accuracy to a certain extent. Each row is called a transverse congruence calculation segment: it is mainly used to calculate different blocks of content, which helps prevent differential attacks to a certain extent. Preventing a differential attack here means that even if an attacker obtains the decrypted congruence calculation result (the result computed later from the prime number set and the log to be classified), the attacker cannot use known information to deduce the original content from it by calculation, even if the information is not encrypted.
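As an illustration of the matrix construction described above, here is a minimal Python sketch that draws random primes below $2^{32}$ and arranges them into an m × n matrix. The function names (`is_prime`, `random_prime`, `prime_matrix`) and the choice of a Miller-Rabin test with a cryptographic random source are assumptions made for the example; the patent does not state how the primes are drawn.

```python
import secrets

def is_prime(n: int) -> bool:
    """Deterministic Miller-Rabin; bases 2, 7, 61 suffice for n < 2**32."""
    if n < 2:
        return False
    for q in (2, 3, 5, 7, 61):
        if n % q == 0:
            return n == q
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in (2, 7, 61):
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False  # no square reached n - 1, so n is composite
    return True

def random_prime(limit: int = 2**32) -> int:
    """Draw a random odd prime below limit."""
    while True:
        candidate = (secrets.randbelow(limit - 3) + 3) | 1
        if candidate < limit and is_prime(candidate):
            return candidate

def prime_matrix(m: int = 3, n: int = 3) -> list[list[int]]:
    """m rows (transverse segments) by n columns (longitudinal factors)."""
    return [[random_prime() for _ in range(n)] for _ in range(m)]
```

Regenerating the set at a fixed frequency then amounts to calling `prime_matrix()` again on a timer chosen by the operator.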
This embodiment targets only the classification of logs with unrecognized patterns, which simplifies the handling of unidentified and newly appearing log types in the system. Otherwise, the people who write generalization scripts face a large amount of unclassified information and can hardly cover the newly appearing log types in one pass. Note that this classification is unsupervised.
It should be understood that, for logs that can be generalized, the deployed log acquisition node further generalizes the collected logs. Because generalization depends on patterns that have already been identified, logs with identified patterns are typically processed on a regular basis. Such patterns come from historical accumulation: the product provider analyzes previously obtained logs in detail and extracts the relevant classification information and other content, such as TCP/IP five-tuples, user information and file access information.
It will be appreciated by those skilled in the art that a prime number is an integer greater than 1 that is divisible only by 1 and itself.
It will be appreciated by those skilled in the art that congruence is defined as follows: given a positive integer m, if two integers a and b are such that a − b is divisible by m, i.e., (a − b)/m is an integer, then a is said to be congruent to b modulo m, denoted a ≡ b (mod m). Congruence modulo m is an equivalence relation on the integers.
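The definition can be checked directly with a few lines of Python. The helper name `congruent` is introduced here purely for illustration.

```python
def congruent(a: int, b: int, m: int) -> bool:
    """True when a ≡ b (mod m), i.e. when m divides a - b."""
    if m <= 0:
        raise ValueError("the modulus m must be a positive integer")
    return (a - b) % m == 0
```

For example, 17 and 5 are congruent modulo 12 because 17 − 5 = 12 is divisible by 12, whereas 17 and 6 are not.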
The method for carrying out congruence operation on the logs to be classified by using the prime number matrix comprises the following steps:
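define a prime matrix $P = \{p_{ij}\}$ with $m$ rows and $n$ columns; the number of primes is the product of the number of rows and the number of columns, $m \times n$, and in matrix form:
$$P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m1} & p_{m2} & \cdots & p_{mn} \end{pmatrix}$$
these prime numbers are not necessarily all different;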
define the log to be classified $W = \{w_i\}$ and convert it into a full string set $S = \{c_i\}$, whose length in bytes is written $|S|$ here;
group the strings in $S$ into packets of 16 bytes, padding any packet shorter than 16 bytes with a preset string;
for one packet in $S$, perform the congruence operation with every prime of column $j$, $\{p_{\cdot,j}\}$, and concatenate the results, so that the value space of one packet changes from $2^{128}$ to one determined by the $m$ primes of that column (the prime matrix has $m$ rows, so each column holds $m$ primes);
for the next packet, perform the congruence operation with column $\{p_{\cdot,(j+1) \bmod n}\}$, and so on, so that all packets are traversed cyclically and successive packets are processed with different columns of the prime matrix.
The following is an example of processing one packet (the original packet content is denoted $seg_k$, and $j$ is the column used for this packet):
$seg_k \equiv a_{1k} \pmod{p_{1j}}$
$seg_k \equiv a_{2k} \pmod{p_{2j}}$
…
$seg_k \equiv a_{mk} \pmod{p_{mj}}$
finally, the packet $seg_k$ is represented as $a_{1k} a_{2k} \ldots a_{mk}$; the other packets are handled in the same way, except that the primes of a different column of the matrix are used;
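after all packets are processed, $seg_1 seg_2 \ldots seg_K$ is converted into $a_{11} a_{21} \ldots a_{m1} \ldots a_{1K} a_{2K} \ldots a_{mK}$, where $K$ is the number of packets and $|S|$ is the length of $S$ in bytes:
$K = [|S|/16]$, if $16 \mid |S|$;
$K = [|S|/16] + 1$, if $16 \nmid |S|$, where $\nmid$ indicates that it is not divisible;
'[ ]' is a rounding (integer part) operation.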
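The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the string is assumed to be UTF-8 encoded, each 16-byte packet is read as a big-endian integer, the padding is a single repeated byte, and the first packet uses column 0. The name `encode` and its parameters are hypothetical.

```python
BLOCK = 16  # packet size in bytes, as stated in the text

def encode(text: str, P: list[list[int]], pad: bytes = b"\x00") -> list[tuple[int, ...]]:
    """Map each 16-byte packet to its residues under one column of P.

    Packet k uses column k mod n, so successive packets cycle through
    the columns of the prime matrix.
    """
    data = text.encode("utf-8")
    if len(data) % BLOCK:
        # Pad the last packet up to 16 bytes with the preset byte.
        data += pad * (BLOCK - len(data) % BLOCK)
    n = len(P[0])
    result = []
    for k in range(len(data) // BLOCK):
        seg = int.from_bytes(data[k * BLOCK:(k + 1) * BLOCK], "big")
        col = k % n
        # One residue per row: (a_1k, a_2k, ..., a_mk).
        result.append(tuple(seg % row[col] for row in P))
    return result
```

A 20-byte log, for instance, yields two packets: the first is reduced with column 0 and the second, after padding, with column 1.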
The log acquisition node transmits the processed congruence operation result, prime matrix and the original content of the log to be classified to the storage node, and the storage node stores the data.
The storage node is used for initiating a classification request to the cloud service node, and uploading the log to be classified, the prime matrix and the congruence operation result to the cloud service node by using an encryption algorithm;
The storage node initiates a classification task request to the cloud service node and, at the same time, uploads the log to be classified, the prime matrix and the congruence operation result to the cloud service node using a simple encryption algorithm. A task uploaded by the storage node may include data from several different log acquisition nodes.
It should be understood that, because several different prime moduli are used to apply congruence operations to the words to be classified, the prime modulus set can be sent over the network using a general commercial cryptographic or national cryptographic algorithm. This approach is much less computationally intensive than other encryption algorithms such as 3DES, AES or homomorphic encryption (a common homomorphic encryption method being the Paillier algorithm).
The cloud service node is used for dividing logs to be classified according to the prime matrix and the congruence operation result; and clustering the division results by using a clustering algorithm.
The division is made according to the upload task and its prime matrix: data to be classified that share the same prime matrix form one classification computing task. Each task clusters its data using a clustering algorithm such as DBSCAN, or clustering based on cosine similarity or Jaccard similarity, and items whose similarity exceeds a set threshold (for example 85%) are assigned to the same class. The prime matrix, rather than the acquisition node identifier, must be uploaded because the prime matrix may change dynamically.
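The division and clustering step can be sketched as below. This is a simplified stand-in, not DBSCAN: it groups uploads by prime matrix and then applies a greedy single-pass clustering that compares each encoded log with the first member of every existing cluster using Jaccard similarity. The function names and the representation of an encoded log as a list of residue tuples are assumptions made for the example.

```python
from collections import defaultdict

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def group_by_matrix(tasks):
    """tasks: iterable of (matrix, encoded_log) pairs.

    Returns {matrix_key: [encoded_log, ...]}; logs sharing a prime matrix
    end up in the same classification task.
    """
    groups = defaultdict(list)
    for matrix, encoded in tasks:
        key = tuple(tuple(row) for row in matrix)
        groups[key].append(encoded)
    return dict(groups)

def cluster(encoded_logs, threshold=0.85):
    """Greedy clustering: join the first cluster whose representative is similar enough."""
    clusters = []  # each cluster is a list of indices; index 0 is the representative
    for i, log in enumerate(encoded_logs):
        current = set(log)
        for members in clusters:
            if jaccard(current, set(encoded_logs[members[0]])) > threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Residues are only comparable when they were produced with the same moduli, which is why grouping by prime matrix comes before any similarity is computed.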
It should be understood that a three-level structure is used to classify the original log information: a log acquisition component, an intermediate storage node and a cloud log classification computing node. The log acquisition component and the intermediate storage node are deployed inside the user's environment, while the cloud log classification computing node is deployed outside it. Each acquisition node generates its own set of prime moduli, so the sets generated by different acquisition nodes differ, and the intermediate storage node forwards these sets to the cloud log classification computing node.
From the above description, it can be seen that the following technical effects are achieved:
The embodiment of the application adopts a multi-level log classification architecture in which log acquisition nodes, storage nodes and cloud service nodes cooperate, so that calculation tasks are balanced and shared. Because several different prime moduli are used to apply congruence operations to the logs to be classified, a simple encryption algorithm can replace a complex one for communication, and clustering is performed on the congruence operation results. This improves the accuracy of classifying new patterns and effectively reduces the calculation load, thereby solving the technical problems of low classification accuracy for new patterns and high calculation load.
According to an embodiment of the present invention, preferably, the obtaining of the log to be classified includes:
and the log acquisition node performs word segmentation on the original log and removes stop words to obtain the log to be classified.
The log acquisition node segments the original log into words using a dictionary-based word segmentation algorithm (the dictionary covers Chinese and English) and removes stop words. Stop words mainly include punctuation marks, common and meaningless digit strings, and month information in the log (such as Jan and Feb); they are removed and do not appear in the final word segmentation result. In general, only dictionary-based word segmentation is used (for Chinese as well).
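A rough Python sketch of this preprocessing step follows. It swaps the dictionary-based segmenter for a simple regular-expression tokenizer, which is an assumption made to keep the example self-contained; in particular, runs of Chinese characters are kept whole rather than segmented against a dictionary. The names `MONTHS`, `TOKEN` and `to_classify` are hypothetical.

```python
import re

# Month abbreviations treated as stop words, following the Jan/Feb example in the text.
MONTHS = {"jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"}

# Identifiers, digit runs, or runs of CJK characters; punctuation never matches.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[\u4e00-\u9fff]+")

def to_classify(raw: str) -> list[str]:
    """Tokenize a raw log line and drop digit strings and month names."""
    tokens = TOKEN.findall(raw)
    return [t for t in tokens if not t.isdigit() and t.lower() not in MONTHS]
```

For a line such as `Jan 12 10:03:55 host1 sshd[4242]: Failed password for root`, the timestamp, process id and punctuation are discarded and only the words that describe the log pattern remain.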
According to the embodiment of the invention, preferably, the log acquisition node is further used for carrying out finite field number theory inverse operation on the congruence operation result to generate check data; transmitting the logs to be classified, the prime number matrix, the congruence operation result and the check data to a storage node for storage;
the storage node is also used for initiating a classification request to the cloud service node, and uploading the log to be classified, the prime number matrix, the congruence operation result and the check data to the cloud service node by using an encryption algorithm;
the cloud service node is further used for carrying out the inverse verification of the number theory of the logs to be classified based on the verification data;
dividing the logs to be classified which pass verification according to the congruence operation result;
and clustering the division results by using a clustering algorithm.
The log acquisition node generates check data for a small, randomly sampled fraction of the data (for example 1%, chosen according to the data scale). The check data is produced by a finite-field number-theoretic inverse operation, as follows:
select the congruence calculation result of the data to be classified and compute the number-theoretic inverse of each packet.
The log acquisition node transmits the processed congruence operation result, the prime matrix, the check data and the original content of the log to be classified to the storage node, and the storage node stores the data.
The storage node initiates a classification task request to the cloud service node and, at the same time, uploads the log to be classified, the prime matrix, the check data and the congruence operation result to the cloud service node using a simple encryption algorithm. A task uploaded by the storage node may include data from several different log acquisition nodes.
After receiving the data, the cloud service node verifies the items that carry check data. Once the data pass the number-theoretic inverse verification, it divides the data to be classified according to the upload task and its prime matrix: data to be classified that share the same prime matrix form one classification computing task. Each task clusters its data using a clustering algorithm such as DBSCAN, or clustering based on cosine similarity or Jaccard similarity, and items whose similarity exceeds a set threshold (for example 85%) are assigned to the same class. The prime matrix, rather than the acquisition node identifier, must be uploaded because the prime matrix may change dynamically.
Using the prime matrix and its inverse operation on the sampled check data, the cloud service node can check that the data to be processed have not been tampered with. Because the inverse calculation is computationally light, this further reduces the calculation load while still ensuring data security.
According to the embodiment of the invention, the cloud service node is preferably further used for transmitting the clustering operation result back to the storage node by using an encryption algorithm;
and the storage node is also used for storing according to the clustering operation result.
If classification nodes are deployed in the cloud, especially on a public cloud, sensitive information may leak: even if a strong encryption algorithm is used during transmission, the complete information is stored in the cloud, and once a cloud host is compromised the sensitive log information may be exposed. To address this, the cloud computing node classifies the information that has undergone congruence calculation according to the classification task and the prime modulus matrix it received, and returns the classification result to the intermediate storage node. The cloud computing node does not persist the classified data on any cloud storage resource, which avoids log leakage caused by a compromised cloud host.
In addition, the classification algorithm of this patent operates only on the common pattern information shared by different original logs, so a leak of the cloud data would have no substantial impact.
It will be apparent to those skilled in the art that the modules or steps of the invention described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, or they may alternatively be implemented in program code executable by computing devices, such that they may be stored in a memory device for execution by the computing devices, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the same, but rather, various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.
Claims (10)
1. A log classification method, comprising:
the log acquisition node generates a prime number set and forms the prime number set into a prime number matrix;
performing congruence operation on the logs to be classified by using the prime number matrix;
transmitting the logs to be classified, the prime matrix and the congruence operation result to a storage node for storage;
the storage node initiates a classification request to the cloud service node, and uploads a log to be classified, a prime matrix and a congruence operation result to the cloud service node by using an encryption algorithm;
the cloud service node divides the logs to be classified according to the prime matrix and the congruence operation result;
and clustering the division results by using a clustering algorithm.
2. The log classification method according to claim 1, wherein the acquisition of the log to be classified includes:
and the log acquisition node performs word segmentation on the original log and removes stop words to obtain the log to be classified.
3. The log classification method of claim 1, wherein the performing a congruence operation on the log to be classified using the prime matrix further comprises:
performing finite field number theory inverse operation on the congruence operation result to generate check data;
transmitting the logs to be classified, the prime number matrix, the congruence operation result and the check data to a storage node for storage;
the storage node initiates a classification request to the cloud service node, and uploads a log to be classified, a prime matrix, a congruence operation result and check data to the cloud service node by using an encryption algorithm;
the cloud service node performs the inverse verification of the number theory of the logs to be classified based on the verification data;
dividing the logs to be classified which pass verification according to the congruence operation result;
and clustering the division results by using a clustering algorithm.
4. The log classification method of claim 1, wherein the prime number sets are regenerated at a fixed frequency.
5. The log classification method of claim 1, wherein performing congruence operations on the log to be classified using the prime matrix comprises:
defining a prime matrix $P = \{p_{ij}\}$;
defining the log to be classified $W = \{w_i\}$ and converting it into a full string set $S = \{c_i\}$;
grouping the strings in $S$ into packets of 16 bytes, and padding any packet shorter than 16 bytes with a preset string;
for one packet in $S$, performing the congruence operation with $\{p_{\cdot,j}\}$, and then, for the next packet, performing the congruence operation with $\{p_{\cdot,(j+1) \bmod n}\}$.
6. A method of log classification as claimed in claim 1 or 3 wherein the clustering algorithm is a DBSCAN algorithm, a cosine similarity algorithm or a Jaccard similarity algorithm.
7. The log classification method according to claim 1 or 3, wherein, after the division results are clustered using the clustering algorithm, the method further comprises:
the cloud service node transmits the clustering operation result back to the storage node by using an encryption algorithm;
and the storage node stores according to the clustering operation result.
8. The method according to claim 1, wherein the columns of the prime matrix are longitudinal congruence operation factors, used to perform simple congruence calculations with different moduli on the same block of content to be calculated; and the rows of the prime matrix are transverse congruence calculation segments, used to calculate different contents to be calculated.
9. A log classification system, comprising:
the log acquisition node is used for generating a prime number set and forming the prime number set into a prime number matrix; performing congruence operation on the logs to be classified by using the prime number matrix; transmitting the logs to be classified, the prime matrix and the congruence operation result to a storage node for storage;
the storage node is used for initiating a classification request to the cloud service node, and uploading the log to be classified, the prime matrix and the congruence operation result to the cloud service node by using an encryption algorithm;
the cloud service node is used for dividing logs to be classified according to the prime matrix and the congruence operation result; and clustering the division results by using a clustering algorithm.
10. The log classification system of claim 9, wherein:
the log acquisition node is also used for carrying out finite field number theory inverse operation on the congruence operation result to generate check data; transmitting the logs to be classified, the prime number matrix, the congruence operation result and the check data to a storage node for storage;
the storage node is also used for initiating a classification request to the cloud service node, and uploading the log to be classified, the prime number matrix, the congruence operation result and the check data to the cloud service node by using an encryption algorithm;
the cloud service node is further used for carrying out the inverse verification of the number theory of the logs to be classified based on the verification data;
dividing the logs to be classified which pass verification according to the congruence operation result;
and clustering the division results by using a clustering algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311811547.5A CN117473094B (en) | 2023-12-27 | 2023-12-27 | Log classification method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311811547.5A CN117473094B (en) | 2023-12-27 | 2023-12-27 | Log classification method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117473094A (en) | 2024-01-30 |
CN117473094B (en) | 2024-03-22 |
Family
ID=89639976
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311811547.5A Active CN117473094B (en) | 2023-12-27 | 2023-12-27 | Log classification method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117473094B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110263009A (en) * | 2019-06-21 | 2019-09-20 | 深圳前海微众银行股份有限公司 | Generation method, device, equipment and the readable storage medium storing program for executing of log classifying rules |
CN112860648A (en) * | 2020-12-30 | 2021-05-28 | 苏宁消费金融有限公司 | Intelligent analysis method based on log platform |
CN113535667A (en) * | 2020-04-20 | 2021-10-22 | 烽火通信科技股份有限公司 | Method, device and system for automatically analyzing system logs |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10528407B2 (en) * | 2017-07-20 | 2020-01-07 | Vmware, Inc. | Integrated statistical log data mining for mean time auto-resolution |
Also Published As
Publication number | Publication date |
---|---|
CN117473094A (en) | 2024-01-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||