CN108573069B

CN108573069B - Twins method for accelerating matching of regular expressions of compressed flow

Info

Publication number: CN108573069B
Application number: CN201810419466.3A
Authority: CN
Inventors: 胡成臣; 孙秀文; 李�昊
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2018-05-04
Filing date: 2018-05-04
Publication date: 2020-10-27
Anticipated expiration: 2038-05-04
Also published as: CN108573069A

Abstract

The invention relates to a Twins method for accelerating matching of a compressed flow regular expression.A core component is a compressed flow Twins matching engine which comprises a decoding module, a Twins matching algorithm, a finite state automaton and state record data required by a processing process; the compressed flow Twins matching engine uses a regular expression to be matched to construct a finite state automaton, then decodes the byte content of the compressed flow, finally uses a Twins matching algorithm to carry out matching, and outputs a matching result; the Twins matching algorithm scans the decoded text string using a finite state automaton, and processes the encoded string using the Twins algorithm. The method effectively improves the throughput rate of regular matching on the compressed flow under the condition of ensuring that the matching result is the same as that of the Naive method, and has the advantages of high matching speed, simple and convenient implementation and strong expansibility.

Description

Twins method for accelerating matching of regular expressions of compressed flow

Technical Field

The invention relates to a pattern matching method for compressed flow, in particular to a Twins method for accelerating compressed flow regular expression matching.

Background

With the wide application of compression technology in network traffic, more and more Web servers compress HTTP page content and send the compressed HTTP page content to a browser. 66% of sites in Alexa Top 1000 in month 7 2010 used HTTP compression, while in Top 500sites in month 10 in 2016, this proportion was already over 95%. And these compression flows are about 20% compression ratios, which greatly affect the pattern matching speed of the compression flows.

In addition, more and more Deep Packet Inspection (DPI) based tools and applications employ regular expression matching engines to identify features in traffic for comprehensive and multi-level matching. Such as intrusion detection systems, flow pricing, firewalls, and the like. These tools face the compression flow rate, typically using two approaches:

(1) naive method (Naive): that is, the compressed flow is completely decompressed first, and then the decompressed data is pattern-matched byte by byte. The method is the most naive method, is simple to implement, but has the advantage that the processing throughput rate is greatly reduced due to the existence of compression, and becomes a performance bottleneck in the whole process of the system.

(2) Patch method (Patch): and the request of the client is modified to inform the server that the server does not receive the compressed data, so that the server is forced to send the original data. This approach only avoids the overhead of the traffic decompression process compared to Naive. But the integrity of the communication data between the client and the server is damaged; furthermore, the original purpose of HTTP design for compressing the traffic is abandoned by using the uncompressed traffic, and the use of the network bandwidth cannot be reduced.

At present, there are many relevant patents related to multi-pattern matching, such as a chinese patent "a multi-string matching method" disclosed in 2010, 12/01, a chinese patent "a multi-string matching method and chip" disclosed in 2007, 10/10, a chinese patent "deep packet inspection method based on suffix automaton regular engine construction" disclosed in 2013, 08/21, and a chinese patent "an adaptive multi-pattern matching method and system" disclosed in 2006, 11/29, but none of them relates to multi-pattern matching for compressed traffic. Some DPI work is performed on HTTP compressed traffic, wherein ACCH published in IEEE/ACM Transactions on network in 2012 and COIN published in IEEE/ACM systematic on Quality of Service in 2017, both of which accelerate the pattern matching process by decompressing the traffic and then skipping the scanning of partial characters in the matching process by using the information stored in the decompressing process. However, they can only perform multi-string scanning on compressed traffic, and cannot be applied to matching of regular expressions, thereby further limiting application scenarios. The ARCH published in the article of IEEE Conference on Computer Communications in 2015 is a method capable of performing regular expression matching on compressed traffic, but its core algorithm is the same as the ACCH, and the algorithm has more overhead in the matching process; in addition, the method for calculating depth parameters, which is proposed to adapt to the ACCH algorithm, is time-consuming and difficult to meet the requirement of line speed matching, and even in some implementations based on a fast automaton engine, the processing throughput rate is still not as high as that of the Naive method.

To further illustrate the details of the present invention, we first present the terms of art and definitions to which this invention relates:

A)gzip/DEFLATE

gzip is a commonly used content encoding method recommended by HTTP1.1, wherein we use gzip as its encoding method for 434 pages out of 460 pages acquired in 2017 in month 5 according to the Aleax Top 500 list in 2016 in month 10. DEFLATE is the compression method used by gzip and implements compression and encoding based on LZ77 and huffman encoding, respectively.

FIG. 1 shows a schematic diagram of the gzip compression process, where the original text is a two-line character string representing the URL of a web page, and the "http:// www." in the second line is encoded as <11,17> by LZ77 compression. Indicating that the compressed content has a length of 11 bytes and can be copied from the current position shifted forward by 17 bytes. Here we call the < length, distance > pair, i.e. <11,17> as the encoding string; the reference character string is called "http:// www." in the first row, and the position relationship between the two is schematically shown in fig. 2. The character content other than the encoded character string is referred to as a text character string.

The LZ77 compressed data contains a text string and an encoded string, which are then encoded using a huffman encoding method to generate the DEFLATE data format used by gzip. Because the huffman code lengths are not equal and not all multiples of 8, the DEFLATE data is a continuous bit stream and does not have byte encoding boundaries. This is also why the existing method must be decoded before string matching can be performed.

B) Regular expression matching

String matching can be viewed as a subset of regular expressions that can only handle some simple pattern matches that do not contain Clin closures and the like. The regular expression matching engine is therefore able to handle matching of string patterns, whereas otherwise it is not.

In computational theory, regular expressions, Deterministic Finite state Automaton (DFA) and Non-Deterministic Finite state Automaton (NFA) can equivalently represent regular languages. A finite state automaton can be formally represented as a 5-tuple a ═ Q, Σ, Q₀F), wherein:

q is a non-empty finite set of states; non-empty finite character sets, commonly referred to as the input alphabet; QxSigma → Q transfer function; q. q.s₀An initial state, q₀E is Q; f, the collection of the accepting states,

DFA and NFA differ in the return value of the transfer function, which returns a single state, whereas NFA returns a collection of states, which may contain more than one state.

The finite state automaton starts from a starting state, reads in character strings to be matched character by character, and shifts to the next state according to a given transfer function. After reading the string, if the automaton stops in an accepting state belonging to F, it accepts the string, i.e. we said to match to the pattern; otherwise, the character string is rejected.

A regular expression is generally called a pattern, and a pattern matching process described by using a finite state automata generally compiles the regular expression into an NFA, then converts the NFA into a DFA, performs minimization on the DFA, and finally performs scanning processing on data to be matched by using the DFA or directly using the NFA.

In practical applications, the branch table of the DFA is usually implemented by a two-dimensional matrix, which has a higher matching speed but needs to occupy more memory. Meanwhile, some studies, which sacrifice partial matching speed to reduce occupied memory, are commonly referred to as compressed DFA, by compressing the transfer table, such as the one published in 2007 by ACM/ieee symposium on Architecture for Networking and Communications Systems.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a Twins method for accelerating the matching of a regular expression of compressed flow, which has the advantages of high matching speed, simple and convenient realization and strong expansibility.

The invention is realized by the following technical scheme:

a Twins method for accelerating the matching of a regular expression of compressed flow is characterized in that a core component is a Twins matching engine of the compressed flow, and the Twins matching engine comprises a decoding module, a Twins matching algorithm, a finite state automaton and state record data required by a processing process;

the compressed flow Twins matching engine uses a regular expression to be matched to construct a finite state automaton, then decodes the byte content of the compressed flow, finally uses a Twins matching algorithm to carry out matching, and outputs a matching result;

the Twins matching algorithm scans the decoded text string using a finite state automaton, and processes the encoded string using the Twins algorithm.

Preferably, when the Twins matching algorithm is used for processing the code string, the steps are as follows:

step (1) using a parameter curPos, assigning an initial value to be 0, and then judging whether the states returned by the automaton at the position of a byte before a coded character string and a corresponding reference character string are the same or not, if the states are different, entering step (2), and if the states are the same, turning to step (3);

starting from the first byte of the coded character string, scanning by using a finite state automaton, wherein each time one byte is scanned, the currPos is increased by 1;

then comparing whether the returned state is the same as the state stored at the same offset in the corresponding reference character string, and if so, turning to the step (3); if not, scanning the next byte; returning to the state of processing the last character until all characters in the code character string are scanned; if the receiving state is found in the scanning process, the state number and the byte position are stored in the matching result;

copying the state stored after the shift of the curPos in the reference character string to the same shift of the coding character string, and if the receiving state is found in the copying process, storing the state number and the byte position as a matching result; after the completion, the state of the last byte of the code character string is returned, and the scanning operation is continued.

Preferably, the Twins matching algorithm processes the compressed flow by:

step 1, constructing a matching engine: firstly, analyzing a regular expression to be matched, constructing a finite state automaton for matching, wherein the finite state automaton comprises DFA or NFA, and numbering each state of the automaton as a unique identifier of the automaton; then applying for a storage space and storing the state record data of the processing process;

and step 2, decoding: reading compressed flow data, using static Huffman coding or constructing a Huffman coding tree according to different types of data, and analyzing the compressed data; decoding compressed data into two categories: a text string and an encoding string;

step 3, Twins algorithm processing: scanning the decoded text character string by using a finite state automaton; for the coded character string, processing by using a Twins algorithm; in the scanning and processing process, state record data and matching results are updated at any time;

and 4, repeating the steps 2-3 until all the compressed flow is processed.

Preferably, the decoding module performs huffman decoding on the data compressed by adopting gzip or DEFLATE method; decoding causes the original compressed traffic, which is not byte-bounded, to become a text string and an encoded string, which are byte-bounded.

Preferably, the finite state automaton uses an automaton construction algorithm to compile the regular expression into a finite state automaton that saves the returned state into the state record data as each byte is scanned.

Preferably, the state record data is used for saving the state of the automaton returned by the scanning data of the working process of the compression flow Twins matching engine and other required parameters.

Compared with the prior art, the invention has the following beneficial technical effects:

in the matching scanning process, the method records the state of the automaton returned by scanning each byte, and skips the scanning of the byte content behind the code character string if the returned state of the automaton is found to be the same as the state stored at the same offset position in the corresponding reference character string when the code character string is processed. The method effectively improves the throughput rate of regular matching on the compressed flow under the condition of ensuring that the matching result identical to that of the Naive method is obtained. It has the following advantages:

(1) the matching speed is high

In the existing method for matching character strings of compressed flow, under the condition that the compression ratio is 20%, the matching throughput rate of the Naive method is reduced to about 20% of the throughput rate of the uncompressed data. In the implementation based on the compressed DFA, the ARCH is the one with the highest matching speed, and compared with a Naive method, the throughput rate is improved by about 3.2 times. However, the speed of the DFA implemented based on the two-dimensional matrix is still not as fast as the Naive method.

The invention is realized based on two DFAs, and the matching performance of the invention is effectively improved. Based on the DFA implementation scheme, the average throughput rate reaches 1.2Gbps, which is 2.86 times of that of ARCH; the average throughput rate of compressed DFA-based implementations is 2.1 times greater.

(2) Simple and convenient to realize

Since the ARCH uses the same algorithm as the ACCH, it requires the user to specify parameters during its use. In addition, the ARCH needs to calculate an Input-Depth parameter for each character to be matched, and then the parameter is applied to a processing algorithm, so that the processing process is complicated. In the matching process, the invention only needs to store the state returned in the matching process, has small additional cost, does not need to manually set parameters, and is simple to realize.

(3) Strong expansibility

The finite state automata in the invention can select a general DFA or NFA which is widely used, or other various types of finite state automata designed for the purpose of reducing the space size of the DFA, and the like, so that the conventional system can be conveniently modified to improve the throughput rate when regular expression matching is carried out on compressed flow.

Drawings

FIG. 1 is a schematic diagram of the gzip compression process described in the prior art.

FIG. 2 is an exemplary diagram of the reference string, the encoded string, and the text string after gzip decompression using the exemplary data of FIG. 1.

FIG. 3 is a logic block diagram of the processing of the method in an example of the invention.

Fig. 4 is example data illustrating a process of encoding a character string.

FIG. 5 is a DFA generated by an example of a regular expression (ab + c) | (bc + d).

Fig. 6 is a diagram illustrating a process of encoding a character string and a status recording result.

FIG. 7 is a comparison graph of average throughput rates under three sets of rule sets, for regular expression matching based on two DFA implementation schemes, in comparison with ARCH and Naive methods.

Detailed Description

The technical solution of the present invention will be described in detail and fully with reference to the following examples, and it should be understood that the described examples are only a part of the examples of the present invention, and not all of the examples. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention discloses a regular expression matching method for accelerating compressed flow, which is called Twins method, and the core component of the method is a compressed flow Twins matching engine. The method comprises the steps of constructing a finite state automaton by using a regular expression to be matched, decoding the content of compressed flow data, scanning the decoded data by using a Twins matching algorithm, and outputting a matching result. The engine comprises three modules of decoding, Twins matching algorithm and finite state automata, and intermediate data stored in a processing process, namely a state recording module. The technical scheme of the invention has higher throughput rate to the compression flow, is convenient to use and has better expansibility.

As shown in fig. 3, a dashed box is a compressed flow Twins matching engine 101, which constructs a finite state automaton 1013 using a regular expression 103 to be matched, processes the compressed flow 102, and outputs a matching result 104. The core component is a compression traffic Twins matching engine 101. It includes three processing modules of decoding module 1011, Twins matching algorithm 1012 and finite state automata 1013, and state record data 1014 required for the processing procedure. The engine uses the regular expression 103 to be matched to construct a finite state automaton 1013, then decodes the content of 102 bytes of compressed flow, finally uses the Twins matching algorithm 1012 to match, and outputs the matching result 104.

In the specific implementation process, the decoding module 1011 in the matching engine performs huffman decoding on the content compressed by gzip or DEFLATE method. In decoding, a huffman tree is constructed according to static or dynamic huffman coding used by compressed content, then compressed format data is analyzed, and a coded text string and a coded string are sequentially extracted, and the relationship between them is shown in fig. 2.

The finite state automata 1013 compiles the regular expression into the finite state automata 1013 using existing automata construction algorithms and marks its state in the construction process. In this embodiment, we use a two-dimensional matrix to save the state transition table of the automaton as the implementation of the DFA. Meanwhile, a finite state automaton used in performance evaluation in an ARCH paper is used as an implementation of compressing DFA. In the process of pattern matching, the automaton reduces the size of the DFA by compressing the number of transfer functions under the condition of sacrificing some matching performance, and at most 2N times of processing is needed when matching the content to be matched with the length of N.

The state record data 1014 stores intermediate data such as a state used in the engine operation process. Since DEFLATE specifies that the maximum distance of the encoded string to the reference string is 32768, we use the 32K circular queue save finite state automata 1013 to scan the state returned by each character in this embodiment.

The Twins matching algorithm 1012 directly scans the decoded text string using the finite state automata 1013. The Twins algorithm is used for processing the code character string, and the process is as follows:

(1) assigning a currPos value to be 0, and then judging whether the states returned by the automaton at the position of the encoding character string and the previous byte of the corresponding reference character string are the same or not, if the states are different, entering the step (2), and if the states are the same, turning to the step (3).

(2) Starting from the first byte of the encoded string, the scan is performed using an automaton, with a currpos increment of 1 for each byte scanned. And then comparing whether the returned state is the same as the state stored at the same offset in the corresponding reference character string, and if so, turning to the step (3). If not, scanning the next byte; and returning to the state of processing the last character until all characters in the code character string are scanned. If an acceptance state is found during the scanning process, the state number and the byte position are saved in the matching result 104.

(3) Copying the state stored after the shift of the currpos in the reference string to the same shift of the encoding string, and if an accepting state is found during the copying process, storing the state number and the byte position as the matching result 104. After the completion, the state of the last byte of the code character string is returned, and the scanning operation is continued.

In order to more intuitively explain the processing procedure of the encoding character string, the present invention is exemplified by the data in fig. 4 and the finite state automaton in fig. 5, and the process state records of the processing correspond to (a) to (b) in fig. 6. In fig. 4, the content in the parenthesis "< >" in the compressed data is the code string. States 3 and 6 in FIG. 5 are accepting states, representing a match to the pattern represented by the regular expression. In fig. 6, light gray on the left side indicates a reference character string region, and dark gray on the right side indicates an encoding character string region; the 1 st line is the byte scanned by the automaton, and the 2 nd line (State) is the State number, i.e., each State in fig. 5.

(a) Ex 1: the algorithm step (1) assigns a currpos value of 0, and since the State (State) at the previous byte "2" of the code string and the State of the previous byte of the corresponding reference string are both 0, the algorithm goes to step (3), and at this time, the currpos value is 0. The saved State ("1235") at the reference string location is copied to the encoding string and the last State 5 is returned, ending the processing of Ex 1.

(b) Ex 2: the algorithm step (1) assigns currpos to 0, and since the State (State) at the previous byte "3" of the code string is 0, it is different from the State 1 at the previous byte of the corresponding reference string, so it goes to step (2). The code string is scanned continuously for three bytes "bcc" resulting in a return state of 5, which is the same as the state stored at the same offset location in the reference string, and step (3) is performed, where currpos is 3. Thereafter, the two states "60" after the currpos are copied from the same position of the reference character to the position of the encoded string. Finally, return to state 0 ends the processing for Ex 2.

In order to illustrate the actual effect, the invention selects real compressed flow data and a regular expression set to be matched for verification. The compressed flow is the compressed page data obtained by the crawler program from Alexa Top 500Sites, and the characteristics are shown in Table 1. In addition, the regular expressions to be matched are three sets of rule sets Snort24, Snort31 and Snort34 used in the ARCH paper.

TABLE 1 compression flow characteristics collected

Number of pages	Compression size (MB)	Decompressed size (MB)
			434	15.54	70.24

And respectively carrying out matching analysis on the three groups of rule sets under Intel i5-4460 and 8G RAM platforms. In the embodiment of the invention and the ARCH and Naive method based on DFA and compressed DFA, the comparison result of the average throughput rate in the matching process is shown in FIG. 7. As can be seen from the figure, in both embodiments, the throughput of the present invention is significantly improved compared to the ACCH and Naive methods.

While the invention has been described in further detail with reference to specific preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A Twins method for accelerating compressed flow regular expression matching is characterized in that a core component is a compressed flow Twins matching engine (101) which comprises a decoding module (1011), a Twins matching algorithm (1012) and a finite state automaton (1013), and state record data (1014) required by a processing process;

a compressed flow Twins matching engine (101) uses a regular expression (103) to be matched to construct a finite state automaton (1013), then decodes byte contents of the compressed flow (102), finally uses a Twins matching algorithm (1012) to carry out matching, and outputs a matching result (104);

the Twins matching algorithm (1012) uses a finite state automaton (1013) to scan the decoded text character string, and uses the Twins algorithm to process the coded character string;

when the Twins matching algorithm (1012) is used to process the encoded string, the steps are as follows:

then comparing whether the returned state is the same as the state stored at the same offset in the corresponding reference character string, and if so, turning to the step (3); if not, scanning the next byte; returning to the state of processing the last character until all characters in the code character string are scanned; if the receiving state is found in the scanning process, the state number and the byte position are stored in the matching result (104);

copying the state stored after the shift of the curPos in the reference character string to the same shift of the coding character string, and storing the state number and the position of the byte as a matching result (104) if an accepting state is found in the copying process; after the completion, the state of the last byte of the code character string is returned, and the scanning operation is continued.

2. The Twins method for accelerating regular expression matching of compressed flow according to claim 1, wherein the Twins matching algorithm processes compressed flow by the following steps:

step 1, constructing a matching engine: firstly, analyzing a regular expression (103) to be matched, constructing a finite state automaton (1013) for matching, wherein the finite state automaton comprises a DFA (DFA) or an NFA (NFA), and numbering each state of the automaton as a unique identifier of the automaton; then applying for a storage space, and saving the state record data (1014) of the processing process;

and step 2, decoding: reading data of the compressed flow (102), and analyzing the compressed data by using static Huffman coding or constructing a Huffman coding tree according to different types of data; decoding compressed data into two categories: a text string and an encoding string;

step 3, Twins algorithm processing: scanning the decoded text character string by using a finite state automaton; for the coded character string, processing by using a Twins algorithm; in the scanning and processing process, updating the state record data (1014) and the matching result (104) at any time;

and 4, repeating the steps 2-3 until all the compressed flow is processed.

3. The Twins method for accelerating regular expression matching of compressed flow according to claim 1, wherein said decoding module (1011) performs Huffman decoding on data compressed by gzip or DEFLATE method; the decoding causes the original compressed traffic (102), which is not byte-bounded, to become a text string and an encoded string, which are byte-bounded.

4. The Twins method of accelerated compressed flow regular expression matching as claimed in claim 1, wherein the finite state automata (1013) uses automata construction algorithm to compile the regular expression into the finite state automata (1013), and the finite state automata (1013) saves the returned state into the state record data (1014) while scanning each byte.

5. The Twins method for accelerating packed flow regular expression matching according to claim 1, wherein the state of the automaton returned by the working process scanning data of the packed flow Twins matching engine (101) and other required parameters are saved by the state record data (1014).