Symbolic Analysis for Data Plane Programs Specialization

Authors:

Thomas Luinaud,

J. M. Pierre Langlois,

Yvon Savaria

ACM Transactions on Architecture and Code Optimization, Volume 20, Issue 1

Article No.: 1, Pages 1 - 21

https://doi.org/10.1145/3557727

Published: 17 November 2022

Abstract

Programmable network data planes have extended the capabilities of packet processing in network devices by allowing custom processing pipelines and agnostic packet processing. While a variety of applications can be implemented on current programmable data planes, there are significant constraints due to hardware limitations. One way to meet these constraints is by optimizing data plane programs. Program optimization can be achieved by specializing code that leverages architectural specificity or by compilation passes. In the case of programmable data planes, to respond to the varying requirements of a large set of applications, data plane programs can target different architectures. This leads to difficulties when developers want to reuse the code. One solution is to use compiler optimization techniques. We propose performing data plane program specialization to reduce the generated program size. To this end, we propose to specialize programs written in P4, a Domain Specific Language (DSL) designed for specifying network data planes. The proposed method takes advantage of key aspects of the P4 language to perform a symbolic analysis on a P4 program and then partially evaluate the program to specialize it. The approach we propose is independent of the target architecture. We evaluate the specialization technique by implementing a packet deparser on an FPGA. The results demonstrate that program specialization can reduce the resource usage by a factor of 2 for various packet deparsers.

1 Introduction

Network devices such as switches or Network Interface Cards (NICs) have to support evolving network protocols and functions in different environments [18]. Therefore, there is a growing interest in programmable data planes to process network packets. These devices have been implemented on different technologies such as ASIC [1], software switches [19], or FPGAs [3, 22].

Programmable data planes implement programs that express how a packet is processed. Optimizing a data plane program can reduce the processing time, hence reducing the processing latency. When a data plane program is implemented in custom hardware devices, such as FPGAs or ASICs, optimizing it can reduce resource usage, which can lower energy consumption or increase the number of network functions that fit in the same device.

One known optimization technique is software specialization, which consists of transforming a program into a smaller one [10]. One software specialization technique is partial evaluation. With partial evaluation, part of the program is evaluated using a set of known variable values. Partial evaluation allows precomputing some outputs of the program and removing invalid paths.

To determine variable values, symbolic analysis can be used. Symbolic analysis consists of evaluating and deriving the possible values of variables at different points in the program. It is done by evaluating the program, which can be performed during compilation [8].

In this article, we present the specialization of a data plane program using symbolic analysis combined with partial evaluation. The data plane programs are written in P4, a Domain Specific Language (DSL) (Section 2). The main contributions of this article are:

– A symbolic header validity analysis of functionalities expressed through a P4 program (Section 3).

– A deparser graph transformation algorithm (Section 4).

– The specialization of a P4 program with partial evaluation (Section 5).

– A proof that partial evaluation can generate an optimal deparser graph (Section 5.3).

We present an example of how the optimization can be used (Section 6). We evaluate the optimization by implementing a packet deparser on an FPGA (Section 7). While our solution is evaluated on an FPGA, we claim that other targets would benefit from the proposed optimization (Section 4.2).

2 The P4 Language

P4 is a DSL, proposed by Bosshart et al. in 2014 [5], to express packet processing programs for programmable network data planes. One well-known P4 target is the Reconfigurable Match Table (RMT) architecture [6], which is composed of three configurable blocks: the Parser block, the Processing block, and the Deparser block, as shown in Figure 1.

Fig. 1.

With P4₁₆ 1.2.2 [11], the version considered in this article, each block of the RMT architecture in Figure 1 can be expressed. The Parser, which extracts headers and inserts them into the Packet Header Vector (PHV), is described as a P4 |parser| block. The Processing block, which modifies the PHV, is expressed as a P4 |control| block, as is the Deparser, which recomposes the packet by combining the modified headers from the PHV with the Packet Payload. Finally, the PHV is expressed as a P4 |struct|, which is passed to each block as a parameter.

In the remainder of this section, we present different elements of the P4 language used to program a packet processing architecture. First, we describe how the PHV is expressed in P4. Then, the |parser| block is explained. Finally, the |control| block is described.

2.1 PHV in P4

Listing 1.

2.2 P4 Parser Block

Fig. 2.

Listing 2.

2.3 P4 Control Block

The |control| block in P4 is used to express the processing carried out on the PHV. This block can be decomposed into three parts. The first part defines the parameters given to the block, the second is the declaration part, and the third is the processing defined within the |apply| block. Two examples of |control| blocks are presented: Listing 3 expresses a Processing block and Listing 4 expresses a Deparser.

The Processing block described in Listing 3 takes one parameter, |PHV| of type |header_s|. The deparser block in Listing 4 takes two parameters, |Pkt_out| and |PHV|.

In the declaration part of the control block, a programmer can declare |action| and |table| elements. The |action| element describes functions that can take parameters. Two action declarations are shown in Listing 3 at lines 2 and 5. A |table| is a structure abstracting the concept of a match-action table. It compares a key against a set of rules and executes an action as a result. The set of rules determines the action to execute with the associated parameters. These rules are defined at runtime and are unknown at compile time. An example of a |table| declaration is shown in Listing 3 at line 8.

Listing 3.

Listing 4.

3 Symbolic Analysis On Headers

Specializing a program requires prior knowledge of the data the program has to process. This information can be obtained by analyzing variables at different points in the program. One analysis method consists of symbolically executing the program to evaluate variable values for every execution path. In many cases, such an analysis cannot be done because there are too many possibilities. However, in the case of P4 programs, some language properties simplify this analysis.

In this section, we present the symbolic analysis of a P4 program to generate a set of Header Validity Vectors (HVVs). First, we explain why in the case of P4 this analysis can be done. Then, we explain how a symbolic analysis can be done on a P4 program. Finally, we discuss the precision of the proposed analysis.

3.1 Symbolic Analysis Considerations

Symbolic analysis consists of evaluating the possible values of variables at different positions in a program. Often, such an analysis only looks at symbolic values of specific variables and is limited in the information that can be considered, since the number of possibilities is very high. However, in the present article, we leverage three properties of the P4 language to perform symbolic analysis. The first property is that we only consider the validity field of a header, and this field can only take two values, Valid or Invalid. The second property is that any P4 program can be converted into a Directed Acyclic Graph (DAG), since the language does not allow loops [16]. In addition, the number of operations on any byte of a packet is constant and determined at compile time, hence the number of possibilities is finite [11]. The third property is that the maximum number of headers is known at compile time and that all headers' validity fields are initialized to Invalid. As a result, the symbolic analysis of headers is guaranteed to be finite.

While the presented properties guarantee that the number of possibilities is finite and known, the total number of possible values could be very high. With \(N\) the number of declared headers, the total number of possible HVVs is at most \(2^N\). Storing all header validity combinations could be very expensive on current technologies. As an example, with 64 declared headers and a data structure that uses one bit per header to store its validity, each HVV occupies 8 bytes, so storing all \(2^{64}\) possible HVVs would require \(2^{64} \times 2^{3} = 2^{67}\) bytes of memory. As a smaller example, the code in Listing 3 has sixteen possible HVVs, shown in Table 1, where each column represents a possible HVV value.

Fig. 3.

Table 1.

Header	HVV, Invalid (I), Valid (V)
ethernet	I	I	I	I	I	I	I	I	V	V	V	V	V	V	V	V
tunnel	I	I	I	I	V	V	V	V	I	I	I	I	V	V	V	V
ipv4	I	I	V	V	I	I	V	V	I	I	V	V	I	I	V	V
tcp	I	V	I	V	I	V	I	V	I	V	I	V	I	V	I	V

Table 1. HVV Possible Values

While all of these cases are possible, most are very unlikely. The main reason is that protocol headers aim at specific functionalities, and each functionality can be considered as a layer in the protocol stack. Looking back at Table 1, tcp always comes after an ipv4 header. As a result, the cases where tcp is valid with ipv4 invalid should not happen. Also, the case where all headers are invalid is highly unlikely. We can then consider that, in practice, the number of combinations is significantly smaller than the upper bound we presented. As an example, Gibb et al. have presented a parser graph for a switch supporting four types of networks: enterprise, data center, internet service provider, and edge [14]. The resulting parser graph contains 22 nodes and 677 paths, which in our case would result in a set of 677 HVVs, significantly fewer than the more than 4 million (\(2^{22}\)) possible combinations.
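The counting argument above can be sketched in a few lines of Python. This is only an illustration: the `headers` list mirrors Table 1, and the layering filter (tcp requires ipv4) is a hypothetical stand-in for the constraints a real protocol stack imposes.

```python
from itertools import product

# Headers in the same order as the rows of Table 1.
headers = ["ethernet", "tunnel", "ipv4", "tcp"]

# Upper bound: every combination of Invalid (I) / Valid (V), i.e. 2**N vectors.
all_hvvs = [dict(zip(headers, bits)) for bits in product("IV", repeat=len(headers))]

# Illustrative layering constraint: tcp is never valid without ipv4.
layered = [h for h in all_hvvs if not (h["tcp"] == "V" and h["ipv4"] == "I")]

print(len(all_hvvs), len(layered))
```

The enumeration order matches the columns of Table 1, with ethernet as the most significant position.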

3.2 Performing Symbolic Analysis of a P4 Program

The symbolic analysis generates a set of HVVs. Since this set is used to specialize the program, it is generated by evaluating each node of the program DAG. However, only two categories of nodes have an impact on the symbolic execution results. The condition category contains select, if-then-else, and table nodes, which increase the number of paths. The header method category contains the setValid, setInvalid, and extract methods, which can modify the validity of a header.

At the beginning of the symbolic execution, a set of vectors is generated. Each vector is composed of tuples indicating, for each header, whether it is valid. The P4 program is analyzed in the order of execution. Each P4 block (parser and control) receives the header structure as an input, output, or input/output parameter. For input-only blocks, the header structure is not modified; as a result, those blocks are skipped. For the other blocks, the set of vectors is passed. The first block is visited with a set containing one vector in which all headers are initialized to invalid, as defined in the P4 specification [11].

3.2.1 Symbolic Execution of the Condition Category.

Nodes from the condition category contain branches. The select and the table nodes can have any number of branches. In the select node, the number of branches depends on the number of cases while in the table node this is determined by the number of actions. For the if-then-else node, there are at most two branches: the then block and the else block.

When a node in the condition category is encountered, Algorithm 1 is applied. The algorithm takes as input \(HeadersV\), a set of HVVs, and outputs an updated set of HVVs. First, the algorithm initializes an empty set of HVVs, \(retHdrV\). Then, each branch of the node is visited with the initial set \(HeadersV\). The results of each visit are merged into \(retHdrV\). Finally, \(retHdrV\) is returned.
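A minimal Python sketch of the procedure described for Algorithm 1 is given here. It assumes each HVV is modeled as a `frozenset` of valid header names and each branch as a function from a set of HVVs to a set of HVVs; these representations, and the names `visit_condition`, `encapsulate`, and `forwarding`, are illustrative choices rather than the authors' implementation. The two branches mimic the table actions discussed later in Section 6.1.2.

```python
def visit_condition(branches, headers_v):
    """Visit every branch with the same input set and merge the results."""
    ret_hdr_v = set()                 # retHdrV starts empty
    for visit_branch in branches:
        # Each branch sees the original HeadersV, not the merged result.
        ret_hdr_v |= visit_branch(headers_v)
    return ret_hdr_v


def forwarding(headers_v):
    """Branch that leaves header validity untouched."""
    return set(headers_v)


def encapsulate(headers_v):
    """Branch that marks the tunnel header valid in every HVV."""
    return {hvv | {"tunnel"} for hvv in headers_v}


parser_hvvs = {
    frozenset({"ethernet"}),
    frozenset({"ethernet", "tunnel"}),
    frozenset({"ethernet", "ipv4"}),
    frozenset({"ethernet", "ipv4", "tcp"}),
}
after_table = visit_condition([encapsulate, forwarding], parser_hvvs)
print(len(after_table))
```

Using a Python `set` makes the merge a plain union, so duplicate vectors produced by different branches are stored only once.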

3.2.2 Symbolic Execution of a Header Method Category.

When a method node is encountered, Algorithm 2 is applied. The algorithm takes as input \(HeadersV\), a set of HVVs, and outputs the updated set. The algorithm first inspects the method: if it is not one of the three methods that can modify header validity, the set is returned unaltered. Otherwise, the header altered by the method is extracted, and the validity assigned to that header is determined. Finally, the value of this header in all HVVs of \(HeadersV\) is updated with the new validity, and the updated set of HVVs is returned.
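The steps described for Algorithm 2 can be sketched as follows, again assuming that an HVV is a `frozenset` of valid header names. The function name `visit_method` and the `VALIDITY_METHODS` table are hypothetical; only the three method names come from the text.

```python
# Methods that change header validity, mapped to the validity they assign.
VALIDITY_METHODS = {"setValid": True, "setInvalid": False, "extract": True}


def visit_method(method, header, headers_v):
    """Return the set of HVVs after applying `method` to `header`."""
    if method not in VALIDITY_METHODS:
        # Any other method (for example emit) leaves validity untouched.
        return set(headers_v)
    if VALIDITY_METHODS[method]:
        return {hvv | {header} for hvv in headers_v}
    return {hvv - {header} for hvv in headers_v}


hvvs = {frozenset({"ethernet"}), frozenset({"ethernet", "tunnel"})}
print(visit_method("setValid", "tunnel", hvvs))
```

Because the result is a set, vectors that become identical after the update collapse into one entry, so the output can be smaller than the input.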

3.2.3 Complexity of a P4 Program Symbolic Execution.

During the symbolic execution of a P4 program, each node is visited only once. Hence, the visit complexity of a P4 program is \(O(M)\), where \(M\) is the total number of nodes. The visit of each method modifying the header validity (Algorithm 2) has a complexity relative to the number of HVVs. As discussed in Section 3.1, the worst case would be a set of \(2^N\) vectors, with \(N\) the number of headers. The complexity of Algorithm 1 depends on the merge which is \(O(n_1\times n_2)\) in the worst case where \(n_1\) and \(n_2\) are the number of elements of each set to merge. Since the worst-case size of a set is \(2^N\) we can say that the merge complexity is \(O(2^{2\times N})\) in the worst case. As discussed before, this very high complexity should be limited in real cases. As a result, implementations of symbolic analysis should integrate verification to avoid an excessively long compilation time. We do not address how such verification could be integrated. It could be a time limit while compiling or a limitation concerning the maximum size of the set of HVVs. We leave this question open for future work.
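As a worked instance of the merge bound above, assuming both sets to merge reach the worst-case size:

\[ n_1 \le 2^N,\quad n_2 \le 2^N \quad\Rightarrow\quad n_1 \times n_2 \le 2^N \times 2^N = 2^{2\times N} \]

For the four headers of Table 1 (\(N = 4\)), this gives at most \(2^{8} = 256\) pairwise comparisons per merge, whereas \(N = 22\) already gives \(2^{44}\).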

3.3 Minimal Set of Vectors

The symbolic execution generates a set of HVVs that must integrate every combination of headers that the P4 program can produce. The specialization of a P4 program then depends on the obtained set of HVVs. As a result, being able to generate a minimal set of HVVs without missing possible values allows for better specialization of programs.

Definition 3.1.

A set of vectors is considered minimal if it only contains vectors that can be generated by the P4 program, for any type of incoming packets.

The process to generate a set of vectors is applied to each P4 block. Since the set generated in a block is used to generate a set for the next block, if the first one is not minimal then subsequent sets are not minimal.

During symbolic analysis, Algorithm 1 is applied. In this algorithm, all possible branches of a condition are always considered. Since the condition is not evaluated, the output set of vectors is not guaranteed to be minimal. Thus, the output set of vectors of the symbolic analysis cannot be considered minimal. However, since all branches of a condition are taken, we can guarantee that the set contains all possible header combinations. We leave the generation of a minimal set of vectors for future work.

4 Generating A Deparser DAG

In P4, the deparser is expressed with a sequence of emit functions. Listing 4 shows an example that deparses four headers: ethernet, tunnel, ipv4, and tcp. This sequence of emit functions can be transformed into a DAG as presented in Figure 3. In this section, we present the transformation of a deparser DAG into a transitively closed one. Then, we present the advantages of this transformation to improve deparser implementations.

4.1 Transformation of a Deparser DAG

Figure 3 represents a DAG equivalent to the code shown in Listing 4. In the figure, the start and end states are respectively the entry and exit points of the code. The other nodes of the DAG execute the emit function on a header. The emit function emits a header if its validity bit is set [11].

As a result, the original DAG can be developed by converting each emit node into three nodes: an isValid node, a no-op node, and an insert node. The isValid node is the condition on the header validity. The no-op node represents an empty statement; it is connected to the isValid node when the condition result is False (i.e., the header is not valid). The insert node inserts the header and is connected to the isValid node when the condition is True (i.e., the header is valid). An example of a developed DAG is shown in Figure 4.

Fig. 4.

Finally, the developed DAG in Figure 4 can be converted into the transitively closed DAG presented in Figure 5. In the transitively closed graph, conditions are moved to edges as labels and nodes execute the insertion of a header. Because conditions are moved to edges, they can be evaluated before leaving the current node. Consequently, the no-op nodes can be removed, since the header to insert is known to be valid. The condition to reach the end node is true when there are no more valid headers to emit. We use Algorithm 3 to generate the transitively closed deparser DAG.

Fig. 5.

Algorithm 3 takes as input the original DAG, like the one shown in Figure 3, and generates a transitively closed DAG of a deparser. The algorithm extracts the list of \(nodes\) from the original DAG \(G_{ori}\) and initializes an empty DAG \(G_c\) and an empty list of previous nodes \(S\). We then go through each node of \(nodes\) in reverse order. Each node \(n\) is inserted into the new DAG \(G_c\). Then, the new node \(n\) is linked with all the previous nodes \(ns\) in \(S\). A condition is associated with each of these new links. This condition is constructed to guarantee that it is unique. Finally, the new node \(n\) is inserted into the list \(S\) of previous nodes with its associated condition.

In Algorithm 3, the transition condition is built by verifying that the header corresponding to the new node is valid and that all other headers of previous nodes are invalid. As a result, it is guaranteed that one edge is always valid. Also, because nodes are inserted in reverse order of the original DAG, the last node to be inserted is the first header to emit, and its condition is the broadest one. As an example, when looking at Figure 5, from the start node, the condition to insert the ethernet header only requires the ethernet header to be valid, while the condition to insert the tcp header requires the tcp header to be valid and all the others to be invalid.
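The construction described above can be sketched in Python as follows. This is a reconstruction from the prose, not the authors' Algorithm 3: the edge dictionary, the `(must_be_valid, must_be_invalid)` condition encoding, and the function name `close_deparser_dag` are assumptions made for illustration.

```python
def close_deparser_dag(emit_order):
    """Build a transitively closed deparser DAG from an emit sequence.

    Returns a dict mapping (src, dst) to a condition
    (header that must be valid or None, headers that must be invalid).
    """
    nodes = ["start", *emit_order, "end"]
    edges = {}
    previous = []                       # S: inserted nodes, nearest first
    for n in reversed(nodes):
        skipped = []                    # headers between n and the target
        for ns in previous:
            must_be_valid = None if ns == "end" else ns
            edges[(n, ns)] = (must_be_valid, frozenset(skipped))
            skipped.append(ns)
        previous.insert(0, n)
    return edges


g_c = close_deparser_dag(["ethernet", "tunnel", "ipv4", "tcp"])
print(g_c[("start", "ethernet")])
print(g_c[("start", "tcp")])
```

For the four headers of Listing 4, the edge from start to ethernet only requires ethernet to be valid, while the edge from start to tcp additionally requires ethernet, tunnel, and ipv4 to be invalid, as in the example given for Figure 5.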

4.2 Evaluation of Transitively Closed DAG Implementation

Because the transitively closed DAG guarantees that at any given node there is only one possible next node, for any given packet, the conditions are mutually exclusive. As a result, every condition can be evaluated in parallel. Also, the latency to generate the packet is determined by the longest path in the deparser DAG. In this section, we discuss the implementation of the transformed deparser DAG on different targets.

4.2.1 DAG Implementation on a CPU.

Because a single-threaded CPU cannot evaluate multiple conditions in parallel, each condition of a transitively closed deparser DAG is evaluated sequentially. As a result, the code implementing the proposed DAG on a CPU contains many branches; hence the code size increases.

Also, when comparing the code for the DAGs presented in Figures 4 and 5, shown in Listings 5 and 6, it can be seen that executing either code requires evaluating the validity of all headers. As a result, deparser DAG transitive closure by itself is not of interest in the case of a CPU implementation.

Simplifying the DAG by removing edges could help reduce the longest path which would lead to latency reduction. This case happens every time emitted headers are mutually exclusive. A common example would be the case where TCP and UDP are possible headers. Often, only one of them is emitted.

Since the conditions can be evaluated in parallel, a multicore architecture might alleviate this limitation. However, the conditions to evaluate are simple relative to the complexity of a CPU core. Using a multicore architecture would be expensive relative to the possible performance gain.

Listing 5.

Listing 6.

4.2.2 DAG Implementation on Specialized Architectures.

Because the conditions of the transitively closed DAG can be evaluated in parallel and their evaluation requires only standard logic units, specialized hardware implementations are good candidates to evaluate the deparser DAG. A specialized architecture would require a module to evaluate the conditions in parallel. Such a module would not be expensive in resource usage since it would only require AND gates and inverters.

Also, as explained by Luinaud et al., reducing the number of edges after a state reduces the size of the priority encoder and the interconnection complexity [17]. Indeed, on a specialized architecture, each DAG path corresponds to a set of interconnections that connect possible header output bits to the output bus. Once the DAG is closed, each edge removal reduces the total number of paths and the overall interconnection complexity. As a result, reconfigurable targets, such as FPGAs, would be good candidates for a combination of the DAG closure with optimization.

5 P4 Program Specialization

In Section 3, we presented the symbolic analysis of headers, which generates a set of HVVs. A set of HVVs at a program point can be used to specialize the rest of the program. This specialization consists of removing unreachable code. To determine unreachable code, a partial evaluation of the code can be done using the information available in the set of HVVs.

In this section, we explore P4 program specialization through partial evaluation using the results of symbolic analysis. First, we discuss the use of partial evaluation to specialize P4 program structures. Then, we present the specialization of the deparser block through edge removal in its transitively closed graph. Finally, we show that the specialization can generate an optimal deparser DAG.

5.1 Specialization of P4 Blocks

A P4 program, as explained in Section 2, is composed of blocks that process packet headers. In general, the first block is the parser, followed by processing blocks, and finally, the deparser, which assembles the modified headers with the payload to emit the new packet. Since the parser extracts headers from the packet, specializing it with the result of the symbolic analysis is not of interest. As a result, the specialization of a P4 program consists of specializing the processing blocks and the deparser block, which are P4 control types.

The specialization of the processing block and the deparser block consists of using partial evaluation to remove any branch that would be unreachable. In P4, two constructs have conditional branches and can be used inside a control block: if-then-else and table.

The table executes a search on values inserted at runtime. Since these values are unknown at compile time, performing partial evaluation on a table consists of evaluating all the possible actions. The if-then-else is based on a condition. In this case, partial evaluation could result in dead code elimination. However, some important limitations exist when performing the specialization. First, in a processing block, headers can be modified; hence, the symbolic results must be kept up to date during partial evaluation. Also, only header validity fields are known, which limits the possible optimization to conditions that only look at header validity.

Considering those different elements, the specialization would be the most efficient on a block that does not modify headers and only looks at the validity field of headers. As a result, the deparser is the preferred block on which to perform specialization using partial evaluation with symbolic analysis results. We leave to future work the optimization of the other control blocks using partial evaluation. Section 5.2 explains how deparser specialization can be performed.

5.2 Specializing the Deparser

To specialize the deparser, we take the set of HVVs obtained after symbolic analysis. The specialization is performed on the closed deparser DAG described in Section 4.1. Because the transitively closed DAG has independent edge conditions, we only have to keep the edges whose condition is satisfied by at least one HVV. As a result, we propose Algorithm 4 to specialize a deparser.

Algorithm 4 takes two inputs. The first input is a set of HVVs \(HeadersV\). The second input is the transitively closed deparser graph \(G_c\). The algorithm outputs \(G_o\), the optimized DAG.

To generate \(G_o\), it is first initialized as an empty DAG. Then, we insert into \(G_o\) all valid edges from the closed graph \(G_c\), where a valid edge has a condition that is true for at least one HVV. Finally, for any node in \(G_o\) having only one output edge, the edge condition is removed: since there is only one possible transition, the condition is always true.

To determine all the valid edges, we go through each HVV in the set \(HeadersV\). For each HVV, the deparser DAG \(G_c\) is traversed. During the traversal, for each node \(n\) in \(G_c\), a valid transition condition is searched according to the current HVV. Once a valid condition is found, the corresponding edge is inserted into \(G_o\) if it is not already present. The next node of \(G_c\) is then processed.
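A self-contained Python sketch of this specialization is shown here. It is an interpretation of the description of Algorithm 4, under the assumptions that an HVV is a `frozenset` of valid header names and that each edge of \(G_c\) carries a condition of the form (header that must be valid, headers that must be invalid); the helper names are hypothetical. The input set is the six HVVs listed in Table 2 after the processing block.

```python
from collections import Counter


def close_deparser_dag(emit_order):
    """Closed DAG G_c: (src, dst) -> (header to be valid, headers to be invalid)."""
    nodes = ["start", *emit_order, "end"]
    edges = {}
    for i, src in enumerate(nodes[:-1]):
        for j in range(i + 1, len(nodes)):
            dst = nodes[j]
            valid = None if dst == "end" else dst
            edges[(src, dst)] = (valid, frozenset(nodes[i + 1:j]))
    return edges


def holds(cond, hvv):
    """True when the edge condition is satisfied by the given HVV."""
    must_be_valid, must_be_invalid = cond
    return (must_be_valid is None or must_be_valid in hvv) and not (must_be_invalid & hvv)


def specialize(g_c, headers_v):
    """Keep only edges of G_c that are taken for at least one HVV."""
    kept = {}
    for hvv in headers_v:
        node = "start"
        while node != "end":
            for (src, dst), cond in g_c.items():
                if src == node and holds(cond, hvv):
                    kept[(src, dst)] = cond   # no effect if already present
                    node = dst
                    break
            else:
                raise ValueError(f"no valid edge out of {node}")
    # A node with a single outgoing edge needs no condition on that edge.
    out_degree = Counter(src for src, _ in kept)
    return {e: (None if out_degree[e[0]] == 1 else c) for e, c in kept.items()}


hvvs = [frozenset(v) for v in (
    {"ethernet"}, {"ethernet", "tunnel"}, {"ethernet", "ipv4"},
    {"ethernet", "ipv4", "tcp"}, {"ethernet", "tunnel", "ipv4"},
    {"ethernet", "tunnel", "ipv4", "tcp"},
)]
g_o = specialize(close_deparser_dag(["ethernet", "tunnel", "ipv4", "tcp"]), hvvs)
print(sorted(g_o))
```

In this sketch the closed graph has 15 edges and the specialized graph keeps 9 of them; the only edge leaving start goes to ethernet, and the only edge entering tcp comes from ipv4.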

5.3 Optimal Deparser DAG

An important part of the proposed optimization method is to prove the optimality of the solutions it produces. As a step in that direction, we define an optimal deparser DAG in Definition 5.1.

Definition 5.1.

An optimal deparser DAG is a DAG comprising the minimum set of nodes and edges between the start and end that allows the correct emission of any packet to be processed by a given P4 program.

Lemma 5.2.

If the output set of the symbolic analysis contains only one vector, then the optimized DAG contains only one path between start and end nodes.

Proof.

If the symbolic analysis outputs only one header validity vector, it means that the P4 program only defines one combination of valid and invalid headers. Hence, only one type of packet is emitted. As a result, there is only one valid path in the deparser DAG.□

Theorem 5.3.

Algorithm 4 generates an optimal deparser DAG according to the symbolic analysis results.

Proof.

Let us consider Algorithm 4 in the case where there is only one vector in the input set \(H_L\). As stated in Lemma 5.2, only one path should be generated. The first loop: for each \(H_t\) in \(H_L\) do contains only one iteration. Because \(G_c\) is a DAG, each node is visited only once. For each node, only one output edge is inserted to the new graph \(G_o\), and this edge condition is valid for the given valid headers. Therefore, \(G_o\) only contains nodes that can be reached with the given valid headers and each node has only one output edge. As a result, the graph has only one path.

Let us now consider what happens when \(H_L\) contains more than one vector. In the first iteration, an optimal DAG is generated for the current \(H_t\). In the next iteration, a new \(H_t\) is selected. For each selected edge, two cases can happen. One is when the edge is already in \(G_o\). In this case, the DAG is not modified, \(G_o\) is still optimal. The second case is when \(G_o\) does not contain the edge. In that case, the edge is inserted into \(G_o\). As a result, only the required edges have been inserted. Therefore, the DAG \(G_o\) is optimal. We can repeat the process for all other iterations.□

This demonstration shows that the proposed Algorithm 4 can generate an optimal deparser DAG. However, this algorithm is dependent on the precision of the header symbolic analysis results presented in Section 3.2. Not having the minimal set of vectors implies that the reduced deparser DAG contains paths that are never traversed at runtime. This might result in increased resource usage when implementing the deparser.

6 P4 Deparser Specialization Example

In Sections 3 and 5, we described the processes to analyze and specialize a P4 program. In this section, we give an example of specialization applied to the deparser block of a P4 program. The P4 program is composed, in order, of Listings 2, 3, and 4. First, the result of the symbolic analysis of the parser block and the processing block is presented. Then, the results of the deparser partial evaluation, using the results of the symbolic analysis, are shown.

6.1 Example of Symbolic Analysis

As explained in Section 3, the symbolic analysis generates a set of HVVs indicating header status. The set is generated by the symbolic analysis of the P4 program. The results of a symbolic analysis, for Listings 2 and 3, are shown in Table 2. The first analyzed block is the parser block shown in Listing 2. The second analyzed block is the processing block shown in Listing 3. At the entry of the parser block, the set is initialized with a set of one HVV with the validity field set to False, as shown in Table 2 in the line “Original set of HVVs”.

Table 2.

	ethernet	tunnel	ipv4	tcp
	Header Validity Field Value
Original set of HVVs	False	False	False	False
set of HVVs after analysis of Listing 2 (parser)	True	False	False	False
	True	True	False	False
	True	False	True	False
	True	False	True	True
set of HVVs after analysis of Listing 3 (processing)	True	False	False	False
	True	True	False	False
	True	False	True	False
	True	False	True	True
	True	True	True	False
	True	True	True	True

Table 2. Example Set of HVVs after Symbolic Analysis of Listings 2 and 3

6.1.1 Parser Symbolic Analysis.

After the parser block (Listing 2), the symbolic analysis shows that there are four possible HVVs. We can see that the ethernet header is always valid. When looking at the analysis process, at the state start, all header validity bits are False. When the first extract method is evaluated (Line 2), the ethernet validity is set to True. Then there are three branches, one to accept, one to parse_tunnel, and one to the final state parse_ipv4. As described in Algorithm 1, the resulting sets of HVV of each branch evaluation are merged. The branch, accept, returns a single HVV with only ethernet valid. The branch parse_tunnel returns a single HVV having ethernet and tunnel valid. The last branch, parse_ipv4 returns a set of two HVVs, one with ethernet and ipv4 valid and the second vector with ethernet, ipv4, and tcp valid.
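The walk described above can be reproduced with a short Python sketch. The state table `PARSER` is an assumption reconstructed from this paragraph: the state names start, parse_tunnel, and parse_ipv4 come from the text, while the `parse_tcp` state and the exact transitions out of parse_tunnel and parse_ipv4 are hypothetical, chosen so that the result matches the four HVVs of Table 2.

```python
# Hypothetical parser state graph: state -> (extracted header, next states).
PARSER = {
    "start": ("ethernet", ["accept", "parse_tunnel", "parse_ipv4"]),
    "parse_tunnel": ("tunnel", ["accept"]),
    "parse_ipv4": ("ipv4", ["accept", "parse_tcp"]),
    "parse_tcp": ("tcp", ["accept"]),
}


def analyze(state, hvv):
    """Return the set of HVVs reachable from `state` starting with `hvv`."""
    if state == "accept":
        return {hvv}
    header, next_states = PARSER[state]
    hvv = hvv | {header}              # extract marks the header valid
    result = set()
    for nxt in next_states:           # every select branch is explored
        result |= analyze(nxt, hvv)   # branch results are merged
    return result


# All headers are invalid on entry, hence the empty frozenset.
parser_hvvs = analyze("start", frozenset())
for hvv in sorted(parser_hvvs, key=len):
    print(sorted(hvv))
```

The recursion explores one HVV at a time instead of passing a whole set through each node, which is a simplification that yields the same set for this small graph.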

6.1.2 Processing Block Symbolic Analysis.

The symbolic analysis of the processing block (Listing 3) results in a set of six HVVs, as shown in Table 2. The analysis begins with the set of HVVs obtained after the parser symbolic analysis. In the processing block, the table ipv4_lpm (Line 8) is the only element that modifies headers. Indeed, the only processing parts that can modify header validity fields are the actions of this table. As explained in Section 3.2.1, analyzing a table is treated as a condition analysis. In the case of Listing 3, the table ipv4_lpm can execute two actions: encapsulate (Line 2) and forwarding (Line 5). The forwarding action does not alter the validity of any header. As a result, the output set after symbolic analysis of this action is the same as the input set, which is the set output by the parser. The encapsulate action validates the tunnel header. Therefore, after the encapsulate action, the output set contains HVVs in which the tunnel header is valid. Then, by merging the outputs of both branch analyses, we obtain the set of HVVs shown in Table 2.

6.2 Partial Evaluation of the Deparser

Once the symbolic analysis has been performed, we can optimize the deparser by specializing it. Here, we consider the deparser shown in Listing 4. As discussed in Section 5.2, the deparser specialization is performed on the closed graph. The closed deparser graph, generated using Algorithm 3, is presented in Figure 5. The specialization is done through partial evaluation using Algorithm 4, with the set of HVVs output by the processing blocks as input. The resulting graph is shown in Figure 6.

Fig. 6.

As noted in Section 6.1, the ethernet header is always valid. Therefore, the only output link from the start node is the one connected to the ethernet node. Since there is no other edge, the label of this edge is removed. We can also note that the tcp header is only valid when the ipv4 header is valid; thus, all valid paths between start and tcp go through ipv4. As a result, the only edge going to tcp is the edge ipv4-tcp.
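The pruning described above can be illustrated with a small sketch. It is not Algorithm 3 or Algorithm 4: it simply builds the transitive closure of an assumed emit chain and keeps the edges used by at least one HVV. The emit order in EMIT_ORDER is an assumption (Listing 4 is not reproduced here), and the six HVVs are those described in Section 6.1.2.

```python
from itertools import combinations

# Assumed emit order of the deparser (hypothetical; Listing 4 is not shown here).
EMIT_ORDER = ["ethernet", "tunnel", "ipv4", "tcp"]
NODES = ["start"] + EMIT_ORDER + ["end"]

# The six HVVs obtained after the processing block (Section 6.1.2).
HVVS = [
    {"ethernet"},
    {"ethernet", "tunnel"},
    {"ethernet", "ipv4"},
    {"ethernet", "ipv4", "tcp"},
    {"ethernet", "tunnel", "ipv4"},
    {"ethernet", "tunnel", "ipv4", "tcp"},
]


def closed_edges(nodes):
    """Closed graph of a chain: every node is linked to every later node."""
    return set(combinations(nodes, 2))


def specialize(edges, hvvs):
    """Keep only the edges that lie on the emit path of at least one HVV."""
    used = set()
    for vector in hvvs:
        path = ["start"] + [h for h in EMIT_ORDER if h in vector] + ["end"]
        used |= set(zip(path, path[1:]))
    return edges & used


def count_paths(edges, node="start"):
    """Count the distinct paths from `node` to the end node."""
    if node == "end":
        return 1
    return sum(count_paths(edges, dst) for src, dst in edges if src == node)


closed = closed_edges(NODES)
specialized = specialize(closed, HVVS)
print(count_paths(closed), count_paths(specialized))
print(sorted(e for e in specialized if e[1] == "tcp"))
```

Under these assumptions the sketch counts 16 paths in the closed graph and 6 after specialization, in line with the T0 column of Table 3, and the only remaining edge into tcp is the one from ipv4.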

In the presented example, we do not take into consideration that the ethernet header might be invalid. As a result, a static dataflow analysis would remove half of the transitions and lead to a graph close to the proposed one. However, on a larger program, a static dataflow analysis would not be as efficient as the analysis proposed in this work. As an example, a switch might have to process IPv4 and IPv6 packets, but both protocols would never be present in the same packet. Also, if a switch's output ports have different types of physical interfaces, such as a Wi-Fi port and an Ethernet port, the Ethernet header could be replaced with another one depending on the output port. In such cases, correctly determining that both protocol headers are never valid at the same time requires a symbolic analysis.
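The IPv4/IPv6 case can be made concrete with a short sketch. It assumes that the static dataflow analysis tracks each validity bit independently, which is a simplification made for illustration; the header names and the two HVVs are hypothetical.

```python
from itertools import product

HEADERS = ["ethernet", "ipv4", "ipv6"]

# Hypothetical result of the symbolic analysis: IPv4 or IPv6, never both.
hvvs = {
    frozenset({"ethernet", "ipv4"}),
    frozenset({"ethernet", "ipv6"}),
}


def per_header_summary(vectors):
    """Per-header view: for each header, the validity values observed in any vector."""
    return {h: {h in v for v in vectors} for h in HEADERS}


def concretize(summary):
    """All validity vectors compatible with a per-header summary."""
    vectors = set()
    for bits in product(*(sorted(summary[h]) for h in HEADERS)):
        vectors.add(frozenset(h for h, bit in zip(HEADERS, bits) if bit))
    return vectors


summary = per_header_summary(hvvs)
approximation = concretize(summary)
both = frozenset({"ethernet", "ipv4", "ipv6"})
print(both in hvvs, both in approximation)
```

The per-header summary only records that ipv4 and ipv6 may each be valid or invalid, so it also admits the vector in which both are valid, whereas the set of HVVs preserves the fact that they are mutually exclusive.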

7 Methodology and Results

7.1 Evaluating Graph Optimization

As explained in Section 4.2, FPGA implementations have the potential to benefit the most from deparser graph optimization. In this work, we focus on improving FPGA resource usage. To evaluate our solution, we use the architecture and the Python code¹ proposed by Luinaud et al. [17]. The Python code relies on the NetworkX² library to analyze the deparser DAG and to generate a deparser architecture. Consequently, we generate the deparser DAG of a P4 program as a JSON file readable by NetworkX. The DAG is generated by the P4 compiler (P4c) [20], for which we developed a specific backend to generate the JSON file.³ Finally, we modified the Python code of Luinaud et al. to read the DAG from the generated JSON file and convert it to VHDL. The Python code also requires the output bus width of the deparser to be specified.

7.2 Presentation of Test Cases

We applied the proposed optimization algorithm to the following programs⁴:

– T0: Ethernet, tunnel, IPv4, TCP
– T1: Ethernet, IPv4/IPv6, TCP/UDP
– T2: Ethernet, IPv4/IPv6, TCP/UDP, ICMP/ICMPv6
– T3, T4, T5, and T6: Ethernet, 2 \(\times\) VLAN, 2 \(\times\) MPLS, IPv4/IPv6, TCP/UDP, ICMP/ICMPv6

T0 corresponds to the example program presented in this article. T1, T2, and T3 are extracted from [17].

Programs T0, T1, T2, and T3 are used to show the impact of increasing the number of headers. T4 has a deparser that emits the same headers as T3 but in a different order, which lets us evaluate the impact of the order of header emission. In T4, the parser always parses the Ethernet header; after it, the parser can parse any combination of the remaining headers. The compilation results of this program therefore show the impact of forcing one header (Ethernet) to always be emitted while all combinations of the other headers remain possible. Programs T5 and T6 have the same parser and deparser as T4. In T5, the processing part forces the UDP header to be valid, while in T6 the processing part forces the UDP header to be invalid. Hence, T5 lets us evaluate the influence of forcing one additional header to always be emitted, and T6 the effect of never emitting one header.

7.3 Results

To measure the reduction in FPGA resource usage, each program was compiled and implemented with and without optimization. Compilation results are presented in Table 3, which gives, for each program, the number of nodes and paths of the deparser DAG. The specialized versions correspond to the results obtained using the optimization proposed in this article; the unspecialized versions are compilation results without optimization. The compilation results show that the number of paths decreases significantly between the unspecialized and the specialized versions. For T3, the number of paths is divided by more than 15 (from 2,048 to 129). For T6, the optimized DAG contains 12 nodes instead of the 13 of the unspecialized version, which corresponds to the fact that the UDP header is never emitted.

Program		T0	T1	T2	T3	T4	T5	T6
Specialized	# Nodes	6	7	9	13	13	13	12
Specialized	# Paths	6	7	9	129	1,024	512	512
Unspecialized	# Nodes	6	7	9	13	13	13	13
Unspecialized	# Paths	16	32	128	2,048	2,048	2,048	2,048

Table 3. Deparser Compilation Results of P4 Programs with and Without Specialization
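Several path counts in Table 3 can be recovered with a simple enumeration. In a transitively closed chain, each path from start to end corresponds to one choice of which headers are emitted, so counting paths amounts to counting admissible validity vectors. The sketch below assumes this correspondence and uses hypothetical header names for the 11 headers of T4, T5, and T6; it does not explain the 129 paths of T3, which depend on the actual parser of that program.

```python
from itertools import product

# Hypothetical names for the 11 headers of T4, T5, and T6.
HEADERS = [
    "ethernet", "vlan0", "vlan1", "mpls0", "mpls1",
    "ipv4", "ipv6", "tcp", "udp", "icmp", "icmpv6",
]


def count_emit_paths(headers, forced=None):
    """Count validity vectors, i.e. start-to-end paths of the closed chain.

    `forced` maps a header name to the validity value it always has.
    """
    forced = forced or {}
    count = 0
    for bits in product([False, True], repeat=len(headers)):
        vector = dict(zip(headers, bits))
        if all(vector[h] == value for h, value in forced.items()):
            count += 1
    return count


unspecialized = count_emit_paths(HEADERS)
t4 = count_emit_paths(HEADERS, {"ethernet": True})
t5 = count_emit_paths(HEADERS, {"ethernet": True, "udp": True})
t6 = count_emit_paths(HEADERS, {"ethernet": True, "udp": False})
print(unspecialized, t4, t5, t6)
```

Each header whose validity is fixed, whether always valid or never valid, halves the number of paths. A header that is never valid also disappears from the graph, which is consistent with T6 having one node fewer than T5 for the same number of paths.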

The generated DAGs were then implemented for a Xilinx Ultrascale+ (model xcvu3p) FPGA using Xilinx Vivado 2019.1, with a 64-bit, a 256-bit, and a 512-bit output bus. Implementation results are presented in Table 4. The column labeled “unspecialized program” corresponds to the implementation results produced using the original, commonly used P4 compiler [20]. These results are used as the reference to quantify the benefits of the proposed optimization method. The column labeled “specialized program” corresponds to the implementation results obtained after applying the optimizations proposed in this article. The column labeled “gain” shows the gain of the specialized implementation over the unspecialized one.

Bus width	Code	Unspecialized Program			Specialized Program			Gain
Bus width	Code	LUTs	FFs	Freq. (MHz)	LUTs	FFs	Freq.(MHz)	LUTs	FFs	Freq.
64 bits	T0	836	441	699	474	374	740	76 %	17.9 %	5.5 %
	T1	1,464	393	740	614	397	740	138 %	\(-\)1 %	0 %
	T2	1,681	393	680	718	402	740	134 %	\(-\)2.2 %	8.1 %
	T3	2,316	399	694	1,053	375	666	120 %	6.4 %	\(-\)4.2 %
	T4	2,355	399	653	1,241	372	699	90 %	7.3 %	6.6 %
	T5	—	—	—	1,124	374	746	110 %	6.7 %	12.5 %
	T6	—	—	—	1,116	373	671	111 %	7 %	2.7 %
		(“—” indicates same results as previous line)					Average	111 %	6 %	4.5 %
256 bits	T0	1,613	1,353	684	1,293	1,294	671	24.8 %	4.6 %	\(-\)1.9 %
	T1	2,633	1,493	662	1,557	1,373	684	69.1 %	8.7 %	3.2 %
	T2	3,907	1,575	662	1,514	1,378	662	158.1 %	14.3 %	0 %
	T3	5,526	1,586	666	3,452	1,507	662	60.1 %	5.2 %	\(-\)0.6 %
	T4	6,154	1,597	666	4,077	1,568	680	51 %	1.9 %	2.1 %
	T5	—	—	—	4,195	1,572	675	47 %	1.6 %	1.3 %
	T6	—	—	—	3,714	1,519	662	66 %	5.1 %	\(-\)0.6 %
							Average	68 %	6 %	0.5 %
512 bits	T0	5,992	2,635	625	4,824	2,525	625	24.2 %	4.4 %	0 %
	T1	6,732	2,908	625	5,489	2,690	628	22.7 %	8.1 %	0.5 %
	T2	8,395	3,114	653	5,552	2,693	625	51.2 %	15.6 %	\(-\)4.5 %
	T3	12,415	3,302	632	9,400	3,130	609	32.1 %	5.5 %	\(-\)3.8 %
	T4	12,165	3,260	632	9,811	3,178	632	24 %	2.6 %	0 %
	T5	—	—	—	8,626	3,064	625	41 %	6.4 %	\(-\)1.1 %
	T6	—	—	—	7,911	2,981	625	53.8 %	9.4 %	\(-\)1.1 %
							Average	35.6 %	7.4 %	\(-\)1.4 %

Table 4. Deparser Implementation Results of P4 Programs with and Without Specialization
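The gain columns of Table 4 can be reproduced from the raw values. The formula below is inferred from the table rather than stated in the text: the reported figures are consistent with a difference normalized by the specialized result, with the sign chosen so that a positive gain is an improvement. The function names are illustrative.

```python
def gain_percent(unspecialized, specialized, higher_is_better=False):
    """Gain as it appears to be computed in Table 4 (inferred, not stated in the text).

    The difference is normalized by the specialized value. For resources (LUTs, FFs)
    lower is better; for clock frequency higher is better.
    """
    if higher_is_better:
        return 100.0 * (specialized - unspecialized) / specialized
    return 100.0 * (unspecialized - specialized) / specialized


def reduction_percent(unspecialized, specialized):
    """Relative reduction with respect to the unspecialized value, for comparison."""
    return 100.0 * (unspecialized - specialized) / unspecialized


# T0 with a 64-bit output bus (values taken from Table 4).
lut_gain = gain_percent(836, 474)
ff_gain = gain_percent(441, 374)
freq_gain = gain_percent(699, 740, higher_is_better=True)
lut_reduction = reduction_percent(836, 474)
print(round(lut_gain), round(ff_gain, 1), round(freq_gain, 1), round(lut_reduction, 1))
```

With this normalization, a LUT gain of 76% means that the unspecialized deparser needs 76% more LUTs than the specialized one, which corresponds to a reduction of about 43% relative to the unspecialized design. A gain above 100% means that the LUT count is more than halved.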

When comparing results obtained with 64-bit, 256-bit, and 512-bit output buses, it can be seen that the output bus width has an important impact on resource usage. However, the width of the output bus also determines the maximum throughput that can be achieved at a given clock frequency. As an example, for the specialized implementation of T3, the maximum clock frequency is 666 MHz with a 64-bit output bus, 662 MHz with a 256-bit output bus, and 609 MHz with a 512-bit output bus. This leads to a maximum throughput of 42 Gb/s with a 64-bit output bus, 169 Gb/s with a 256-bit output bus, and 311 Gb/s with a 512-bit output bus. This increase in throughput comes at the cost of increased resource usage.
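The throughput figures above follow from multiplying the bus width by the clock frequency, assuming that the deparser outputs one full bus word per clock cycle. The sketch below reproduces the three values for the specialized T3 implementation; the function name is illustrative.

```python
def max_throughput_gbps(bus_width_bits, clock_mhz):
    """Peak throughput in Gb/s, assuming one full bus word is output per clock cycle."""
    return bus_width_bits * clock_mhz / 1000.0


# Specialized T3: (bus width in bits, maximum clock frequency in MHz), from Table 4.
t3_specialized = [(64, 666), (256, 662), (512, 609)]
throughputs = [max_throughput_gbps(width, freq) for width, freq in t3_specialized]
for (width, freq), gbps in zip(t3_specialized, throughputs):
    print(f"{width}-bit bus at {freq} MHz: {gbps:.1f} Gb/s")
```

The values quoted in the text correspond to these results truncated to whole Gb/s.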

The results presented in Table 4 show that the deparser DAG specialization mainly impacts LUT consumption. With output buses of 64 and 256 bits, the average FF gain is 6%. In comparison, LUT usage with a 64-bit output bus is divided by more than 2 on average (average gain of 111%). With a 256-bit output bus, the average LUT gain is 68%, that is, the unspecialized implementations use on average 68% more LUTs than the specialized ones. With a 512-bit output bus, the average gain is 7% for FFs and 35% for LUTs. For the clock frequency, there is no clear trend between the specialized and unspecialized implementations; however, both versions achieve frequencies over 600 MHz. As a result, we focus on LUT consumption in the remainder of the result analysis.

The programs T0, T1, T2, and T3 evaluate the impact on resource consumption of increasing the number of headers to emit. It can be seen from Table 4 that even with small deparsers, specialization has a sizable impact on LUT usage. As an example, for T0, the unspecialized deparser implementation uses more than 75% more LUTs than the specialized version with an output bus of 64 bits, and almost 25% more with output buses of 256 and 512 bits.

Also, when comparing the results of T2 without specialization and T3 with specialization, it can be seen that the number of nodes impacts the resource usage. While the unspecialized version of T2 has almost the same number of paths as the specialized version of T3 (128 versus 129), the LUT usage is divided by more than 3 with output buses of 64 and 256 bits. With an output bus of 512 bits, the LUT usage is divided by more than 2. This impact of the number of nodes is also shown when comparing the implementation results of T5 and T6 with output buses of 256 and 512 bits. In both cases, the LUT usage of T6 is about 10% lower than the LUT usage of T5.

In addition, when comparing the LUT usage between the specialized and unspecialized programs for T4, it can be seen that the number of paths also influences resource usage. Indeed, the specialized version of T4 has half the number of paths found in a non-optimized DAG with the same number of nodes. With a 64-bit output bus, the number of LUTs is almost divided by 2, while with a 256-bit output bus, the optimized version requires two-thirds of the LUTs compared to the non-optimized case.

In Table 5, we compare the deparser implementations of T2 and T3 with other deparser implementations, one generated with Xilinx SDNet [24] and one proposed by Benáček et al. [3]. The deparser generated with Xilinx SDNet consumes more than \(10 \times\) more LUTs and more than \(40 \times\) more FFs than the deparser generated in this work. The Xilinx SDNet solution also requires BRAMs, whereas ours uses none. Compared with the deparser proposed by Benáček et al. [3], our solution consumes about \(10 \times\) fewer slices while achieving more than \(2 \times\) their throughput.

Code	Work	Slices	LUTs	FFs	BRAMs	Throughput
T2	This work	2.2 k	5.6 k	2.7 k	0	320 Gb/s
	Xilinx SDNet	N/A	98 k	119 k	149.5	240 Gb/s
	Benáček et al. [3]	20 k	N/A	N/A	N/A	120 Gb/s
T3	This work	2.5 k	9.4 k	3.2 k	0	310 Gb/s
	Xilinx SDNet	N/A	139 k	165 k	229.5	240 Gb/s
	Benáček et al. [3]	24 k	N/A	N/A	N/A	120 Gb/s

Table 5. Comparison of Deparser Implementation with Previous Work with an Output Bus of 512 Bits

8 Related Work

8.1 Program Specialization

Coen-Porisini et al. proposed a method to perform symbolic execution of programs written in the Ada programming language. The article focuses on code reusability through function specialization, and shows that symbolic execution provides more semantic information [10]. Schultz et al. demonstrated that, in the case of Java applications, program specialization can improve performance without modification of the source code [21]. Bubel et al. used both symbolic execution and partial evaluation to optimize Java program execution [8]. Brady and Hammond looked at partial evaluation in the case of an embedded DSL. They showed that partial evaluation can help generate efficient interpreters that match the performance of compiled code [7]. Herrmann and Langhammer proposed using partial evaluation to accelerate DSL evaluation. Like Brady and Hammond, they were able to generate DSL interpreters with performance close to that of optimized compiled code [15]. Cui et al. proposed a tool that uses symbolic execution to optimize the computation of matrices [12].

While these works demonstrate the efficiency of program specialization, they do not cover the specific requirements of configurable data planes. Also, they focus on the generation of specialized software while in this work we focus on the optimization of hardware generation through program specialization.

8.2 P4 Program Analysis

Birnfeld et al. performed a data-flow analysis similar to the one presented in this work. They use data-flow analysis to uncover potential bugs in P4 programs [4]. In our work, we propose to use the same process, but to optimize a deparser. We also integrate the analysis into a compiler. Cabal et al. proposed an architecture to implement deparsers on FPGAs. However, they do not explore compiler optimization for the architecture [9]. Dangeti et al. proposed using LLVM to compile P4 programs, which facilitates the use of optimization passes already available in the LLVM compiler [13]. However, none of these previous works proposes compiler passes specific to programmable data plane programs. Finally, Abhashkumar et al. [2] and Wintermeyer et al. [23] proposed optimizations through high-level analysis. While Abhashkumar et al. proposed defining dedicated specifications for each application, Wintermeyer et al. derived suitable constraints through the analysis of network traces. The solutions proposed in [2, 23] require analyzing the network, while in our work only the input program is needed to optimize the implementation. As a result, the solutions presented by Abhashkumar et al. and Wintermeyer et al. could be used to complement the optimizations proposed in our work.

9 Conclusion

In this article, we proposed the specialization of P4 programs using symbolic analysis and partial evaluation by using specific properties of the P4 language. The specialization of P4 programs has been discussed to determine which P4 elements can benefit the most from the specialization. In addition, we proposed a deparser DAG transformation which has been combined with program specialization to reduce the implementation cost of deparsers found in programmable switches. This reduction has been demonstrated by compiling and synthesizing several P4 programs. Furthermore, we discussed the implementations of transformed deparser graphs on different targets and showed that highly parallel platforms can benefit the most from this transformation. Finally, we have demonstrated that the proposed specialization algorithm based on symbolic analysis can generate optimal deparser DAGs.

Our work enables future P4 optimizations. The symbolic analysis could be improved to take the computed information into consideration when evaluating conditions. The proposed code optimization methods could be integrated into other compilers, such as the one proposed by Dangeti et al. [13]. They could also be extended to P4 program verification at compile time and after compilation.

Finally, as part of this work, we envisioned a language modification that could be of interest for future revisions of P4. The P4 header type provides three methods that operate on the header validity field. However, two other functions defined in P4, extract and emit, also interact with the validity field. It could be of interest to require those functions to explicitly use the header built-in methods. This would allow future compiler passes to focus only on the header built-in methods.

Acknowledgments

The authors thank Jeferson Santiago da Silva and Thibaut Stimpfling for their insightful comments.

Footnotes

¹ https://github.com/luinaudt/deparser/tree/FPGA_paper.

² https://networkx.org/.

³ https://github.com/luinaudt/p4c/tree/paperTACO.

⁴ Sources and results: https://github.com/luinaudt/P4CompilerOptimizationTest/.

References

[1] 2021. Intel® P4 Suite. Retrieved 15th October, 2021 from https://www.intel.com/content/www/us/en/products/network-io/programmable-ethernet-switch.html.

