When an accelerable pattern is found, the matched IR code should be replaced with the corresponding API call provided by the CIM accelerator. In this section, we describe the details of code conversion, including recognizing key variables, resolving data dependencies, and constructing and inserting API calls.
5.2 Recognizing Key Variables
When an accelerable pattern is found, we must identify several key variables within it to complete the code conversion. These variables are related to the interface provided by the CIM accelerator, including the starting address and the size of the matrix or vector involved in the accelerable pattern. With the starting address and the data size, the CRE can read the data from main memory and map it to ReRAM crossbar arrays. When the CIM accelerator completes the computation, the CRE also writes the output result back to main memory.
For matrix-vector multiplications (MVMs) and matrix-matrix multiplications (MMMs), we can format both operations uniformly as \(C[m\times n] = A[m\times k]\times B[k\times n]\).
For an MVM, the variable m is 1, and thus we only need to find the two remaining variables, n and k. An MVM is performed in a two-level nested loop, while an MMM is performed in a three-level nested loop. For an MMM operation, the variable m can be found in the condition block of the outermost loop.
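For reference, the loop structures that give rise to these patterns can be sketched in C as follows (the row-major flattened layout is illustrative):

```c
/* MMM: C[m][n] = A[m][k] * B[k][n], a three-level nested loop.
   The variable m appears in the condition of the outermost loop. */
void mmm(int m, int n, int k, const int *A, const int *B, int *C) {
    for (int i = 0; i < m; i++)            /* outermost loop: RangeVar = m */
        for (int j = 0; j < n; j++) {      /* middle loop: RangeVar = n */
            C[i * n + j] = 0;
            for (int p = 0; p < k; p++)    /* innermost loop: RangeVar = k */
                C[i * n + j] += A[i * k + p] * B[p * n + j];
        }
}

/* MVM is the special case m = 1: a two-level nested loop. */
void mvm(int n, int k, const int *a, const int *B, int *y) {
    mmm(1, n, k, a, B, y);
}
```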
The dimension of a vector (\(V_{dim}\)) can be easily found in loop condition blocks. In general, we can perform a head-to-tail traversal or a tail-to-head traversal to infer the dimension of a vector. In the loop control statement, the loop control variable (i) is initialized, tested, and changed as the loop executes. We denote its initial value as LoopVarInit and the range control value as RangeVar. Both necessarily appear in the loop.cond block.
For the head-to-tail traversal, the initial value of i (LoopVarInit) is usually 0 and should be no more than RangeVar. When the comparison symbol is \(\lt\) or \(\le\), the dimension of the vector is RangeVar or \(RangeVar + 1\), respectively. For the tail-to-head traversal, LoopVarInit is usually larger than or equal to RangeVar. Thus, the dimension of the vector becomes \(LoopVarInit + 1\) regardless of the comparison symbol (\(\gt\) or \(\ge\)).
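These inference rules can be captured in a small helper (a sketch; infer_dim is a hypothetical name, not part of the CRE interface):

```c
#include <string.h>

/* Infer V_dim from the comparison symbol found in loop.cond.
   Head-to-tail: "<"  -> RangeVar,  "<=" -> RangeVar + 1.
   Tail-to-head: ">" or ">=" -> LoopVarInit + 1. */
long infer_dim(const char *cmp, long loop_var_init, long range_var) {
    if (strcmp(cmp, "<") == 0)
        return range_var;
    if (strcmp(cmp, "<=") == 0)
        return range_var + 1;
    return loop_var_init + 1;   /* ">" or ">=" */
}
```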
Fortunately, the loop condition block can be easily recognized in the IR. Although the IR can be generated from source code or a binary executable, the IR blocks corresponding to the loop control statement are almost the same. By searching for the “icmp” instruction, we can recognize the jump condition and the RangeVar. For example, in Pattern 1, “<cmp_1>” in the block loop.condA represents the comparison statement, and [v_1] represents the RangeVar. LoopVarInit can be found in the statement containing a load instruction, which uses LoopVarInit to initialize i, i.e., [r_1] represents LoopVarInit. Generally, the dimension calculated in the first loop condition block (e.g., loop.condA in Pattern 1) corresponds to the size of the input vector, while the dimension calculated in the second loop condition block (e.g., loop.condC in Pattern 1) corresponds to the number of columns or rows of the matrix.
To complete an MVM, the elements of the input vector and a row of the matrix should be accessed alternately. For convenience, we use the starting address of the vector/matrix as its identifier. For a single element of a vector, e.g., a[i], its address can be calculated as \(starting\_add + offset\), where starting_add is the starting address of the vector a, and offset equals the loop control variable (i) multiplied by the size of the data type.
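The element address computation corresponds directly to the following sketch:

```c
#include <stdint.h>
#include <stddef.h>

/* Address of a[i]: starting_add + i * data_size */
uintptr_t vec_elem_addr(uintptr_t starting_add, long i, size_t data_size) {
    return starting_add + (uintptr_t)i * data_size;
}
```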
As shown in Pattern 2, the variable %input_addr in line 15 represents the starting address of the input vector, and %a_1 represents the offset. However, the positions of these two variables may not be fixed in the IR. Fortunately, we find that %input_addr only appears in the innermost loop body block, while the loop control variable %a_1 is also involved in the loop control block. Thus, we can still distinguish these two variables by tracking their positions in different blocks. Due to the static single assignment (SSA) form of the IR, a variable may correspond to several variants with different names, but they have the same value. Thus, we construct a one-to-many mapping between the original loop control variable and its variants, and associate the starting address of the vector with its dimension, which corresponds to the loop control variable.
For each element of a matrix, e.g., m[i][j], its address can also be calculated as \(starting\_add + offset\). However, unlike the offset in a one-dimensional vector, an additional add operation is required to calculate the offset here, i.e., \(addr = first\_addr + i\times col\_size \times data\_size + j\times data\_size\). Since only two variables can be added in each addition operation, two add instructions are required in total, as shown in the block_D of Pattern 2. Therefore, to find the starting address and the dimension of the matrix, we also need to identify and distinguish these three variables by checking whether each variable appears in the loop control block. Since the loop control variable (i) corresponding to the rows of the matrix is involved in one more multiplication than the loop control variable (j), it is possible to distinguish the two variables that correspond to the rows and columns of the matrix. Once the dimensions of the input vector and the matrix are determined, we can easily deduce the dimension of the output vector. However, its starting address should still be determined in the same way as above.
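The two-add address computation can be sketched as follows; note that the row offset involves one more multiplication than the column offset, which is what distinguishes the two loop control variables:

```c
#include <stdint.h>
#include <stddef.h>

/* Address of m[i][j]: first_addr + i*col_size*data_size + j*data_size */
uintptr_t mat_elem_addr(uintptr_t first_addr, long i, long j,
                        long col_size, size_t data_size) {
    uintptr_t row_off = (uintptr_t)(i * col_size) * data_size; /* two muls */
    uintptr_t col_off = (uintptr_t)j * data_size;              /* one mul */
    return first_addr + row_off + col_off;   /* two add instructions */
}
```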
To support general GEMV and GEMM, we need to further identify the coefficients applied to matrices and vectors, such as \(\alpha \times a[m] \times B[m \times n] + \beta \times c[n]\). The \(\beta\) appears in the penultimate loop body block and multiplies each element of the output vector, while the \(\alpha\) appears in the innermost loop body block and multiplies the result of the MAC.
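The coefficient placement can be illustrated with a plain GEMV loop (a sketch; applying \(\alpha\) per term inside the innermost body is mathematically equivalent to scaling the final MAC result):

```c
/* c[n] = alpha * (a[m] * B[m][n]) + beta * c[n], B in row-major layout */
void gemv(int m, int n, double alpha, const double *a,
          const double *B, double beta, double *c) {
    for (int j = 0; j < n; j++) {
        c[j] *= beta;                               /* penultimate body: beta */
        for (int i = 0; i < m; i++)
            c[j] += alpha * a[i] * B[i * n + j];    /* innermost body: alpha */
    }
}
```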
For a bitmap logical operation, the variables of the two bitmaps can be easily found in the loop body of a single-level loop, as shown in Pattern 6. Then, we can infer the dimension of the bitmap from the loop condition block (i.e., the RangeVar).
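A typical instance of this single-level loop pattern looks like the following sketch:

```c
#include <stdint.h>

/* Bitmap AND: the two bitmap base addresses appear in the loop body,
   and the RangeVar (words) gives the bitmap dimension. */
void bitmap_and(const uint64_t *a, const uint64_t *b,
                uint64_t *out, long words) {
    for (long i = 0; i < words; i++)   /* loop.cond: i < words */
        out[i] = a[i] & b[i];
}
```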
In summary, we can recognize the dimensions of the input vector and the matrix from the loop control blocks, and recognize their starting addresses from the innermost loop body block. Moreover, the whole loop structure is also used to establish the mapping between the starting address and the dimension of the input vector or matrix. Although these variables can be easily identified in the IR code, the dimension of a matrix or vector is usually only known at runtime. Thus, we still need code instrumentation techniques to track these variables at runtime, and then determine whether the MVM/MMM operation should be run on the CIM accelerator.
5.4 Code Instrumentation
To accelerate IR patterns described in Section
4, we need to replace these code snippets with newly generated CIM function calls. For library functions of MVM and MMM, we can simply replace the original function call. However, we cannot do this for other accelerable patterns. Since the LLVM IR is in SSA form, some variables defined in an IR block may be referenced by instructions in other blocks. Therefore, if we directly delete the accelerable IR blocks, the program may fail to compile or produce incorrect results. In this case, to guarantee the correctness of the program, we would have to modify all instructions related to the deleted variables. This is time-consuming and difficult to automate in the IR.
Our compilation tool replaces the accelerable IR blocks using a smarter approach. Figure
8 illustrates an example of code conversion in the IR. We construct and add a new block, and seamlessly change the original execution logic. We do not delete any instructions in the accelerable IR blocks, so that subsequent instructions referencing their variables can still execute without errors. However, we change the execution flow to skip the original accelerable block (i.e.,
for.body3), and jump to the newly added block (i.e.,
block_mvm_0) in which the MVM operation is offloaded to the CIM accelerator. We place the newly added block just before the innermost loop body block (i.e.,
for.body3), because the loop control variables have been defined and initialized at this time. Thus, we can fully utilize these original variables to determine the arguments of the
cim_mvm function. When the function call returns, it directly jumps to the
for.end12 block and the loop is finished.
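At the C level, the rewritten control flow is roughly equivalent to the following sketch (cim_mvm is modeled in software here; the real call offloads the computation to the CIM accelerator):

```c
/* Software stand-in for the accelerator call */
static void cim_mvm(const int *a, const int *B, int *y, int k, int n) {
    for (int j = 0; j < n; j++) {
        y[j] = 0;
        for (int i = 0; i < k; i++)
            y[j] += a[i] * B[i * n + j];
    }
}

/* Converted loop: the inserted block runs just before the innermost
   body, where the loop variables are already defined and initialized,
   then jumps past the loop. The original body is kept, not deleted,
   but is never executed. */
void mvm_converted(const int *a, const int *B, int *y, int k, int n) {
    for (int j = 0; j < n; j++) {
        /* block_mvm_0 */
        cim_mvm(a, B, y, k, n);
        goto for_end;                    /* skip for.body3 -> for.end12 */
        /* for.body3 (bypassed) */
        for (int i = 0; i < k; i++)
            y[j] += a[i] * B[i * n + j];
    }
for_end:
    return;
}
```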
Since there may be more than one accelerable pattern of the same type in a single function, we give each newly added block a unique name to avoid conflicts. Moreover, since some patterns may not benefit from the CIM accelerator, we add instructions at the beginning of the newly added block to calculate the net benefit according to the sizes of the input vectors and matrices, and then determine whether the program continues to execute the CIM function call or jumps back to execute the loop on the CPU. In this way, we achieve conditional computation offloading based on the performance model.
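The net-benefit check at the head of the inserted block can be sketched as a simple cost comparison; the cost parameters below are illustrative assumptions, not measured values from our performance model:

```c
/* Offload only when the estimated CIM time (fixed mapping/write-back
   overhead plus per-element transfer cost) beats the CPU loop time. */
int should_offload(long m, long n, long k,
                   double cpu_ns_per_mac,
                   double cim_fixed_ns, double cim_ns_per_elem) {
    double cpu_time = (double)(m * n * k) * cpu_ns_per_mac;
    double cim_time = cim_fixed_ns
                    + (double)(m * k + k * n + m * n) * cim_ns_per_elem;
    return cim_time < cpu_time;   /* 1: offload; 0: run the loop on CPU */
}
```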
Limitation. Although our compilation tool can automatically translate legacy programs into binary executables that offload accelerable code to the CIM accelerator, it still has some limitations. For example, in several application domains such as graph processing and genome analysis, the original algorithm usually requires additional transformations into MVM, MMM, or bitmap logical operations. Such cases cannot be easily supported by our compilation tool. However, since matrices and vectors are widely used in most deep learning and database applications, our compilation tool naturally supports these applications for automatic code migration.