2.1 CNN Computational Model
A CNN is a computational model [
2], commonly represented as a directed acyclic computational graph
\(CNN (L, E)\) with a set of nodes
\(L\), also called
layers, and a set of edges
\(E\). An example of a CNN model with
\(|L|\) = 5 layers and
\(|E|=\) 5 edges is given in Figure
1(a). Every layer
\(l_i \in L\) represents part of the CNN functionality. It performs operator
\(op_i\) (convolution, pooling, etc.), parameterized with hyper-parameters
\(hyp_i\) (kernel size, stride, borders processing mode, etc.), and learnable parameters
\(par_i\) (weights and biases, etc.). We define a layer as a tuple
\(l_i = (op_i, hyp_i, par_i)\), where
\(op_i\) is the operator of
\(l_i\),
\(hyp_i\) are the hyper-parameters of
\(l_i\), and
\(par_i\) is a list of multi-dimensional arrays, called
tensors [
2], where every tensor
\(par_{ik}\in par_i\) stores a set of learnable parameters (weights or biases) of layer
\(l_i\). An example of a CNN layer
\(l^1_2=(Conv, \lbrace ks:5, s:1, bm: same\rbrace ,\lbrace [8,3,5,5], [8]\rbrace)\) is shown in Figure
1(a). Layer
\(l^1_2\) performs convolutional operator
\(op^1_2=Conv\), parameterized with three hyper-parameters (kernel size
\(ks\) = 5, stride
\(s\) = 1, and borders processing mode
\(bm\) =
\(same\)) and parameters
\(par^1_2=\lbrace [8,3,5,5], [8]\rbrace\), where
\([8,3,5,5]\) is a four-dimensional tensor of the layer weights and
\([8]\) is a one-dimensional tensor of the layer biases.
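To make the layer definition above concrete, the following Python sketch (illustrative only; the class and field names are our own and not part of the formal model) encodes a layer \(l_i = (op_i, hyp_i, par_i)\) and instantiates layer \(l^1_2\) from Figure 1(a).

```python
from dataclasses import dataclass

@dataclass
class Layer:
    """A CNN layer l_i = (op_i, hyp_i, par_i)."""
    op: str     # operator, e.g., "Conv" or "Pool"
    hyp: dict   # hyper-parameters, e.g., kernel size, stride, border mode
    par: list   # shapes of the learnable parameter tensors (weights, biases)

# Layer l^1_2 of Figure 1(a): convolution with kernel size 5, stride 1,
# "same" border mode, an [8, 3, 5, 5] weight tensor, and an [8] bias tensor.
l1_2 = Layer(op="Conv",
             hyp={"ks": 5, "s": 1, "bm": "same"},
             par=[[8, 3, 5, 5], [8]])
```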
Every edge
\(e_{ij} \in E\) specifies a data dependency between layers
\(l_i\) and
\(l_j\) such that data produced by layer
\(l_i\) is accepted as an input by layer
\(l_j\). We define an edge as a tuple (
\(i, j, data\)), where
\(i\) and
\(j\) are the indexes of the layers connected by edge
\(e_{ij}\);
\(data\) is the data exchanged between layers
\(l_i\) and
\(l_j\) and stored in a tensor of shape [
\(batch, Ch, H, W\)], where
\(batch, Ch, H, W\) are the tensor batch size [
2], the number of channels, the height and the width, respectively. An example of edge
\(e^1_{12}=(1,2, [1,3,32,32])\) is shown in Figure
1(a). Edge
\(e^1_{12}\) represents a data dependency between layers
\(l^1_1\) and
\(l^1_2\), where layer
\(l^1_1\) produces a data tensor [1, 3, 32, 32] with batch size = 1, number of channels = 3, and height and width = 32, which is accepted as input by layer
\(l^1_2\).
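Analogously, an edge \(e_{ij} = (i, j, data)\) can be encoded as in the following illustrative Python sketch (names of our choosing), which instantiates edge \(e^1_{12}\) from Figure 1(a).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Edge:
    """A CNN edge e_ij = (i, j, data)."""
    i: int            # index of the producer layer l_i
    j: int            # index of the consumer layer l_j
    data: List[int]   # shape [batch, Ch, H, W] of the exchanged tensor

# Edge e^1_12 of Figure 1(a): layer l^1_1 passes a [1, 3, 32, 32] tensor
# (batch size 1, 3 channels, height 32, width 32) to layer l^1_2.
e1_12 = Edge(i=1, j=2, data=[1, 3, 32, 32])
```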
2.2 Parallelism Available within a CNN
As a computational model, a CNN is characterized by a large amount of available parallelism. This parallelism can be exploited to speed up CNN inference and to efficiently utilize the computational resources of the platform where the CNN is deployed. The most well-known and widely exploited type of parallelism available within CNNs is
data-level parallelism. This type of parallelism involves performing the same computation, such as convolution, over different partitions of a CNN layer's input data. It allows speeding up CNN inference by accelerating the execution of individual CNN layers on parallel processors such as
Graphics Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs). The data-level parallelism available within CNNs is exploited by most of the existing DL frameworks, such as TensorFlow [
1] or PyTorch [
22].
Another type of parallelism available within a CNN is known as
task-level parallelism or
pipeline parallelism [
18,
31] among CNN layers. This type of parallelism is related to the streaming nature of CNN-based applications, where the application accepts different input frames (images) from an input data stream. When a CNN is executed on a platform with multiple processors, the frames from the input data stream can be processed in a pipelined fashion by different layers of the CNN deployed on different processors. Figure
2 shows an example where
\(CNN^2\), introduced and explained in Section
2.1, is executed in a pipelined fashion on a platform with two processors: a
Central Processing Unit (CPU) and a GPU.
The layers of
\(CNN^2\), representing computations within the CNN, are distributed over the platform processors: layers
\(l^2_1\) and
\(l^2_2\) are executed on the GPU, whereas layers
\(l^2_3\) and
\(l^2_4\) are executed on the CPU. These layers form two CNN sub-graphs also referred to as
partitions [
18,
31], annotated as
\(P_2\) and
\(P_3\). Partition
\(P_2\) accepts frames from the application input data stream, processes these frames as specified by layers
\(l^2_1\) and
\(l^2_2\), and stores the results into a buffer associated with edge
\(e^2_{23}\). Partition
\(P_3\) accepts the frames processed by partition
\(P_2\) from edge
\(e^2_{23}\), further processes these frames, and produces the output data of
\(CNN^2\). Partitions
\(P_2\) and
\(P_3\) are executed on different processors in the platform and do not compete for the platform computational resources. Thus, when applied to different data (i.e., different frames), the partitions can be executed in parallel. In Figure
2, partitions
\(P_2\) and
\(P_3\) process frames
frame 2 and
frame 1 in parallel. This leads to overlapping execution of layers belonging to different partitions and allows for faster inference of
\(CNN^2\) compared to conventional layer-by-layer execution. However, pipelined CNN execution involves memory overheads. As shown in Figure
2, edge
\(e^2_{23}\) of
\(CNN^2\) is duplicated between the partitions
\(P_2\) and
\(P_3\) (see edges
\(e^{2(1)}_{23}\) and
\(e^{2(2)}_{23}\) and the corresponding buffers). Such duplication, called
double buffering [
13], is necessary for execution of the CNN as a pipeline. It prevents competition between the partitions when accessing data associated with edge
\(e^2_{23}\). If double buffering is not enabled, the CNN partitions compete for access to edge
\(e^2_{23}\), creating stalls in the pipeline and reducing CNN throughput.
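The pipelined execution described above can be illustrated with the following schematic Python sketch. It is not the deployment code of the methodology: run_partition_P2 and run_partition_P3 are placeholders for the computations of partitions \(P_2\) and \(P_3\), and a bounded queue of depth 2 stands in for the double buffering of edge \(e^2_{23}\).

```python
import queue
import threading

# Placeholders for the layer computations of partitions P_2 (l^2_1, l^2_2)
# and P_3 (l^2_3, l^2_4).
def run_partition_P2(frame):
    return f"P2({frame})"

def run_partition_P3(frame):
    return f"P3({frame})"

# A bounded queue of depth 2 models the double buffering of edge e^2_23:
# P_2 may write buffer e^{2(1)}_23 while P_3 still reads buffer e^{2(2)}_23.
e_2_23 = queue.Queue(maxsize=2)
results = []

def producer(frames):
    for frame in frames:
        e_2_23.put(run_partition_P2(frame))   # executed, e.g., on the GPU
    e_2_23.put(None)                          # end-of-stream marker

def consumer():
    while True:
        item = e_2_23.get()
        if item is None:
            break
        results.append(run_partition_P3(item))  # executed, e.g., on the CPU

t1 = threading.Thread(target=producer, args=(["frame1", "frame2", "frame3"],))
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # ['P3(P2(frame1))', 'P3(P2(frame2))', 'P3(P2(frame3))']
```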
2.3 CNN-Based Application
A CNN-based application is an application that requires execution of one or multiple CNNs to perform its functionality. When deployed on a target edge device, a CNN-based application utilizes memory and computational resources of the device to execute the CNNs.
The memory of the edge device is used to store parameters (weights and biases) and intermediate computational results. The platform memory allocated to store the CNNs’ intermediate computational results is typically defined as a set of CNN buffers [
23], where every CNN buffer stores data associated with one or multiple CNN edges and is characterized by its size, which specifies the maximum number of data elements that can be stored in the buffer.
The computational resources of the edge device are utilized to perform the functionality of the CNNs. Typically, the CNNs are executed layer by layer—that is, at every moment in time, only one CNN layer is executed on the edge platform. However, as explained in Section
2.2, some of the applications execute CNNs in a pipelined fashion.
In this work, we formally define a CNN-based application as a tuple \((\lbrace CNN^1,\ldots , CNN^N\rbrace , B, P, J, \lbrace schedule_1,\ldots , schedule_{|P|}\rbrace)\), where \(\lbrace CNN^1,\ldots , CNN^N\rbrace\) are the CNNs utilized by the application; \(B\) is the set of CNN buffers utilized by the application; \(P\) is the set of CNN partitions; and \(J\) is the set that explicitly defines the exploitation of task-level (pipeline) parallelism by the application. Every element \(J_i \in J\) contains one or several CNN partitions. If two CNN partitions \(P_m\) and \(P_x, m \ne x,\) belong to the element \(J_i \in J\), the CNN-based application exploits task-level (pipeline) parallelism among these partitions. Finally, \(schedule_i, i \in [1,|P|],\) is a schedule of partition \(P_i\) that determines the execution order of the layers within partition \(P_i\). Formally, we define \(schedule_i\) as a sequence of steps, where at each step one or several layers of partition \(P_i\) are executed.
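As an illustration, this application tuple can be encoded as a simple data structure, sketched below in Python (the class and field names are our own; the concrete application \(APP\) instantiating this structure is described next).

```python
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class CNNApp:
    """A CNN-based application ({CNN^1,...,CNN^N}, B, P, J, {schedule_i})."""
    cnns: List[str]                       # the CNNs utilized by the application
    buffers: Dict[str, int]               # CNN buffer name -> size in data elements
    partitions: Dict[str, List[str]]      # partition name -> layers it contains
    J: List[Set[str]]                     # partitions grouped for pipelined execution,
                                          # e.g., [{"P1"}, {"P2", "P3"}]
    schedules: Dict[str, List[Set[str]]]  # partition name -> steps (sets of layers)
```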
To illustrate a CNN-based application as defined previously, we give an example of a CNN-based application
\(APP=(\lbrace CNN^1, CNN^2\rbrace\),
\(\lbrace B_1,\ldots , B_9\rbrace\),
\(\lbrace P_1, P_2, P_3\rbrace\),
\(\lbrace \lbrace P_1\rbrace , \lbrace P_2\),
\(P_3\rbrace \rbrace\),
\(\lbrace \lbrace \lbrace l^1_1\rbrace , \lbrace l^1_2\rbrace , \lbrace l^1_3\rbrace , \lbrace l^1_4\rbrace , \lbrace l^1_5\rbrace \rbrace\),
\(\lbrace \lbrace l^2_1\rbrace , \lbrace l^2_2\rbrace \rbrace\),
\(\lbrace \lbrace l^2_3\rbrace , \lbrace l^2_4\rbrace \rbrace \rbrace\)), inspired by the real-world CNN-based application for adaptive image classification proposed by Taylor et al. [
29]. To perform its functionality, application
\(APP\) uses
\(N=2\) CNNs:
\(CNN^1\) and
\(CNN^2\), shown in Figures
1(a) and (b), respectively. During its execution, application
\(APP\) accepts a stream of images, also called
frames, and adaptively selects one of its CNNs (
\(CNN^1\) or
\(CNN^2\)) to perform the image classification of the input frame.
\(CNN^1\) consists of one partition,
\(P_1\).
\(CNN^2\) consists of two partitions,
\(P_2\) and
\(P_3\), executed in a pipelined fashion, as shown in Figure
2 and explained in Section
2.2. The layers within every partition
\(P_i, i \in [1,3]\) of application
\(APP\) are executed sequentially (one by one). This is expressed through
\(schedule_1\),
\(schedule_2\), and
\(schedule_3\) of application
\(APP\). For example,
\(schedule_1=\lbrace \lbrace l^1_1\rbrace , \lbrace l^1_2\rbrace , \lbrace l^1_3\rbrace , \lbrace l^1_4\rbrace , \lbrace l^1_5\rbrace \rbrace\) specifies that the layers within partition
\(P_1\) of application
\(APP\) are executed in five steps, and at the
\(j\)-th step,
\(j\in [1,5]\), layer
\(l^1_j\) is executed.
To store intermediate computational results associated with every edge
\(e^n_{ij}\) of
\(CNN^1\) and
\(CNN^2\), application
\(APP\) uses a set of buffers
\(B\), where every edge
\(e^n_{ij}\) has its own buffer
\(B_k\) of size
\(|e^n_{ij}.data|\). Hereinafter, we refer to such buffer allocation as naive buffer allocation. In total, application
\(APP\) uses
\(|B|=9\) CNN buffers. These buffers are shown in Table
1, where row 1 lists the CNN buffers, row 2 lists the edges that use each buffer to store their associated data, and row 3 lists the sizes of the CNN buffers expressed in numbers of data elements.
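Under the naive buffer allocation described above, the size of each buffer is simply the number of elements of the tensor carried by its edge, as the following small Python sketch (our illustration) computes for edge \(e^1_{12}\).

```python
from math import prod

def naive_buffer_size(data_shape):
    """Size of a naively allocated buffer: batch * Ch * H * W data elements."""
    return prod(data_shape)

# Buffer B_1 stores the data of edge e^1_12 with shape [1, 3, 32, 32],
# i.e., 1 * 3 * 32 * 32 = 3072 data elements.
print(naive_buffer_size([1, 3, 32, 32]))  # 3072
```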
2.5 Data Processing by Parts in the CNN Layers
Many CNN operators are characterized by the ability to process data by parts [
2]. Formally, such ability can be expressed as follows: applying a CNN operator
\(op\) to a data tensor
\(data\) can be represented as a sequence of
\(\Phi\) phases, where at every phase operator
\(op\) is applied to a part
\(data^{\prime }\) of the tensor
\(data\). For example, applying CNN operator
\(conv\) to data tensor [1, 3, 32, 32] associated with edge
\(e^1_{12}\) (shown in Figure
1(a) and explained in Section
2.1) can be represented as a sequence of 32 phases, where at each phase operator
\(conv\) is applied to a part [1, 3, 5, 32] of data [1, 3, 32, 32]. The CNN memory reduction methodology proposed in our earlier work [
17] exploits such data processing by parts to reduce the CNN’s memory cost. In this methodology, every layer
\(l_i\) of a CNN processes data in
\(\Phi _i\) phases. At each phase, layer
\(l_i\) accepts a part of the input data, applies operator
\(l_i.op\) to this part of data, and produces the corresponding part of the output data. Each part of the input and output data of layer
\(l_i\) is characterized by a minimum height. The minimum height of the data parts, as well as the number of phases
\(\Phi _i\) are determined by operator
\(l_i.op\) performed by layer
\(l_i\), hyper-parameters
\(hyp_i\) of layer
\(l_i,\) and the data tensors associated with the input and output edges of layer
\(l_i\). Table
2 shows how the minimum input and output data height and corresponding number of phases are computed for layers performing the most common CNN operators. In Table
2, column 1 lists the most common CNN operators
\(l_i.op\) performed by the CNN layers, columns 2 and 3 show the minimum height of input and output data of layer
\(l_i\), and column 4 shows the number of phases
\(\Phi _i\) performed by layer
\(l_i\). For example, row 2 in Table
2 shows that layer
\(l_i\) performing operator
\(conv\) or operator
\(pool\) can process data in
\(\Phi _i\) phases, where
\(\Phi _i\) is computed as the height of data tensor
\(e_{ij}.data\) produced by layer
\(l_i\). At every phase, layer
\(l_i\) accepts and processes a data part of minimum height
\(H^{in}_{min}\) equal to the layer kernel size
\(hyp_i.ks\) and produces an output data part of height
\(H^{out}_{min}=\)1.
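The conv/pool row of Table 2 discussed above can be paraphrased in Python as follows (an illustrative sketch; the remaining rows of Table 2 follow the same pattern with their own formulas).

```python
def conv_pool_phases(ks, out_height):
    """Phases and minimum data-part heights for a conv/pool layer (Table 2, row 2).

    ks: kernel size hyp_i.ks of layer l_i.
    out_height: height H of the tensor e_ij.data produced by layer l_i.
    """
    H_in_min = ks        # minimum height of the input data part
    H_out_min = 1        # minimum height of the output data part (one row)
    phases = out_height  # Phi_i: one phase per produced output row
    return H_in_min, H_out_min, phases

# Layer l^1_2 (Conv, ks = 5) producing an output tensor of height 32:
print(conv_pool_phases(ks=5, out_height=32))  # (5, 1, 32)
```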
When two layers
\(l_i\) and
\(l_j\) process data by parts, only the part of data exchanged between these layers,
\(e_{ij}.data^{\prime }\), has to be stored in the memory of a target edge device at every moment in time [
17]. The size of the minimum data part
\(e_{ij}.data^{\prime }\) exchanged between layers
\(l_i\) and
\(l_j\) is computed as
\(e_{ij}.data^{\prime }\) = [
\(e_{ij}.batch, e_{ij}.Ch, H^{\prime }, e_{ij}.W\)], where
\(e_{ij}.batch, e_{ij}.Ch,\) and
\(e_{ij}.W\) are the batch size, number of channels, and width of data
\(e_{ij}\);
\(H^{\prime } \le e_{ij}.H\) is computed as follows:
\(H^{\prime } = \max \big (H_{min}^{out}(l_i),\ H_{min}^{in}(l_j)\big),\)
where
\(H_{min}^{out}(l_i)\) is the minimum height of data produced by layer
\(l_i\),
\(H_{min}^{in}(l_j)\) is the minimum height of data accepted as input by layer
\(l_j\), and
\(H_{min}^{out}(l_i)\) and
\(H_{min}^{in}(l_j)\) are determined using Table
2.
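The following Python sketch (our illustration, under the assumption that \(H^{\prime}\) takes the larger of the two minimum heights, as in the formula above) computes the size of the minimum data part \(e_{ij}.data^{\prime}\) for edge \(e^1_{12}\).

```python
from math import prod

def min_part_size(data_shape, h_out_min_li, h_in_min_lj):
    """Number of elements in the minimum data part e_ij.data' between l_i and l_j."""
    batch, ch, h, w = data_shape
    h_prime = min(h, max(h_out_min_li, h_in_min_lj))  # H' <= e_ij.H
    return prod([batch, ch, h_prime, w])

# Edge e^1_12: l^1_1 produces rows of height 1, l^1_2 consumes parts of
# height ks = 5, so only a [1, 3, 5, 32] part (480 elements) must be stored.
print(min_part_size([1, 3, 32, 32], h_out_min_li=1, h_in_min_lj=5))  # 480
```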
To illustrate how data processing by parts reduces the memory cost of a CNN-based application, we show an example where layers
\(l^1_1\) and
\(l^1_2\) of
\(CNN^1\), shown in Figure
1(a) and used by application
\(APP\) explained in Section
2.3, process data by parts. The example is illustrated in Figure
3, where layer
\(l^1_1\) has 32 phases and layer
\(l^1_2\) has 32 phases. Execution of the phases of layer
\(l^1_1\) and layer
\(l^1_2\) is performed in a specific order. We formally define this order as a schedule, compactly written as
\(\lbrace [\lbrace l^1_1\rbrace ] \times 5, \lbrace l^1_2\rbrace , [\lbrace l^1_1\rbrace , \lbrace l^1_2\rbrace ] \times 27\),
\([\lbrace l^1_2\rbrace ] \times 4\rbrace\). In this schedule, the square brackets enclose repetitive (sub-)sequences of steps. At every step, one phase of a CNN layer is executed. During the first five steps, the first five phases of layer
\(l^1_1\) are executed, which is expressed as
\([\lbrace l^1_1\rbrace ] \times 5\) in the aforementioned schedule. At every phase, layer
\(l^1_1\) produces a data part of shape
\([1,3,1,32]\) in buffer
\(B_{1}\), used to store the data exchanged between layers
\(l^1_1\) and
\(l^1_2\) as specified in Table
1 in Section
2.3. After the first five steps, the data part of shape
\([1,3,5,32]\) is accumulated in buffer
\(B_{1}\). This part is sufficient to execute the first phase of layer
\(l^1_2\). Thus, at step 6 of the schedule, the first phase of layer
\(l^1_{2}\) is executed (see Figure
3(a)). To execute the second phase of layer
\(l^1_2\) (see Figure
3(b)), the data of shape
\([1,3,5,32]\) should be accumulated in
\(B_{1}\). However, part of this data is already in
\(B_{1}\), because the input data required by consecutive phases of layer
\(l^1_2\) overlaps. Since the overlapping part is already stored in buffer
\(B_{1}\), only the new (non-overlapping) data has to be produced in
\(B_{1}\) to enable the execution of the second phase of layer
\(l^1_2\). This new data is produced by executing one phase of layer
\(l^1_1\). Thus, phases 6 through 32 of layer
\(l^1_1\) and phases 2 through 28 of layer
\(l^1_2\) are executed in an alternating manner, where a phase of layer
\(l^1_1\) is followed by a phase of layer
\(l^1_2\), and this pattern repeats until all phases of
\(l^1_1\) are executed. This is expressed as
\([\lbrace l^1_1\rbrace , \lbrace l^1_2\rbrace ] \times 27\) in the aforementioned schedule. Finally, the last four phases of layer
\(l^1_2\) are executed. The maximum amount of data stored between layers
\(l^1_1\) and
\(l^1_2\) at any time during execution corresponds to a data part of shape
\([1,3,5,32]\) accumulated in
\(B_{1}\). Thus, when layers
\(l^1_1\) and
\(l^1_2\) of
\(CNN^1\) process data by parts, the size of buffer
\(B_{1}\) is \(1 * 3 * 5 * 32 = 480\) data elements, which is
\(3{,}072/480 = 6.4\) times smaller than the size of buffer
\(B_{1}\) given in Table
1 in Section
2.3. Thus, by introducing data processing by parts into the CNN layers, the methodology in our earlier work [
17] reduces the memory cost of a CNN. However, data processing by parts may cause CNN execution time overheads (e.g., CNN layers may require time to switch among the data parts), leading to a decrease in CNN throughput. Thus, processing data by parts involves a trade-off between the CNN memory cost and the CNN throughput. In this work, similarly to the methodology in our previous work [
17], we exploit this trade-off to reduce the CNN’s memory cost.
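The buffer size of 480 data elements can also be reproduced by replaying the phase schedule above and tracking the occupancy of buffer \(B_1\), as in the small Python sketch below (a simplified model of our own, not the SDF3 analysis; it assumes that every phase of \(l^1_1\) produces one row and every phase of \(l^1_2\) with stride 1 discards the oldest row after it executes).

```python
# Replay the phase schedule for layers l^1_1 and l^1_2 and track the occupancy
# of buffer B_1 in rows of shape [1, 3, 1, 32] (96 data elements per row).
schedule = ["l1_1"] * 5 + ["l1_2"] + ["l1_1", "l1_2"] * 27 + ["l1_2"] * 4

rows_in_B1, max_rows = 0, 0
for phase in schedule:
    if phase == "l1_1":
        rows_in_B1 += 1                      # one new output row is produced
    else:
        rows_in_B1 = max(rows_in_B1 - 1, 0)  # the oldest row is no longer needed
    max_rows = max(max_rows, rows_in_B1)

row_elements = 1 * 3 * 1 * 32
print(max_rows, max_rows * row_elements)     # 5 480
```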
It is important to note that the reduction of application buffer sizes achieved through data processing by parts requires the layers of a CNN to be executed in a specific order, formally expressed as a schedule. For example, as explained earlier, layers
\(l^1_1\) and
\(l^1_2\) are executed in phases where the execution order of the phases is defined by the following schedule:
\(\lbrace [\lbrace l^1_1\rbrace ] \times 5, \lbrace l^1_2\rbrace , [\lbrace l^1_1\rbrace , \lbrace l^1_2\rbrace ] \times 27\),
\([\lbrace l^1_2\rbrace ] \times 4\rbrace\). To find a proper schedule (i.e., the execution order of phases in a CNN), similarly to the methodology in our previous work [
17], we first convert the CNN into a functionally equivalent
Cyclo-Static Dataflow (CSDF) model of computation [
5], accepted as an input by many embedded systems analysis and design tools. For the description of a CNN represented as a CSDF model and details of the CNN-to-CSDF model conversion, we refer the reader to the methodology proposed in our earlier work [
17]. Second, we use the
\(SDF3\) embedded systems analysis and design tool [
28] to automatically derive the execution order (schedule) of the phases within a CNN. We also use the
\(SDF3\) tool to automatically compute the sizes of the CNN buffers when the CNN is executed in phases.