1 Introduction
Progress in Deep Neural Networks (DNNs) has been driving various applications, including computer vision [10, 11], speech recognition [1, 7], medical image analysis [22, 26], malware detection [24, 30], and many others. DNNs enable learning complex features in input data [28], achieving superior performance in a variety of tasks compared to conventional machine learning algorithms.
The excellent performance of DNNs depends on a tremendous effort to train the neural network model. The cost of model creation can be captured in the following aspects: (1) the labor and time to create/collect a dataset for training, (2) the cost of computing resources to run the DNN training algorithm, and (3) the search for hyper-parameters that result in optimal DNN models. Therefore, DNN models for specific applications need to be considered a form of intellectual property with high commercial value. This creates a motivation to obtain the DNN models via adversarial/non-commercial means, e.g., via attacks that extract the model. Besides the loss of commercial value, the loss of DNN model confidentiality can also lead to security and privacy problems. An exposed model may facilitate adversarial attacks, in which an attacker crafts input samples that make the target model misclassify [20], or membership inference attacks, in which an attacker aims to learn whether an input to the target model belongs to its private training set [27].
Model extraction attacks utilizing class probabilities (confidence scores) from model output have been demonstrated. In Reference [32], Tramèr et al. demonstrated a model extraction attack against some DNNs by observing the outputs of API queries. In Reference [14], Jagielski et al. showed an efficient learning-based model extraction attack that utilizes query outputs. Query-based attacks can be effectively defended against by limiting model output. Side-channel attacks have also been demonstrated as effective for DNN model extraction [3, 8, 9, 12, 13, 34, 35, 36]. Side-channel attacks pose a significant risk, because side-channel information emanating during DNN execution is difficult to eliminate. Exploitation of digital side channels enables the attacker to extract coarse-grained information about the target DNN model, such as its architecture. In Reference [13], Hua et al. proposed a reverse-engineering attack based on observing the accelerator's off-chip memory access patterns and enumerating the possible architectures that satisfy the constraints. In Reference [35], Yan et al. utilized both memory access and timing side channels to reverse-engineer the size of the matrices involved in matrix multiplication, allowing them to infer the DNN architecture. Hu et al. extracted the DNN architecture from noisy PCIe and memory-bus events on a GPU platform [12]. Duddu et al. utilized execution time to infer the target DNN layer depth [9].
As this article demonstrates, attacks that exploit device power consumption or EM emanation enable a direct retrieval of DNN weights. In an attack modality borrowed from attacks on cryptographic implementations, the attacker feeds a large number of inputs, collects the corresponding power/EM traces, and analyzes them using a hypothesis-testing foundation. The attacker selects a power model that represents the power of an intermediate state of the secret-related computation, makes a hypothesis on the secret, and evaluates the power model based on the hypothesis. The correct hypothesis results in the highest correlation between the power/EM traces and the predicted power values.
Several efforts have demonstrated such attacks for DNN weight retrieval. In Reference [3], Batina et al. demonstrated a correlation EM attack on a micro-controller to retrieve approximate values of the single-precision weights of a multi-layer perceptron. While the attack is demonstrated on a conventional micro-controller, its feasibility on a customized DNN hardware accelerator with high parallelism is unexplored. In Reference [36], Yoshida et al. implemented an FPGA-based DNN accelerator with a single processing element (PE) and performed a correlation EM attack to retrieve the weights, stored in an 8-bit fixed-point representation, by analyzing the multiply-and-accumulate operation. However, this work has not studied the feasibility of an attack on a high-performance accelerator with multiple processing elements, which is a more common deployment scenario. Dubey et al. demonstrated a differential power attack (DPA) to retrieve the binarized weights of a model implemented in an FPGA-based DNN accelerator with an adder tree used for accumulation [8]. The attack is only evaluated against binarized weights. Whether the attack can be effective against an 8-bit weight implementation, which is typically used by state-of-the-art accelerators, needs to be examined.
In this work, we demonstrate a differential power attack to retrieve the weights from an FPGA-based matrix multiplication accelerator. We adopt an 8-bit integer (INT8) weight representation and implement a design with multiple processing elements performing parallel multiply-and-accumulate (MAC) operations. The design adopts a weight-stationary dataflow. Matrix multiplication between input activations and weights is the core computational kernel of DNNs, as the computation of both convolutional and fully connected layers can be mapped to matrix multiplication. We propose a multi-step DPA framework that utilizes the dependency between the MAC results of different weights to determine the value of each weight using measured power traces.
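As a point of reference for the power models developed later, the following simplified Python model (our own illustration under the stated assumptions, not the accelerator's RTL) captures the weight-stationary MAC chain of one PE column: each PE holds a fixed INT8 weight, multiplies it with its streamed input activation, and adds the partial sum arriving from the previous PE; the running partial sum corresponds to the register referred to as Reg C later in the text.

```python
def pe_column_mac(weights, activations):
    """Simplified weight-stationary dot-product column.

    weights     : list of INT8 weights, one per PE (pre-loaded, stationary)
    activations : list of INT8 input activations, one per PE
    Returns the partial sums produced by each PE; the last one is the
    dot-product result held by the final PE.
    """
    partial_sums = []
    psum = 0
    for w, x in zip(weights, activations):
        psum += int(w) * int(x)       # MAC: multiply and accumulate
        partial_sums.append(psum)     # value captured in that PE's accumulator
    return partial_sums

# Example: a 3x1 weight vector and one 3x1 input vector.
print(pe_column_mac([12, -7, 33], [5, 100, -2]))   # [60, -640, -706]
```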
We study both 1D and 2D systolic arrays. We consider a 1D array as a separate case because it is an important VLSI model. In addition, compared to 2D arrays, it can offer better reconfigurability, a lower memory bandwidth requirement, and better energy efficiency due to the reduced inter-PE communication and control logic complexity, which may be desired in certain scenarios [16, 33]. We first study a \(3 \times 1\) systolic array. The results show that we are able to retrieve all weights in the weight vector of the 1D array using 20K power traces with a conventional DPA. Next, we study the scalability of the DPA attack to larger designs. We implement a \(3 \times 3\) systolic array. The results show that the 2D accelerator exhibits higher security. Both a conventional DPA and a stronger, template-based DPA fail: only 30% of the weights are recovered after 460K traces. We investigate the causes of the higher resistance of the 2D array to the attack compared to the 1D accelerator. We explain the higher security of a 2D array design by the fact that it intrinsically entails multiple MAC outputs that are simultaneously dependent on the same input. We show that this is fundamentally a feature of the weight-stationary dataflow. However, we discover that an enhanced template-based DPA with multiple profiling phases manages to expose the leakage of each PE column step by step and retrieves the weights from each column. The attack is able to retrieve all weights from the 2D array with only 40K traces. The results on both 1D and 2D arrays show that both architectures are ultimately vulnerable.
Our contributions can be summarized as follows:
•
We investigate the vulnerability of a 1D systolic array. Our results show that a conventional DPA on the 1D array succeeds fully, requiring 20K power measurements.
•
We investigate the vulnerability of a 2D systolic array. With a conventional DPA and a stronger, template-based DPA, only 30% of the model weights are retrieved even after 460K power traces. Our analysis finds that the higher security of a 2D array arises from the fact that it intrinsically entails multiple MACs that are simultaneously dependent on the same input.
•
We propose a novel template-based DPA with multiple profiling phases, which fully breaks the 2D systolic array.
•
We investigate Hamming distance, Hamming weight, and bit-level power models in the attack on a 2D array.
4 DPA on Dot Product Accelerator
We demonstrate how the DPA framework can be modified to successfully attack the dot product accelerator. The strategy allows the attack to succeed in fully identifying the weight vector pre-loaded on the accelerator. We describe the details of the modified DPA framework, which utilizes the dependence of the instantaneous MAC output on the earlier-processed weights. Without such a modification, a naive extension of the DPA used in attacks on common cryptographic ciphers, e.g., AES, fails.
We start by revisiting the algorithm for attacking AES [4]:
•
The attacker collects N power traces with T samples per trace. Each trace corresponds to one encryption on a random input plaintext. The attacker arranges the collected power traces into an \(N \times T\) matrix.
•
For a target AES key byte and for each AES encryption, the attacker calculates a hypothetical power value for each of the 256 possible values of the key byte. The calculated hypothetical power values are arranged in a \(256 \times N\) matrix.
•
The attacker calculates Pearson correlation between each row of the power model matrix and each column of the power trace matrix. The correlation coefficients are arranged in a \(256 \times T\) matrix. The row index with the largest value of the correlation matrix represents the attacker’s best guess of the target key byte.
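For illustration, a minimal numpy sketch of this correlation step, assuming the \(N \times T\) trace matrix and the \(256 \times N\) hypothetical power matrix have already been built:

```python
import numpy as np

def dpa_key_byte_guess(traces, hyp):
    """traces: (N, T) power trace matrix; hyp: (256, N) hypothetical power matrix.
    Returns the key-byte guess whose row shows the largest absolute Pearson
    correlation with any column (time sample) of the trace matrix."""
    tr = traces - traces.mean(axis=0)             # center each time sample
    hp = hyp - hyp.mean(axis=1, keepdims=True)    # center each hypothesis row
    num = hp @ tr                                 # (256, T) cross-products
    den = np.outer(np.sqrt((hp ** 2).sum(axis=1)),
                   np.sqrt((tr ** 2).sum(axis=0)))
    corr = np.abs(num) / np.maximum(den, 1e-12)   # (256, T) correlation matrix
    return int(np.unravel_index(corr.argmax(), corr.shape)[0])
```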
In the DPA on AES, each key byte of the entire key is extracted individually. The attacker typically focuses on the last round of AES encryption. We propose the following strategy for an attack on a DNN. The attack focuses on a single 8-bit weight at a time. The attacker first collects N power traces, each corresponding to a single dot product computation. Then, the attacker locates the clock cycles in the power traces where the MAC operation involving the weight occurred. We propose that the power model be based on the Hamming distance of the register holding the MAC result between consecutive inputs for each weight. The DPA starts analysis from the first weight in the weight vector and targets weights serially.
First, we follow the above framework directly. We collect 100K power traces from the FPGA board, each corresponding to a dot product computation of three random \(3 \times 1\) input vectors (the input FIFO depth is 3) and the fixed (secret) \(3 \times 1\) weight vector. We start by focusing on the first weight \(w_{11}\), which is associated with the first PE in the column. We use the Hamming distance of Reg C of PE1 over the first two consecutive inputs to form the power model. If we denote the two inputs as \(x_{11}\) and \(x_{21}\), then the power model \(H_{11}\) is:
\[H_{11} = HD(x_{11} \cdot w_{11},\; x_{21} \cdot w_{11}) \qquad (3)\]
We use the phase of the power traces that corresponds to the switching of the target register and organize data into a \(100,000 \times 334\) matrix. Then, for each possible value of \(w_{11}\), we calculate \(H_{11}\). The resulting power models are arranged in a \(256 \times 100,000\) matrix. Finally, we correlate each row of the power model matrix with each column of the power trace matrix and select the value of \(w_{11}\) with the largest correlation as the attack guess.
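A sketch of how the \(256 \times 100{,}000\) hypothesis matrix for \(w_{11}\) can be built (our own illustration; the helper names are hypothetical, the candidate weights are interpreted as signed INT8 values, and a 24-bit Reg C width is assumed purely for illustration):

```python
import numpy as np

REG_BITS = 24                        # assumed Reg C width (illustrative only)
MASK = (1 << REG_BITS) - 1

def hamming_distance(a, b):
    """Bit flips between two register states, taken as REG_BITS-wide words."""
    return bin((a & MASK) ^ (b & MASK)).count("1")

def build_h11(x11, x21):
    """Hypothesis matrix for w11: one row per candidate weight, one column per
    trace. x11 and x21 hold, for each trace, the first two inputs consumed by
    PE1 (no accumulation is involved for the first weight)."""
    N = len(x11)
    H = np.zeros((256, N))
    for g in range(256):
        w = g - 256 if g >= 128 else g           # candidate as signed INT8
        for n in range(N):
            H[g, n] = hamming_distance(int(x11[n]) * w, int(x21[n]) * w)
    return H
```

The resulting matrix is then correlated column-by-column with the trace matrix exactly as in the AES sketch above.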
Unfortunately, this direct procedure fails to retrieve the correct value of \(w_{11}\) (Figure 6(a)). We describe a solution for finding \(w_{11}\) later. We first describe the strategy for extracting the other weights, assuming \(w_{11}\) has already been retrieved and the partial sums produced by \(w_{11}\) can be calculated. We now construct the power model \(H_{21}\) for \(w_{21}\) as:
\[H_{21} = HD(x_{11} w_{11} + x_{12} w_{21},\; x_{21} w_{11} + x_{22} w_{21}) \qquad (4)\]
In Equation (4), \(x_{12}\) and \(x_{22}\) are the first two inputs within the input FIFO for \(w_{21}\). We correlate the power models with the targeted portion of the power traces. The correlation coefficients for all possible values of \(w_{21}\) are shown in Figure 6(b). This time, the highest correlation corresponds to the correct value of \(w_{21}\). The power model \(H_{31}\) for the next weight \(w_{31}\) is:
\[H_{31} = HD(x_{11} w_{11} + x_{12} w_{21} + x_{13} w_{31},\; x_{21} w_{11} + x_{22} w_{21} + x_{23} w_{31}) \qquad (5)\]
Here, \(x_{13}\) and \(x_{23}\) are the first two consecutive input values into the input FIFO for \(w_{31}\). The correlation for \(w_{31}\) is shown in Figure 6(c). The attack again succeeds in retrieving the correct value of \(w_{31}\).
Based on the above discussion, the weight \(w_{11}\) seems to be the bottleneck: \(w_{11}\) needs to be retrieved first to retrieve the subsequent weights \(w_{21}\) and \(w_{31}\). It is critical to understand why the DPA fails to retrieve \(w_{11}\) directly. We note that \(w_{11}\) is the first element in the weight vector. This means that the previous partial sum input to PE1 is zero. Therefore, the MAC associated with \(w_{11}\) involves only multiplication. In contrast, the MACs related to \(w_{21}\) and \(w_{31}\) involve both multiplication and accumulation.
DPA examines the correlation between power traces and power models. It is critical that the hypothetical power models based on the incorrect guesses of the secret do not show correlation with the model based on the correct guess. If there is aliasing (correlation between the power model of the correct guess and that of a different guess), then it will cause the incorrect guess to also show high correlation. This effectively reduces the confidence in the correct guess. In block ciphers, such as AES, this issue does not arise due to the fundamental non-linearity of the S-box [21].
In our case, aliasing occurs. We investigate this further to understand the difference in the behavior of \(w_{11}\) and \(w_{21}\). We calculate the pairwise correlation across the 256 rows of the power model matrix for \(w_{11}\), based on the DPA described above, and repeat the calculation for \(w_{21}\). Figure 7 shows the results plotted as 2D color maps. It can be seen that the power models of different guesses on \(w_{11}\) show large correlation, indicating large aliasing. The power models of different guesses of \(w_{21}\) show only a large self-correlation. Note that the difference is due to the absence of accumulation in the MAC for \(w_{11}\). (The role of accumulation in the case of \(w_{11}\) extraction is intriguing, and we plan to explore it further in the future.)
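The aliasing check itself is a one-line computation over the hypothesis matrix; a sketch (reusing the \(256 \times N\) layout above):

```python
import numpy as np

def hypothesis_aliasing(H):
    """Pairwise Pearson correlation across the 256 rows of a hypothesis matrix H.
    Large off-diagonal entries indicate aliasing: wrong guesses whose predicted
    power closely tracks that of the correct guess.
    Rows with zero variance (e.g., an all-zero weight guess) yield NaN entries."""
    return np.corrcoef(H)            # (256, 256) matrix, plotted as a color map
```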
To overcome the difficulty of retrieving \(w_{11}\) directly, we use the MACs involving additional, subsequent weights to extract the correct \(w_{11}\). Since the product generated by \(w_{11}\) also determines the MAC outputs for \(w_{21}\) and \(w_{31}\), the hypothetical power model shows a high correlation only for the correct weight combination. To achieve this, we modify the attack procedure as follows (a code sketch of the full procedure is given after the list):
•
Perform DPA on \(w_{11}\). Construct the power model matrix using Equation (3) and the power trace matrix. Calculate the correlation between each row of the power model matrix and each column of the power trace matrix. Sort all guesses on \(w_{11}\) based on the maximum correlation found in the time window. Since the rank of the correct guess on \(w_{11}\) is close to 50, we record the \(w_{11}\) guesses corresponding to the 50 highest correlations.
•
For each recorded guess on \(w_{11}\), perform DPA on \(w_{21}\). Construct the power model matrix using Equation (4) and the power trace matrix and calculate the correlations. Record the guess on \(w_{21}\) with the highest correlation and its corresponding \(w_{11}\) guess. This results in 50 \((w_{11}, w_{21})\) guesses.
•
With each recorded \((w_{11}, w_{21})\) guess, perform DPA on \(w_{31}\). Construct the power model matrix using Equation (5) and the power trace matrix and calculate the correlations. Record the guess on \(w_{31}\) with the highest correlation and its corresponding \((w_{11}, w_{21})\) guess. Return the \((w_{11}, w_{21}, w_{31})\) guess with the highest correlation among all 50 recorded combinations as the final guess.
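A sketch of the full multi-step procedure (the hypothesis builders `hyp_h11`, `hyp_h21`, and `hyp_h31` are hypothetical interfaces standing in for the power models of Equations (3) to (5)):

```python
import numpy as np

def multistep_dpa(traces, inputs, hyp_h11, hyp_h21, hyp_h31, n_keep=50):
    """Sketch of the modified DPA for the dot-product accelerator.

    traces   : (N, T) power trace matrix covering the targeted clock cycles
    inputs   : per-trace input data needed by the hypothesis builders
    hyp_h11  : callable(inputs) -> (256, N) matrix per Equation (3)
    hyp_h21  : callable(inputs, w11) -> (256, N) matrix per Equation (4)
    hyp_h31  : callable(inputs, w11, w21) -> (256, N) matrix per Equation (5)
    """
    def best_corr(H):
        # Max absolute Pearson correlation of each hypothesis row over time.
        c = np.abs(np.corrcoef(np.vstack([H, traces.T]))[:256, 256:])
        return np.nan_to_num(c).max(axis=1)      # guard zero-variance rows

    # Step 1: keep the n_keep most likely w11 candidates.
    top_w11 = np.argsort(-best_corr(hyp_h11(inputs)))[:n_keep]

    # Steps 2-3: extend each w11 candidate with its best w21, then w31.
    results = []
    for w11 in top_w11:
        w21 = int(np.argmax(best_corr(hyp_h21(inputs, int(w11)))))
        w31_scores = best_corr(hyp_h31(inputs, int(w11), w21))
        w31 = int(np.argmax(w31_scores))
        results.append((w31_scores[w31], (int(w11), w21, w31)))

    # Return the (w11, w21, w31) combination with the highest correlation.
    return max(results)[1]
```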
The time complexity of the modified DPA framework depends on how many guesses of \(w_{11}\) are recorded; the computational cost grows linearly with the number of recorded \(w_{11}\) guesses. We apply the above modified DPA framework to 100K power traces. The correct guess on \((w_{11}, w_{21}, w_{31})\) is successfully retrieved, as shown in Figure 8(a). We start with 10K power traces and increase the number of traces used in steps of 10K, repeating the experiments. We identify 20K traces as the Measurement to Disclosure (MTD) for the dot product accelerator. The correlation plot of the DPA with 20K power traces is shown in Figure 8(b). We summarize the results of the attack on the dot product accelerator in Table 1. The Rank column shows the rank of the correct guess on \((w_{11}, w_{21}, w_{31})\) among all 50 recorded combinations, and the Correlation column shows the Pearson correlation of the correct guess at 20K (MTD) power traces.
5 Higher Security of Matrix-vector Multiplication Accelerator
In this section, we study the vulnerability of a 2D matrix-vector multiplication accelerator to a DPA-style attack. We find that the 2D accelerator exhibits higher security. Both a conventional DPA and a stronger template-based DPA fail. We investigate the causes of higher resistance of the 2D array to the attack compared to the dot product accelerator. We explain the reason for higher security of a 2D array design by the fact that it intrinsically results in multiple instantaneous MAC outputs being dependent on the same input. We show that this is a fundamental feature of the commonly used weight-stationary dataflow.
We first investigate a conventional DPA attack that uses the approach described above. A conventional DPA does not require access to an identical profiled device. We collect 460K power traces from the matrix-vector multiplication accelerator. (We stopped at 460K traces due to the measurement time budget.) For each PE column, the top 50 guesses of the first weight of the column (\(w_{11}\) for column 1, \(w_{12}\) for column 2, \(w_{13}\) for column 3) are recorded and the relevant phases of the power traces are selected. However, the conventional DPA fails to retrieve the weights, as shown in Table 2, where NA indicates that the correct weight combination does not appear among the 50 recorded weight pairs.
We now investigate a template-based DPA, which assumes a stronger adversary. Since DPA relies on the analysis of power consumed by MACs, we propose a profiling technique that removes all non-MAC power to improve the effective SNR. We assume an attacker has full access to an identical device, which can be used for profiling. Hence, the attacker can modify the secret weights of the profiled device and collect its power traces. Specifically, the attacker sets all the weights of the profiled device to zero. A trace (template) from the profiled device captures the systolic array power minus the MAC power. The attacker produces a power template for each observed input. This means that the same number of template power traces needs to be collected from the profiled device as target power traces. The attacker subtracts the template power trace from the target power trace, and the difference is then used for DPA. We note that the proposed attack is different from the template attack of Chari et al. [6], which constructs a template using the mean trace and the noise covariance matrix for each key value. We call our attack the template-based DPA to reflect the fact that an identical device is used for profiling.
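A sketch of the trace pre-processing performed by this template-based DPA (assuming the per-input template traces from the zero-weight profiled device are time-aligned with the target traces):

```python
def subtract_templates(target_traces, template_traces):
    """Remove non-MAC power from the target traces.

    target_traces   : (N, T) numpy array of traces from the device under attack
    template_traces : (N, T) numpy array of traces from an identical profiled
                      device with all weights set to zero, same N inputs
    The difference approximates the MAC-only power and is then fed to the DPA.
    In practice, each template may be averaged over repeated measurements to
    suppress noise (an assumption of this sketch, not a step from the text).
    """
    assert target_traces.shape == template_traces.shape
    return target_traces - template_traces
```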
The attack just described allows extracting some, but not all, weights, even after collecting a much larger number of traces (460K). Table 3 shows that the template-based DPA fails to retrieve the weights of two out of three columns.
We verify our conclusions on the 2D matrix-vector multiplication accelerator by performing the attack using simulated noise-free power traces. We generate the simulated traces by modeling the register switching in each PE. The power consumption of each PE at a specific clock cycle is modeled as:
\[P_{PE} = \alpha \cdot HD(\mathrm{Reg\ A}) + \beta \cdot HD(\mathrm{Reg\ C})\]
The power contribution of Reg A and Reg C defines each term of the equation. \(HD(\cdot)\) denotes the Hamming distance of the register values over two consecutive clock cycles. Coefficients \(\alpha\) and \(\beta\) are used to adjust the contributions of registers in different PEs, based on their load capacitance. Specifically, for PE1, PE2, PE4, PE5, PE7, and PE8, we use \(\alpha = 3\) and \(\beta = 2\); for PE3, PE6, and PE9, we use \(\alpha = 1\) and \(\beta = 2\). We sum the power of each PE to get the total power at a specific clock cycle. This represents the power averaged over one clock cycle. We repeat the register-switching calculation for each compute-active clock cycle of the 2D systolic array, obtaining a simulated trace with seven samples.
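A sketch of how one noise-free sample can be formed from the register activity under this model (assuming, consistently with the equation above, that \(\alpha\) weights Reg A and \(\beta\) weights Reg C, and that the per-PE register values come from a functional simulation of the dataflow):

```python
# Load-dependent coefficients: alpha for Reg A, beta for Reg C.
ALPHA = {pe: (1 if pe in (3, 6, 9) else 3) for pe in range(1, 10)}
BETA = {pe: 2 for pe in range(1, 10)}

def hd(a, b, bits=24):
    """Hamming distance between two register states (width is illustrative)."""
    mask = (1 << bits) - 1
    return bin((a & mask) ^ (b & mask)).count("1")

def simulated_sample(reg_a_prev, reg_a_cur, reg_c_prev, reg_c_cur):
    """One simulated power sample for one clock cycle of the 3x3 array.

    Each argument is a dict mapping a PE index (1..9) to that register's value
    before or after the clock edge, as produced by a functional simulation.
    """
    power = 0.0
    for pe in range(1, 10):
        power += ALPHA[pe] * hd(reg_a_prev[pe], reg_a_cur[pe])
        power += BETA[pe] * hd(reg_c_prev[pe], reg_c_cur[pe])
    return power
```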
We generate 460K simulated noise-free traces using the same inputs as the 460K measured traces. We perform both the conventional DPA and the template-based DPA with the simulated traces and summarize the results in Tables 4 and 5.
We also explored Hamming weight power models and bit-level power models in the template-based DPA. For a specific PE at a specific clock cycle, the Hamming weight power model is constructed as the Hamming weight of Reg C. The bit-level power model is constructed as the Hamming distance of a single bit of Reg C [19] over two consecutive cycles. (We selected one bit to explore the behavior, but we believe the results do not depend on which bit is used.) We substitute the power models given by Equations (3) to (5) with the Hamming weight and bit-level power models and repeat the DPA described above. Unfortunately, none of the weights could be retrieved with the new power models even with 460K power traces, as shown in Tables 6 and 7. We believe that the failure of the Hamming weight power model is due to its inaccurate capture of the PE power consumption, while the failure of the bit-level model is due to the interference from the switching of other register bits.
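For reference, the three power models compared in this section differ only in how the state of Reg C is mapped to a predicted power value; a sketch (register width and bit index are illustrative choices):

```python
def hamming_weight(value, bits=24):
    """HW model: number of set bits in the current Reg C state."""
    return bin(value & ((1 << bits) - 1)).count("1")

def hamming_distance(prev, cur, bits=24):
    """HD model: number of Reg C bits that toggle between consecutive cycles."""
    return hamming_weight(prev ^ cur, bits)

def bit_level(prev, cur, bit=0):
    """Bit-level model: toggle (0 or 1) of one selected bit of Reg C."""
    return ((prev ^ cur) >> bit) & 1
```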
The 2D matrix-vector multiplication accelerator appears significantly less vulnerable to a DPA-style attack compared to the simpler 1D array. To understand the source of the improved resistance to DPA, we conduct additional experiments. We investigate the exploratory case where only a single PE column of the \(3 \times 3\) systolic array is activated. In this case, the accelerator is essentially performing a dot product of the input vector with the weight column. The difference is that inputs still propagate horizontally across the PEs and contribute to power. We implement this case by pre-loading the weights of the target PE column only and pre-loading zero weights to the remaining PE columns. The MACs of these columns do not contribute power, since all their partial sums remain zero. For consistency with the previous template-based experiment, we assume the attacker has access to an identical device to collect power traces. We perform an exploratory study for each individual PE column: the PE column is active while the other columns are inactive. In each study, the templates (to be subtracted) are based on 100K traces. Table 8 summarizes the results of the exploratory studies for the individual PE columns.
The experiments demonstrate that all weights in each PE column were retrieved, requiring at most 30K traces. This confirms the vulnerability of the dot product to DPA. It also shows that simultaneous MACs of different PE columns create a higher resistance to DPA by contributing power that interferes with the selection of the correct hypothesis.
We believe that this behavior, in which the MAC operations of different columns cause issues for DPA, is a general characteristic of 2D accelerators based on the weight-stationary dataflow. The behavior is caused by multiple MAC outputs depending on the same inputs simultaneously. This is because the weight-stationary dataflow results in the inputs to the matrix-vector multiplication array being arranged in a diagonal wave-front format, as shown in Figure 4. The MAC output of a PE and that of the PE on its lower left in the adjacent column (if applicable) are affected by the same input(s) simultaneously. To illustrate this dataflow feature, we focus on PE2, PE3, and PE5. The MAC output to be computed in PE3, and the previous partial sum to PE5, are determined by the same input at a specific clock cycle. PE5 accumulates its input-weight products with the previous partial sum. This means the same input affects the MAC outputs of both PE3 and PE5 simultaneously (Figure 9).
We assess the interference by examining the correlation between the switching of Reg C of different PEs in the same clock cycle. Since Reg C holds the MAC output of each PE, the power due to the switching of Reg C represents each PE's MAC power for the purpose of DPA correlation analysis. We calculate the pairwise correlation of the Hamming distances of Reg C in PE1 to PE8 (Table 9).
We observe that some PE pairs exhibit large correlation in addition to self-correlation. Specifically, PEs whose MAC results are simultaneously affected by the same inputs show non-zero correlation: (PE2, PE4), (PE3, PE5), (PE5, PE7), and (PE6, PE8). Some of them, (PE3, PE5) and (PE6, PE8), show high correlation. As discussed earlier, such correlation interferes with DPA's effectiveness and reduces the confidence in the correct hypothesis for a target PE.
In contrast, for a 1D array, such interference does not occur, since the inputs stop propagating after being consumed by the PEs and the same input is never used across multiple PEs. The MAC outputs of different PEs in the 1D array, at any clock cycle, are determined by different inputs. Thus, the MAC power of different PEs does not show correlation, as shown by the zero correlations among (PE1, PE4, PE7) and among (PE2, PE5, PE8) in Table 9.
6 Breaking 2D Accelerator with Enhanced Template DPA
In this section, we demonstrate a stronger template-based DPA that succeeds in fully retrieving the weights of the 2D array. The attack requires multiple profiling phases; we call it a multi-phase template-based DPA.
As discussed, the 2D array exhibits higher resistance to DPA due to the parallel MAC operations of different PE columns. To fully retrieve the weights, it is critical to focus on each PE column individually and remove the interference of the other columns. Localizing the leakage of each PE column is challenging, because a power trace captures the aggregate power of the entire 2D array. However, it is possible to expose the leakage of each PE column via a sequential analysis. Specifically, we identify the most vulnerable PE column, extract its weights, remove the effect of the column via a template, and move to the next most vulnerable remaining PE column. The process is repeated until the weights of all PE columns are retrieved. The main question is how to find the most vulnerable PE column of the 2D array.
Using the previous results, we observe that the rightmost PE column (PE3, PE6, and PE9) appears to be the easiest to break. The attack can break this column with 20K power traces, while the other columns remain secure even after 460K power traces. The reason for this behavior is that input propagation stops at the rightmost PE column. We believe that the PE column furthest from the inputs is the most vulnerable one and its weights can be retrieved most easily. Based on this hypothesis, we propose the following attack (a code sketch is given after the list). For simplicity, we use the term “trace” to refer to a power trace and “template trace” to refer to a power trace collected from the profiled device with fixed weights.
•
Perform DPA on column 3 (PE3, PE6, PE9) using the 1D-array DPA framework. Retrieve the weights \((w_{13}, w_{23}, w_{33})\).
•
Set the weights of PE3, PE6, and PE9 on the profiled device to \((w_{13}, w_{23}, w_{33})\). Set other weights to zero. Collect a phase-1 template trace for each input of the original set of traces.
•
Subtract the phase-1 template traces from the corresponding original traces. Perform DPA on column 2 (PE2, PE5, PE8) using the updated traces and the 1D-array DPA framework. Retrieve the weights \((w_{12}, w_{22}, w_{32})\).
•
Set the weights of PE3, PE6, and PE9 on the profiled device to \((w_{13}, w_{23}, w_{33})\). Set the weights of PE2, PE5, and PE8 to \((w_{12}, w_{22}, w_{32})\). Set other weights to zero. Collect a phase-2 template trace for each input of the original set of traces.
•
Subtract the phase-2 template traces from the corresponding original traces. Perform DPA on column 1 (PE1, PE4, PE7) using the updated traces and the 1D-array DPA framework. Retrieve the final set of weights \((w_{11}, w_{21}, w_{31})\).
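A sketch of the overall multi-phase procedure (all callables are hypothetical interfaces standing in for the measurement and analysis steps described above):

```python
def multiphase_template_dpa(traces, inputs, columns, dpa_1d, collect_templates):
    """Multi-phase template-based DPA on the 2D array (sketch).

    traces            : (N, T) numpy array of traces from the device under attack
    inputs            : the N inputs used to collect `traces`
    columns           : PE columns ordered from farthest-from-input to nearest,
                        e.g., [3, 2, 1] for the 3x3 array
    dpa_1d            : callable(traces, inputs, column) -> recovered weights,
                        running the 1D dot-product DPA on one column
    collect_templates : callable(known_weights, inputs) -> (N, T) template
                        traces from the profiled device with the recovered
                        columns loaded and all other weights set to zero
    """
    residual = traces
    known_weights = {}                        # column index -> recovered weights
    for i, col in enumerate(columns):
        # Attack the currently most exposed column with the 1D DPA.
        known_weights[col] = dpa_1d(residual, inputs, col)
        if i + 1 < len(columns):
            # Re-profile with everything recovered so far and remove its power.
            templates = collect_templates(dict(known_weights), inputs)
            residual = traces - templates
    return known_weights
```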
We collect 60K traces from the matrix-vector multiplication accelerator and perform the above attack. The process requires 120K template traces to be collected in total. The attack succeeds in retrieving all weights from the 2D array (Figure 10). We summarize the results for each PE column in Table 10, identifying the MTD for each column.
We now analyze the time complexity of the multi-phase template-based DPA. The attack applies a 1D DPA to each PE column in turn; therefore, the complexity of the attack is proportional to the number of PEs in the 2D array. The cost of template construction is proportional to the number of PE columns. The complexity also depends on the number of guesses recorded for the first weight of each PE column.