5.1 Defense Process in Image Classification
Primary Activation Pattern Localization: To defend against condensed adversarial attacks in the image domain, we mainly rely on input semantic inconsistency at the input-pattern level. Therefore, we need to locate the primary activation source in the input image by adopting a CNN activation visualization method, Class Activation Mapping (CAM) [49].
Let \(A_k(x,y)\) denote the value of the \(k{\rm {th}}\) activation in the last convolutional layer at spatial location \((x,y)\). We can compute the sum of all activations at spatial location \((x,y)\) in the last convolutional layer as
\[
A_{T}(x,y) = \sum _{k=1}^{K} A_k(x,y),
\]
where K is the total number of activations in the last convolutional layer. A larger value of \(A_{T}(x,y)\) indicates that the activation source at the corresponding spatial location of the input image is more important for the classification result. For a natural input, this location corresponds to the object pattern, whereas for an adversarial input it corresponds to the adversarial patch.
In order to conduct further self-detection and data recovery, we need to determine the specific size of the primary activation pattern area. In this step, we first identify the location \((x_m,y_m)\) with the highest \(A_{T}(x,y)\) in the input image. If the adversarial patch size and shape are given, we can directly select the area with the same size and shape based on the location \((x_m, y_m)\). When the patch size and shape are not known beforehand, we instead first calculate the average activation value \(A_a\) across the entire image; then, starting from \((x_m,y_m)\), the surrounding locations whose values are higher than \(A_a\) are included in the pattern area.
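To make this step concrete, a minimal sketch is given below, assuming the last convolutional layer's activations are available as a NumPy array of shape (K, H, W); the function name and the breadth-first region growth are illustrative choices rather than the paper's exact implementation.

```python
import numpy as np
from collections import deque

def locate_primary_pattern(activations: np.ndarray):
    """Locate the primary activation pattern area.

    activations: last-conv-layer activations with shape (K, H, W).
    Returns the peak location (x_m, y_m) and a boolean mask of the
    pattern area grown from the peak (used when the patch size and
    shape are not known beforehand).
    """
    # A_T(x, y): sum of all K activations at each spatial location.
    a_total = activations.sum(axis=0)                       # shape (H, W)

    # Location with the highest summed activation.
    x_m, y_m = np.unravel_index(np.argmax(a_total), a_total.shape)

    # Average activation A_a across the whole map.
    a_avg = a_total.mean()

    # Grow the pattern area from (x_m, y_m): keep neighbouring
    # locations whose summed activation exceeds the average.
    mask = np.zeros_like(a_total, dtype=bool)
    queue = deque([(x_m, y_m)])
    while queue:
        x, y = queue.popleft()
        if mask[x, y]:
            continue
        mask[x, y] = True
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if (0 <= nx < a_total.shape[0] and 0 <= ny < a_total.shape[1]
                    and not mask[nx, ny] and a_total[nx, ny] > a_avg):
                queue.append((nx, ny))
    return (x_m, y_m), mask
```

The resulting mask can then be upsampled to the input resolution to crop the candidate patch area from the original image.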
Inconsistency Derivation: According to our preliminary analysis, the adversarial patch in the input contains much more high-frequency information than natural semantic input patterns. We first leverage the 2D Fast Fourier Transform (2D-FFT) [41] to transfer the patterns from the spatial domain to the frequency domain, which concentrates the low-frequency components together. Then, we convert the frequency-domain pattern to a binary pattern with an adaptive threshold. Figure 10 shows a conversion example, including adversarial patterns, expected synthesized patterns with the same prediction result, and natural input patterns. For the binary patterns, we observe a significant difference between the adversarial input and the semantic synthesized input. Therefore, we replace \(S(I_{pra},I_{ori})\) with the Jaccard Similarity Coefficient (JSC) [28] and propose our image inconsistency metric, which is formulated as
\[
D(P_{pra}, P_{exp}) = 1 - \frac{|P_{pra}\bigcap P_{exp}|}{|P_{pra}\bigcup P_{exp}|},
\]
where \(P_{pra}\) is the binary pattern obtained from the input's primary activation area and \(P_{exp}\) is the synthesized semantic pattern of the predicted class. \(P_{pra}\bigcap P_{exp}\) denotes the number of pixels whose value equals 1 in both \(P_{pra}\) and \(P_{exp}\). For image classification, the input semantic patterns of the expected prediction results can be obtained from the ground-truth dataset: by testing a CNN model once with a certain amount of data, we can record the model's preferred natural semantic input pattern by leveraging the CAM and size determination methods discussed earlier.
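A compact sketch of the inconsistency computation is shown below; the mean-based adaptive threshold on the log-magnitude spectrum and the helper names are our own assumptions for illustration.

```python
import numpy as np

def to_binary_spectrum(pattern: np.ndarray) -> np.ndarray:
    """2D-FFT a grayscale pattern, centre the low-frequency components,
    and binarise the log-magnitude spectrum with an adaptive
    (mean-based) threshold."""
    spectrum = np.fft.fftshift(np.fft.fft2(pattern))
    magnitude = np.log1p(np.abs(spectrum))
    return (magnitude > magnitude.mean()).astype(np.uint8)

def inconsistency(p_pra: np.ndarray, p_exp: np.ndarray) -> float:
    """Inconsistency = 1 - Jaccard similarity of the two binary spectra
    (the intersection counts pixels that equal 1 in both patterns)."""
    b_pra, b_exp = to_binary_spectrum(p_pra), to_binary_spectrum(p_exp)
    intersection = np.logical_and(b_pra, b_exp).sum()
    union = np.logical_or(b_pra, b_exp).sum()
    return 1.0 - intersection / max(union, 1)
```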
With the described inconsistency metric, we propose our specific defense methodology, which contains self-detection and image recovery, as described in Figure 11.
Self-Detection: For each input image, we apply CAM to determine the source location of the largest model activations. Then, we crop the image to obtain the pattern with maximum activations. During the semantic test, we calculate the inconsistency between \(P_{pra}\) and \(P_{exp}\). If it is higher than a predefined threshold \(T_{ic}\), we consider an adversarial input to be detected. The threshold value \(T_{ic}\) is determined in a preprocessing step. Specifically, for a given dataset (e.g., ImageNet-10), we first generate the synthesized semantic patterns for each class (e.g., 100 patterns in our experiment). Then, we calculate the inconsistency values across the patterns in each class and assign the average value as \(D_{avg}^{ground}(i)\), where i indicates the \(i{\rm {th}}\) class. Next, we generate a certain number of adversarial patches for each class (10 in our experiment) and calculate the inconsistency values between them and the target synthesized semantic patterns. We take the average as \(D_{avg}^{adv}(i)\). Based on these settings, the value range of the threshold \(T_{ic}\) for each class lies between \(D_{avg}^{ground}(i)\) and \(D_{avg}^{adv}(i)\).
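The per-class threshold selection can be sketched as follows, reusing the inconsistency() helper from the previous sketch. Interpreting the per-class inconsistency as pairwise comparisons among the synthesized patterns, and picking the midpoint of the admissible range, are illustrative choices, since the text only constrains \(T_{ic}\) to lie between the two averages.

```python
import itertools
import numpy as np

def class_threshold(ground_patterns, adv_patches) -> float:
    """Derive the detection threshold T_ic for one class.

    ground_patterns: synthesized semantic patterns of this class.
    adv_patches:     adversarial patches targeting this class.
    """
    # D_avg^ground(i): average inconsistency across the class's own
    # synthesized patterns (pairwise comparisons).
    d_ground = np.mean([inconsistency(a, b)
                        for a, b in itertools.combinations(ground_patterns, 2)])
    # D_avg^adv(i): average inconsistency between adversarial patches
    # and the target synthesized semantic patterns.
    d_adv = np.mean([inconsistency(p, g)
                     for p in adv_patches for g in ground_patterns])
    # T_ic must lie between the two averages; the midpoint is one
    # simple illustrative choice.
    return (d_ground + d_adv) / 2.0

def is_adversarial(p_pra, p_exp, t_ic) -> bool:
    """Self-detection: flag the input if its inconsistency exceeds T_ic."""
    return inconsistency(p_pra, p_exp) > t_ic
```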
Data Recovery: After the patch is detected and located, we conduct image data recovery by directly removing the patch from the original input data. In our case, considering the requirement of a lightweight computation workload, two potential image inpainting methods are adopted: the Zero Mask and Telea methods [40] (shown in Figure 12).
Zero Mask directly sets all pixel values inside the patch area to 0, which achieves the smallest computation workload and has already been applied in a recent adversarial patch defense work [50]. As Figure 12 shows, the masked area will not affect the image classification result if the patch is located outside the object. However, when the patch is inside the object, directly masking the pattern with black will degrade subsequent prediction performance. On the other hand, the Telea method achieves better inpainting performance while slightly sacrificing computation efficiency. We will evaluate the recovery performance of the two methods in terms of effectiveness and efficiency in Section 8.
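Both recovery options map directly onto OpenCV primitives (cv2.inpaint provides the Telea method). In the sketch below, the binary patch-mask convention, the 3-pixel inpainting radius, and the assumption of an 8-bit input image are illustrative choices, not the paper's exact settings.

```python
import cv2
import numpy as np

def zero_mask_recover(image: np.ndarray, patch_mask: np.ndarray) -> np.ndarray:
    """Zero Mask: set every pixel inside the detected patch area to 0."""
    recovered = image.copy()
    recovered[patch_mask.astype(bool)] = 0
    return recovered

def telea_recover(image: np.ndarray, patch_mask: np.ndarray,
                  radius: int = 3) -> np.ndarray:
    """Telea inpainting: fill the patch area from the surrounding pixels."""
    mask_u8 = (patch_mask > 0).astype(np.uint8) * 255
    return cv2.inpaint(image, mask_u8, radius, cv2.INPAINT_TELEA)
```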
5.2 Computational Complexity Analysis
The total computational complexity of the defense process in the image classification scenario results from the following four steps: CNN inference, maximum activation pattern localization, inconsistency metric calculation, and image inpainting. We model each step's computational complexity as follows.
CNN Inference: When the input image is first fed into the CNN, the inference computational complexity \(C_C\) is
\[
C_C \sim \mathcal {O}\left(\sum _{i=1}^{L}\sum _{j=1}^{n_i} {(r^j_i)}^2\, h^j_i w^j_i\right),
\]
where \({(r^j_i)}^2\) represents the \(j{\rm {th}}\) filter's kernel size in the \(i{\rm {th}}\) layer, \(h^j_i w^j_i\) denotes the corresponding size of the output feature map, L is the total number of layers, and \(n_i\) is the number of filters in the \(i{\rm {th}}\) layer.
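As a small worked example, the per-layer terms of \(C_C\) can be accumulated directly from the layer specifications; the toy layer list below is hypothetical and only illustrates the bookkeeping.

```python
def conv_inference_complexity(layers):
    """Accumulate sum_i sum_j (r_i^j)^2 * h_i^j * w_i^j over all layers.

    layers: list of per-layer specs, each a list of per-filter tuples
            (kernel_size r, output_height h, output_width w).
    """
    return sum(r * r * h * w
               for layer in layers
               for (r, h, w) in layer)

# Hypothetical two-layer toy network: 4 filters of 3x3 producing 32x32
# feature maps, then 8 filters of 3x3 producing 16x16 feature maps.
toy_layers = [[(3, 32, 32)] * 4, [(3, 16, 16)] * 8]
print(conv_inference_complexity(toy_layers))  # order-of-magnitude estimate of C_C
```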
Primary Activation Pattern Localization: Since computation complexities of other operations such as cropping are negligible, we consider CAM to contribute the primary computational complexity in this step. In CAM, each spatial location \((x,y)\) in the last convolutional layer is the weighted sum of K activations. Therefore, the total computational complexity is \(C_M \sim \mathcal {O}(Kh^{n_L}_L w^{n_L}_L)\), where \(h^{n_L}_L w^{n_L}_L\) is the size of the feature map at the last convolutional layer.
Inconsistency Metric Derivation: This step consists of the 2D-FFT calculation and the JSC calculation. According to the analysis in [20], the computational complexities of these two processes can be approximated as \(C_F \sim \mathcal {O}(N\log N)\) and \(C_J \sim \mathcal {O}(n_a\log n_a)\), where N and \(n_a\) represent the pixel numbers in the input image and in the maximum activation pattern, respectively.
Image Inpainting: For Zero Mask, the total operation number is \(C_z \sim \mathcal {O}(n)\), where n is the pixel number inside the patch. For the Telea method, the total computation complexity is \(C_t \sim \mathcal {O}(3bn)\), where b represents the total operation number when inpainting each pixel.
Compared with activation localization, metric derivation, and image inpainting, CNN inference dominates the entire computational complexity in the image scenario. Since our methodology involves only one CNN inference, its computation workload is of the same order as that of normal CNN prediction.