1. Introduction
The development of DNNs brings significant convenience to people's lives. Recently, however, researchers have found that DNNs are vulnerable to adversarial examples [1]. These works have shown that an image with small perturbations can fool a classification system trained with DNNs. Such images with imperceptible perturbations are called adversarial examples (AEs).
Low-cost adversarial examples make DNNs return wrong outputs and thus cause great damage to applications based on DNNs. Adding perturbations to road signs would cause an auto-driving system to make a wrong decision [2]. Wearing adversarial glasses [3] or hats [4] would enable a person to impersonate someone else when passing surveillance systems. Dressing in an adversarial T-shirt would make a criminal disappear from surveillance systems [5]. These AEs cause great inconvenience and harm to people's lives.
Concerns about the security of video models have turned the focus of AE research to adversarial video examples (AVEs). Recently, key applications based on video classification models have begun to be deployed in critical systems that are directly related to personal safety and property security. For example, they are widely used in smart homes [6], automatic driving [7], elderly care [8], and property protection. The harm brought by AEs poses a great threat to those critical systems. Therefore, it is crucial to study the adversarial examples of video models and thus further improve the robustness of those models.
According to the number of perturbations added to the inputs, current AVE attacks can be divided into two classes: sparse AVEs and dense AVEs. Sparse AVEs add perturbations only on several frames [9], not on all frames, whereas dense AVEs add perturbations on every frame of a video (e.g., [10,11]). At the beginning of research on AVEs, people focused on breaking the boundary between AVEs and AEs and tried to transfer AE generation methods, such as FGSM [1] and other methods [12,13], to videos. However, these methods have low fooling rates. Then, some researchers generated more effective AVEs based on video features (such as [10,11]). However, these methods cost great resources to generate one AVE because they need to add perturbations on each frame of a video. Therefore, the work [9] proposes a method that only adds perturbations on several sparse frames to decrease the number of perturbed frames and achieve a better fooling rate.
In sparse AVEs, temporal sparsity and spatial sparsity are used to measure the numbers of perturbed frames and pixels. Specifically, temporal sparsity is the proportion of clean frames among all frames of a video, and spatial sparsity is the proportion of clean pixels among all pixels of a video. Higher temporal and spatial sparsity means that fewer frames and pixels are perturbed. Their detailed definitions are described in the following section.
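Informally, and consistent with the description above (the paper's formal definitions may differ in notation), the two quantities can be written as

$$ \text{temporal sparsity} = \frac{\#\,\text{clean frames}}{\#\,\text{all frames}}, \qquad \text{spatial sparsity} = \frac{\#\,\text{clean pixels}}{\#\,\text{all pixels}}. $$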
Compared to dense AVEs, sparse AVEs avoid redundant adversarial perturbations and have the following advantages: (1) A sparse video attack reduces computational complexity. It adds perturbations only on several frames and has a lower total amount of adversarial perturbation, so there is no need to spend computation generating complex perturbations; according to our experimental results, our sparse attack costs less time to generate adversarial videos than other sparse attacks. (2) A sparse video attack improves imperceptibility to human observers. Our sparse attack has higher spatial and temporal sparsity than other sparse adversarial video attacks, which means it needs to perturb fewer pixels to generate adversarial videos and is therefore less visible to human observers. (3) A sparse video attack gives a better interpretation of adversarial video attacks. The perturbation positions reveal which frame, and which part of that frame, is important but also vulnerable for the prediction of the video classifier. (4) A sparse video attack decreases the number of queries needed under black-box settings. It generates adversarial videos in a low-dimensional manifold and thus reduces the high dimensionality of videos; compared to dense attacks, the sparser the adversarial video attack, the fewer queries it needs to make to the black-box model.
However, current sparse video attacks are not sparse enough. First, they focus only on generating AVEs with temporal sparsity and ignore spatial sparsity. Specifically, they consider how to decrease the number of perturbed frames but not how to decrease the number of perturbed pixels within those frames. Different from images, however, videos have two dimensions of features, temporal and spatial, so it is crucial to consider temporal and spatial sparsity simultaneously. Second, the achieved temporal sparsity is not high enough. Current sparse AVEs still need to perturb nearly half of the video frames to achieve a successful attack. Moreover, as video classification models develop, they take more frames as input to obtain better performance, for example increasing from 16 to 48 frames extracted from a video, so the cost of current sparse AVEs would increase significantly. Therefore, there is still a gap between the achieved temporal sparsity and the real requirements, and the key point of a sparse attack is how to increase the temporal and spatial sparsity of an adversarial video simultaneously as much as possible.
Figure 1 shows one of the AVEs generated by our algorithm. In Figure 1, the last row shows the adversarial perturbations that we add to the original video. We need to perturb only one frame and a few pixels to generate an AVE with better spatial and temporal sparsity.
Therefore, in this paper, we present a method to generate sparse adversarial videos that significantly improves temporal and spatial sparsity at the same time. We are motivated by the physical meaning of the forward derivative: by definition, the forward derivative of each pixel reflects the contribution of that pixel to the output of the video model. Thus, we calculate the forward derivatives of the frames, which likewise reflect the contribution of each frame to the output. We select key frames and key pixels according to their forward derivatives and then add perturbations only on the selected key frames and pixels. This selection of key frames and pixels significantly improves temporal and spatial sparsity.
However, compared with image data, video data have the problem of dimension explosion when computing the forward derivative. To solve this problem, we introduce super-pixels [14] and compute the gradients based on the super-pixels, i.e., the pixels within one super-pixel share the same gradients. In this way, we can reduce the number of pixels to be computed.
Moreover, different from other adversarial example generation methods that adapt to only one setting, either white-box or black-box, we devise a framework that generates adversarial examples under both settings. In the white-box video attack, the attacker knows all the knowledge of the model, while in the black-box video attack, the attacker can only access the model's outputs.
Specifically, in the white-box attack, we first use super-pixels to reduce the dimension of the video and compute each super-pixel's forward derivative as its contribution. Then we sum these values within a frame as the contribution of that frame and search for the key frames, key super-pixels, and key pixels by ranking those contributions. Lastly, we acquire the adversarial video by perturbing only these key pixels. In this way, we can find the adversarial video closest to the original sample, i.e., the one with the fewest disturbed pixels.
In the black-box attack, we cannot directly calculate the contribution of frames or pixels due to the lack of the model's details. Therefore, we estimate the forward derivative based on the natural evolution strategy (NES) [15]. However, the resource consumption of this estimation grows with the number of pixels to be estimated, and its accuracy drops. Therefore, different from the white-box attack, we first select the key frames by the output of every single frame to reduce the dimension of the video, and then use super-pixels to reduce the dimension of the key frames, which decreases the number of objects to be estimated to as few as possible. After that, we select the key pixels by the estimated contribution. Finally, we add perturbations on the selected key pixels. Through iterative computation, we generate an adversarial video with the smallest perturbations in a short time.
Our major contributions can be summarized as follows:
We propose a novel method to generate adversarial videos with large sparsity in both the temporal and spatial domains. It adds perturbations only on the key pixels of the key frames and finally generates an adversarial video with fewer pixels disturbed in a short time.
We introduce super-pixels to solve the dimension explosion problem that exists in attacking video data. This method can decrease the number of pixels whose gradients need to be computed, and thus improve the efficiency of generating adversarial examples.
Our algorithm works in two different settings: the white-box attack and the black-box attack. We test its attacking performance against two mainstream models on two public datasets. Results show that, compared with state-of-the-art video attacking methods, our method achieves similar attacking performance while polluting less than 1% of the pixels and costing less time.
The rest of this paper is organized as follows: In Section 2, we briefly review the related work. In Section 3, we describe our algorithm in detail. In Section 4, we present our experimental results and compare them with other methods. Finally, we conclude our paper in Section 5.
3. Methodology
Our sparse attack on video models can be implemented in both white-box and black-box conditions and can generate target and non-target AVEs in both conditions.
Figure 2 overviews the proposed method. The algorithm is divided into two parts: the white-box attack shown at the top and the black-box attack at the bottom. In the white-box attack, we can access the structure and parameters of the threat classifier F, but in the black-box attack, the only knowledge of the model we have is its outputs.
In the white-box setting, the algorithm calculates super-pixels to reduce the dimension of the input video and then calculates the forward derivative of each super-pixel as its contribution to the output. By summing the forward derivatives of all super-pixels within a frame, it obtains the contribution of that frame to the output. Then, we construct a saliency map to show which inputs have the greatest impact on the specific output of the classifier. Through the saliency map, we select the key frames, key super-pixels, and key pixels. Finally, the algorithm adds adversarial perturbations to the key pixels to generate effective AVEs. Through this method, we can generate AVEs close to the original example.
In the black-box attack, we first select the key frame according to the output of each frame. Then, the algorithm uses super-pixels to reduce the dimension of the key frame, and because the model parameters are inaccessible, we estimate the forward derivative based on NES instead. Finally, we select the key pixels according to the estimated forward derivative and add adversarial perturbations to them.
Above all, the key differences between the two settings are how the forward derivative of each super-pixel is acquired and at which point the dimension reduction is performed. We detail these differences in the following sections.
3.1. White-Box Attack
Denotation: In this setting, we define the classifier function of the threat model as F, and a clean video input as $X \in \mathbb{R}^{T \times W \times H}$, where $T$, $W$, and $H$ denote the number of frames, the frame width, and the frame height, respectively. The ground-truth label of X is defined as $y \in \{1, \dots, C\}$, where C is the number of classes. $X_i$, $i \in \{1, \dots, T\}$, is the i-th frame of X. We use $S_{i,j}$ to denote the j-th super-pixel of frame $X_i$, and $p_{i,j,k}$ to denote the k-th pixel within $S_{i,j}$. Thus, $X_i = \bigcup_j S_{i,j}$ and $S_{i,j} = \{p_{i,j,k}\}_k$. An adversarial video $X^{adv}$ makes $F(X^{adv}) \neq y$ in the non-target attack and $F(X^{adv}) = g$ in the target attack, where g is the target label. If a frame, a pixel, and a super-pixel are labeled as $X^{key}$, $p^{key}$, and $S^{key}$, it means that they are a key frame, a key pixel, and a key super-pixel, respectively. We also define $\Phi$ as the set of key frames and $\Psi$ as the set of key super-pixels.
The main flow of the algorithm under the white-box setting is as follows:
Determine the label of attack target.
Reduce dimension for the video.
Compute forward derivative and saliency map.
Search for key frames and key pixels.
Add perturbations.
3.1.1. Determine the Label of Attack Target
The first step of the algorithm is to determine the label of the attack target, which can be specified by the attacker. In the non-target attack, the algorithm automatically selects the category closest to the original sample as the target category g. Compared with selecting a target label randomly, this costs the minimum resources to generate an AVE that makes the model give a wrong label. The selection is given by Equation (2), where the loss is the cross-entropy function.
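As an informal illustration of this step (a sketch of one plausible choice, not the exact form of Equation (2)), the closest class can be taken as the non-ground-truth class with the smallest cross-entropy loss on the clean video. The PyTorch sketch below assumes a hypothetical model interface that maps a video tensor to a (1, C) tensor of class logits.

```python
import torch
import torch.nn.functional as F_nn

def closest_target_label(model, video, true_label):
    """Pick the non-ground-truth class with the smallest cross-entropy loss,
    i.e., the class the clean video is already closest to (assumed heuristic)."""
    with torch.no_grad():
        logits = model(video)                      # assumed output shape: (1, C)
    num_classes = logits.shape[1]
    losses = F_nn.cross_entropy(
        logits.expand(num_classes, -1),            # same logits, one row per candidate class
        torch.arange(num_classes),
        reduction="none",
    )
    losses[true_label] = float("inf")              # never choose the ground-truth class
    return int(losses.argmin())
```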
3.1.2. Reduce Dimension for the Video
Super-pixel algorithms group pixels into perceptually meaningful atomic regions that can be used to replace the rigid structure of the pixel grid. They capture image redundancy and greatly reduce the complexity of subsequent image processing tasks [14]. We first reduce the dimension of a video by introducing SLIC, a state-of-the-art super-pixel algorithm, which adapts k-means clustering to generate super-pixels. Ideally, after this process, if the pixels around one pixel in a frame are combined into the same super-pixel, the number of pixels whose derivatives need to be computed decreases from $T \times W \times H$ to $T \times M$, where $M$ is the number of super-pixels per frame. We denote the video after super-pixel calculation as $X^{sp}$.
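As a concrete sketch of this dimension-reduction step (using scikit-image's SLIC as one possible implementation; the frame layout and segment count are assumptions), the per-frame super-pixel labels and their mean values can be obtained as follows:

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_reduce(video, n_segments=200):
    """Reduce a (T, H, W, 3) float video to per-frame SLIC labels and mean colours.

    Returns a (T, H, W) label array and a list whose i-th entry is an
    (M_i, 3) array with one mean-colour row per super-pixel of frame i."""
    labels, sp_values = [], []
    for frame in video:
        lab = slic(frame, n_segments=n_segments, compactness=10, start_label=0)
        means = np.stack([frame[lab == s].mean(axis=0) for s in range(lab.max() + 1)])
        labels.append(lab)
        sp_values.append(means)
    return np.stack(labels), sp_values
```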
3.1.3. Compute Forward Derivative and Saliency Map
After reducing the dimension of the video, the algorithm needs to acquire the contribution of each super-pixel to perform the selection step.
We thus compute the forward derivative of the processed video $X^{sp}$ with respect to the output of the classifier F, denoted $\nabla F(X^{sp})$. $\nabla F(X^{sp})$ is a $T \times M \times C$ tensor which describes the contribution of each super-pixel of the video to the score of the classifier under all C classes. Each element $\nabla F(X^{sp})[i, j, c]$ of that cube is the first-order partial derivative of the output of the c-th class with respect to the j-th super-pixel $S_{i,j}$ within the i-th frame $X_i$. That value is positively correlated with the contribution of that super-pixel to the score of the current class. We compute $\nabla F(X^{sp})$ by Equation (3).
Specifically, according to the physical meaning of the forward derivative [26], it tells us which input regions are most sensitive to minor perturbations. When the forward derivative of super-pixel $S_{i,j}$ is bigger than 0, that is, $\nabla F(X^{sp})[i, j, g] > 0$, adding minor perturbations to that super-pixel increases the output of the target class g. Meanwhile, the bigger $\nabla F(X^{sp})[i, j, g]$ is, the more influence the super-pixel has on the output of the model. Therefore, the value of Equation (3) shows the influence of the super-pixels.
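For concreteness, one way to obtain such per-super-pixel forward derivatives (a sketch of a possible implementation, not necessarily the one used in the paper) is to back-propagate each class score to the input pixels and aggregate the pixel gradients within each SLIC segment. The sketch below assumes a model that maps a (T, H, W, 3) video tensor to a vector of C class scores.

```python
import torch

def superpixel_forward_derivative(model, video, segments, num_classes):
    """Per-super-pixel forward derivative for all classes.

    video:    float tensor of shape (T, H, W, 3)
    segments: long tensor of shape (T, H, W) holding SLIC labels 0..M-1 per frame
    Returns a (T, M, num_classes) tensor of derivatives."""
    video = video.clone().detach().requires_grad_(True)
    scores = model(video)                      # assumed to return a (num_classes,) tensor
    T = video.shape[0]
    M = int(segments.max().item()) + 1
    deriv = torch.zeros(T, M, num_classes)
    for c in range(num_classes):
        grad = torch.autograd.grad(scores[c], video, retain_graph=True)[0]
        pix = grad.sum(dim=-1)                 # collapse the colour channels
        for i in range(T):
            lab = segments[i].reshape(-1)
            # pixels inside one super-pixel share one (summed) derivative value
            deriv[i, :, c] = torch.zeros(M).index_add_(0, lab, pix[i].reshape(-1))
    return deriv
```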
$F_c(X^{sp})$ denotes the output of the c-th class for the dimension-reduced video. According to the cube $\nabla F(X^{sp})$, we can acquire the forward derivative of each super-pixel with respect to all C classes. Specifically, the final output of the classification model depends on the output scores of the video for all C classes. Therefore, to measure the criticality of a super-pixel, the algorithm needs to consider both its contribution to the target class and its contribution to the other classes. If a super-pixel makes a positive contribution to the target class but a negative contribution to the other classes, that super-pixel plays a major role when it is disturbed. Therefore, to generate more effective AVEs, we need to find super-pixels with a positive forward derivative for the target class but a negative one for the other classes. To search for super-pixels with these features, we construct the saliency map $SM$ from the forward derivative computed before, as in Equation (4).
The saliency map $SM$ is a sparse array: an element has a valid value only when the corresponding super-pixel has a positive forward derivative for the target class and a negative one for the other classes; otherwise, the value is 0. We call a super-pixel with a valid value an effective super-pixel, meaning that it contributes more to generating the adversarial video than other super-pixels. Adding appropriate perturbations to such a super-pixel can move the video toward the target class more easily. Equation (4) shows the saliency map.
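Equation (4) is not reproduced here, but a saliency map of the standard JSMA form [26], adapted from pixels to super-pixels, matches the behavior described above:

$$ SM[i,j] = \begin{cases} 0, & \text{if } \nabla F(X^{sp})[i,j,g] < 0 \ \text{or} \ \sum_{c \neq g} \nabla F(X^{sp})[i,j,c] > 0,\\[1ex] \nabla F(X^{sp})[i,j,g]\,\Big|\sum_{c \neq g} \nabla F(X^{sp})[i,j,c]\Big|, & \text{otherwise.} \end{cases} $$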
3.1.4. Search for Key Frames and Key Pixels
With the help of $SM$, the algorithm can measure the effectiveness of each super-pixel. However, if we filtered the key super-pixels directly over the whole video by $SM$, the final generated perturbations might spread over all frames of a video sample, resulting in a very low temporal sparsity of the generated adversarial videos. Therefore, to improve temporal sparsity, we should control the number of disturbed frames: we first filter the key frames and then select the key super-pixels within those frames.
In order to search for the key frames, the algorithm calculates the saliency value of each frame, $SM_i$, by Equation (5). Similar to the physical meaning of $SM$, $SM_i$ measures the criticality of each frame of the input video. Therefore, the algorithm selects the frame $X_i$ with the maximum $SM_i$ as the key frame $X^{key}$, and the super-pixel with the largest saliency map value as the key super-pixel $S^{key}$ of $X^{key}$.
However, in this part, we should solve two problems: one is how to select more key frames and super-pixels effectively; the other is that adding perturbations directly to the key super-pixels brings redundancy. We detail them below.
First, selecting only one key frame and one key super-pixel does not work well. In most cases, an effective attack requires choosing more than one key frame and key super-pixel to perturb; adding perturbations on a single key frame or key super-pixel alone is not enough to cause the model to misclassify the modified input. Therefore, we define n as the number of key super-pixels selected in each iteration to make the attack successful. There is no need to set a variable such as n for the number of key frames, even though we also need to search for more than one key frame, because the maximum $SM_i$ changes with different n; therefore, the index of the key frame changes automatically with the key super-pixels in the experiment.
We rank the saliency map of the selected key frame and select the super-pixels with the top n values to construct the set of key super-pixels $\Psi$ of the key frame $X^{key}$. When the algorithm terminates, we collect the index i of the key frame selected in each iteration and integrate them into the set of key frames $\Phi$.
Second, directly adding perturbations to the key super-pixels brings redundancy. A super-pixel contains many real pixels of a video. If perturbations were added to all pixels belonging to the key super-pixels, they would be more perceptible to human eyes. Moreover, these pixels do not share the same forward derivative in the real video, so their real contributions to the output are not the same, and perturbing entire key super-pixels would waste perturbed pixels. Therefore, we select only n pixels of those super-pixels as key pixels $p^{key}$ and add perturbations only on these key pixels.
Lastly, the algorithm should set an appropriate value of n. When n is small, the algorithm will drop into an “endless loop” quickly. The reason is that the perturbed key pixels reach the boundary value and cannot be modified anymore, so the algorithm will compute the same saliency map, select the same key frame and the same key super-pixels in each iteration, and finally still modify the same pixels of those super-pixels.
To solve this problem, one trick is to increase n, and the other is to enlarge the boundary value of each pixel. However, if we choose the second trick, the perturbation on each key pixel becomes larger, which leads to a larger total distortion of the key frame and thus more perceptible perturbations in the adversarial video. Considering that the number of disturbed pixels and the perturbation of each pixel are conflicting variables, we should find a balance between them under the constraint of a successful attack. In our algorithm, we dynamically change n to find that balance while keeping the pixel boundary value unchanged. Specifically, considering the global time consumption, we set n to a small value at the beginning of the algorithm and gradually increase it when all the key pixels reach the boundary; the algorithm doubles n when it falls into an "endless loop".
Above all, in the process of selecting the key super-pixels, we first rank the saliency map with a small initial n, and then select the key frame, the n key super-pixels, and the n key pixels. If the perturbations are not enough to make the attack successful, the algorithm doubles n and repeats the process.
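As an illustrative sketch of this selection step (helper names and the choice of one key pixel per key super-pixel are assumptions), the top-n key super-pixels of the chosen key frame and one key pixel inside each of them could be picked as follows:

```python
import numpy as np

def select_keys(frame_saliency, key_frame_labels, n):
    """Pick the top-n key super-pixels of a key frame and one key pixel in each.

    frame_saliency:   (M,) saliency value per super-pixel of the key frame
    key_frame_labels: (H, W) SLIC labels of the key frame"""
    key_sps = np.argsort(frame_saliency)[::-1][:n]      # super-pixels with largest saliency
    key_pixels = []
    for s in key_sps:
        ys, xs = np.nonzero(key_frame_labels == s)      # pixels belonging to super-pixel s
        k = np.random.randint(len(ys))                  # one representative pixel (illustrative)
        key_pixels.append((int(ys[k]), int(xs[k])))
    return key_sps, key_pixels
```

If the attack does not succeed with these pixels, n would be doubled and the selection repeated, as described above.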
3.1.5. Add Perturbations
Finally, we add perturbations to these key pixels within the key frames by Equation (6). The perturbations should move the video as close as possible to the target class in the target attack, or as far as possible from the ground-truth class in the non-target attack. Therefore, the direction of the perturbation should be the same as the sign of the forward derivative of that pixel under the target class, i.e., $\mathrm{sign}(\nabla F(X^{sp})[i, j, g])$. The perturbation step is the basic value of disturbance added to each key pixel in each iteration, and its sign is the same as $\mathrm{sign}(\nabla F(X^{sp})[i, j, g])$, the direction of the super-pixel that the key pixel belongs to.
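A minimal sketch of this update (the step size, clipping bounds, and helper names are assumptions, and Equation (6) may differ in detail) is:

```python
import numpy as np

def perturb_key_pixels(frame, key_pixels, signs, step, v_min=0.0, v_max=1.0):
    """Move each key pixel by a signed step and clip it to the valid pixel range.

    signs[k] is the sign of the forward derivative, w.r.t. the target class,
    of the super-pixel that key pixel k belongs to."""
    frame = frame.copy()
    for (y, x), s in zip(key_pixels, signs):
        frame[y, x] = np.clip(frame[y, x] + s * step, v_min, v_max)
    return frame
```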
The whole algorithm in the target attack is summarized in Algorithm 1, where $I_{max}$ is the maximum iteration number, and $V_{min}$ and $V_{max}$ are the boundary values of the pixels in the video.
Algorithm 1 Crafting the White-Box Attack in the Target Mode
Input: a clean video X. Parameters: the target label g (in the target mode), n, the perturbation step, $I_{max}$, $V_{min}$, $V_{max}$. Output: the adversarial video $X^{adv}$.
1: Initialize $X^{adv} \leftarrow X$, the iteration counter, and n.
2: Set the target label g.
3: Reduce the dimension of X using SLIC: $X^{sp} = \mathrm{SLIC}(X)$.
4: while $F(X^{adv}) \neq g$ and the iteration counter $< I_{max}$ do
5:  Compute the forward derivative by Equation (3).
6:  Construct $SM$ by Equation (4).
7:  Compute $SM_i$ by Equation (5).
8:  Select the frame with the maximum $SM_i$ as the key frame $X^{key}$.
9:  if $X^{key} \notin \Phi$ then
10:   Add $X^{key}$ to the set of key frames $\Phi$.
11:  end if
12:  Rank $SM$ of $X^{key}$.
13:  Select the super-pixels with the top n values of $SM$ as the key super-pixels of $X^{key}$.
14:  Select n key pixels within the key super-pixels of $X^{key}$.
15:  Add perturbations to the key pixels by Equation (6).
16:  if all the values of the key pixels ∉ $(V_{min}, V_{max})$ then
17:   $n \leftarrow 2n$.
18:  end if
19:  Increase the iteration counter.
20: end while
21: return $X^{adv}$
3.1.6. Analysis of the Algorithm under the White-Box Setting
This section analyzes the computational complexity and the implementation cost of the AVE generation method under the white-box setting. First, according to the work [14], the computational complexity of the SLIC algorithm is $O(N)$, where N is the number of pixels for which super-pixels need to be calculated. Under the white-box setting, super-pixels need to be calculated for the whole video. Therefore, $N = T \times W \times H$, and thus the complexity of this step is $O(T \times W \times H)$. In addition, in the algorithm's loop, the key step is computing the forward derivatives of the video. The complexity of that step is also linear in the number of objects whose forward derivatives need to be calculated, which is the number of super-pixels of the video. Therefore, the complexity of computing the forward derivatives is $O(T \times M)$. Because the maximum number of loop iterations is $I_{max}$, the worst-case complexity is $O(T \times W \times H + I_{max} \times T \times M)$, and the best-case complexity is $O(T \times W \times H + T \times M)$.
However, the algorithm uses DNNs to compute the related quantities, and the DNN model has too many parameters to calculate the complexity exactly. Therefore, in our paper, we use the running time as the metric to evaluate the time complexity of the algorithm. The detailed results are shown in Section 4.1.3.
3.2. Black-Box Attack
In the black-box setting, we can only access the output of the video model. Therefore, different from the white-box attack, we estimate the forward derivative of the model based on NES [15] instead of computing it directly.
However, the computational cost is positively correlated with the number of super-pixels that need to be estimated. In a black-box attack, if we used the same method as the white-box attack, which selects the key frame according to the forward derivative, we would have to estimate the forward derivatives of all the super-pixels of a video. That step is time-consuming. In addition, the more derivatives need to be estimated, the lower their accuracy. Therefore, it is necessary to reduce the number of estimated objects.
The main flow of the algorithm under the black-box setting is as follows:
Determine the label of attack target.
Search for key frames.
Compute forward derivative and reduce dimension for the key frames.
Estimate the forward derivative and saliency map.
Search for key pixels.
Add perturbations.
The differences between the white-box and black-box flows are the steps "search for key frames", "estimate the forward derivative and saliency map", and "search for key pixels". We detail these differences as follows.
3.2.1. Search for Key Frames
We know that the key frames contribute more to the output than ordinary frames. Therefore, we compute the output of the target class g for each frame, and the frame with the largest value is taken as the key frame. Specifically, we define a mask vector $e$ of length T, where all elements are 0 initially. Then, we change the i-th element of $e$ from 0 to 1 and compute $F_g(e \odot X)$ as the contribution of frame $X_i$ to the target class. We construct a list $L$ to record these values, where each element $L_i$ is the contribution of frame $X_i$. Finally, the frame with the largest value in the list is the key frame $X^{key}$ of that iteration, as shown in Equation (7).
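As an illustrative sketch of this key-frame search (the query interface is an assumption about how the black-box model is accessed), each frame is kept in isolation while the others are zeroed out, and the frame giving the highest target-class score is chosen:

```python
import numpy as np

def select_key_frame(query_target_score, video, exclude=()):
    """Pick the frame whose single-frame input yields the highest target-class score.

    query_target_score(v) -> float: black-box query returning the target class g's score.
    video: array of shape (T, H, W, 3)."""
    T = video.shape[0]
    contributions = np.full(T, -np.inf)
    for i in range(T):
        if i in exclude:                                 # optionally skip already-used frames
            continue
        mask = np.zeros(T)
        mask[i] = 1.0                                    # keep only frame i, zero out the rest
        contributions[i] = query_target_score(video * mask[:, None, None, None])
    return int(np.argmax(contributions)), contributions
```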
3.2.2. Estimate the Forward Derivative and Saliency Map
In the black-box attack, as we cannot access the full knowledge of the model, we cannot compute the forward derivative directly, so we consider another way to acquire it. We estimate the forward derivative of the super-pixels $S_{i,j}$ within the key frame $X^{key}$ based on NES as the contribution of each super-pixel to the output. Different from maximizing the expected value of the loss function [27], we maximize the expected value of the output under the searching distribution $\pi(\theta \mid X^{key})$. For a frame $X^{key}$, we compute it as Equation (8) [15]:
With a trick similar to [15], we choose the antithetic searching distribution of random Gaussian noise around the key frame, i.e., $\theta = X^{key} + \sigma \delta$, where $\delta \sim \mathcal{N}(0, I)$, $\sigma$ is a constant, and $\delta$ has the same size as $X^{key}$. For sampling, we draw $\delta_k$, $k \in \{1, \dots, m/2\}$, and set $\delta_{m/2+k} = -\delta_k$, where m is the number of sampled points. Each element of $\delta_k$ and $\theta_k$ is defined as $\delta_{k,j}$ and $\theta_{k,j}$, respectively, $j \in \{1, \dots, M\}$, and M is the number of super-pixels of $X^{key}$. Lastly, the forward derivative of the key frame $X^{key}$ can be estimated with Equation (9):
Here, $\hat{\nabla}_j$ denotes the j-th element of the estimated forward derivative. Similar to the white-box attack, we then construct a saliency map of the key frame $X^{key}$.
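For reference, the standard antithetic NES estimator of [15], written here for the super-pixel values of the key frame, looks like the following sketch (the query interface and sample counts are assumptions):

```python
import numpy as np

def nes_superpixel_gradient(query_score, sp_values, sigma=0.01, m=30):
    """Estimate d(target-class score)/d(super-pixel values) with antithetic NES sampling.

    query_score(v) -> float: black-box query of the target-class score for a key frame
                             rebuilt from the super-pixel values v.
    sp_values: (M,) array holding the current super-pixel values of the key frame."""
    M = sp_values.shape[0]
    grad = np.zeros(M)
    for _ in range(m // 2):
        delta = np.random.randn(M)
        # antithetic pair: evaluate at +delta and -delta around the current point
        grad += (query_score(sp_values + sigma * delta)
                 - query_score(sp_values - sigma * delta)) * delta
    return grad / (sigma * m)
```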
3.2.3. Search for Key Pixels
In the white-box attack, we change n when the algorithm drops into the "endless loop", a trick that automatically adjusts the key frame's index. In the black-box attack, however, we should change the key frame index manually. The reason is that the perturbations added to each key pixel are so small that they cannot change the contribution value of that frame. Therefore, the algorithm would select the same key frame in the following iterations; if that situation continued, the chosen frame would keep being disturbed, leading to more perceptible noise. Moreover, within the selected frame, the algorithm would then choose other pixels that contribute less to the output, decreasing the perturbation effect. Hence, we should change the frame and spread these perturbations over different frames. We implement this trick by setting an iteration boundary, so that the algorithm changes the frame index when the iteration reaches that value.
Indeed, manually changing the index of the key frame brings another advantage: users can set n and the iteration boundary according to different situations. For example, if an application scenario is concerned with the perturbation on each frame but does not constrain the number of disturbed frames, we can set n to be large; if an application scenario allows only a few frames to be changed but has a higher tolerance for per-frame disturbances, we should set n to be small and the boundary value of the disturbed pixels to be large.
The whole algorithm of the targeted black-box attack is summarized in Algorithm 2. The black-box attack requires only two more parameters than the white-box attack.
3.2.4. Analysis of the Algorithm under the Black-Box Setting
This section analyzes the computational complexity and the implementation cost of the AVE generation method under the black-box setting. In the algorithm's loop under the black-box setting, the key steps are running SLIC on the key frame and estimating the forward derivatives of the super-pixels of that key frame. According to the work [14] and the work [15], the complexities of the SLIC and NES algorithms are $O(n)$ and $O(q \times m)$, respectively, where n is the number of objects for which super-pixels need to be calculated, q is the number of forward derivatives, and m is the number of sampled points used to estimate the forward derivatives. In this algorithm, the number of pixels for which super-pixels need to be calculated is $W \times H$, the pixels of a key frame, and the number of forward derivatives is M, the number of super-pixels of the key frame. Because the maximum number of loop iterations is $I_{max}$, the worst-case complexity is $O(I_{max} \times (W \times H + M \times m))$, and the best-case complexity is $O(W \times H + M \times m)$. Keeping only the dominant term, the complexity can be simplified as $O(I_{max} \times W \times H)$.
Algorithm 2 Crafting the Black-Box Attack in the Target Mode
Input: a clean video X. Parameters: those of Algorithm 1 plus the two additional black-box parameters. Output: the adversarial video $X^{adv}$.
1: Initialize $X^{adv} \leftarrow X$, the iteration counter, and n.
2: Set the target label g.
3: while $F(X^{adv}) \neq g$ and the iteration counter $< I_{max}$ do
4:  for $i = 1$ to $T$ do
5:   Set the i-th element of the mask vector $e$ to 1 and the others to 0.
6:   Compute the contribution $L_i$ of frame $X_i$ to the target class.
7:  end for
8:  Select the frame with the largest $L_i$ as the key frame $X^{key}$ by Equation (7).
9:  if $X^{key} \notin \Phi$ then
10:   Add $X^{key}$ to the set of key frames $\Phi$.
11:  end if
12:  Reduce the dimension of $X^{key}$ using SLIC.
13:  Estimate the forward derivative by Equation (10).
14:  Construct $SM$ for the key frame $X^{key}$ by Equation (4).
15:  Select the super-pixels with the top n values of $SM$ as the key super-pixels of $X^{key}$.
16:  Select n key pixels within the key super-pixels of $X^{key}$.
17:  Add perturbations to the key pixels.
18:  if all the key pixels ∉ $(V_{min}, V_{max})$ or the iteration boundary is reached then
19:   $n \leftarrow 2n$.
20:   Change the index of the key frame.
21:  end if
22:  Increase the iteration counter.
23: end while
24: return $X^{adv}$
However, the algorithm under the black-box setting needs to query the DNN model frequently. Therefore, in our paper, we use the number of queries as the metric to evaluate the time complexity of that algorithm. The detailed results are shown in Section 4.1.3.