1 Introduction

In critical applications, it is already hard to know whether we can trust the predictions of existing neural network models (Guo et al., 2017). The potential presence of an adversary makes it even harder to trust these systems. Is there a method to create classifiers that are more robust and less overconfident? We show that, for existing adversarial defenses, the classifier's softmax output overestimates the probability that its predicted class is correct, especially for powerful attackers (see Fig. 1).

Fig. 1

(CIFAR-10, ResNet-18) Confident predictions of an adversarially trained model (Ma̧dry et al. (2018) with \(\epsilon ={}^{8}\!\big /\!_{255}\)) surprisingly increase as attacks become stronger

A robust classifier must be (approximately) invariant to any transformation of the inputs that does not change the true label. Another desirable property is to avoid overconfident predictions, that is, not to overestimate the probability that the predicted class is correct. For example, whenever the classifier's softmax output makes a prediction with confidence 0.9, it must be correct at least 9 out of 10 times.

This is particularly important in the context of adversarial machine learning, where the neural network’s failure to learn the correct input invariances enables an adversary to craft small perturbations in the images which change the classifier’s predictions. This has created a perpetual arms race of new methods of crafting adversarial examples (Kurakin et al., 2017; Ma̧dry et al., 2018; Carlini & Wagner, 2017b) for which new defenses are proposed (Papernot et al., 2017; Guo et al., 2018; Dhillon et al., 2018), which are then (again) fooled by newer attack methods (Athalye et al., 2018).

In this work, we leverage the relationship between similar images to build a defense that replaces the last layer and softmax output of a classifier with a graph-based method that: (1) significantly reduces overconfidence; (2) survives adversarial attacks stronger than what was seen during training; (3) does not require retraining of the classifier.

Under the assumption that the neural network maps similar examples to nearby regions of the embedding space, we create a graph by connecting images whose representations are close to each other (see Fig. 2). Then, we use a Gossip algorithm (Boyd et al., 2005) to diffuse the labels through this graph, which improves the model's predictive uncertainty. Moreover, our defense refuses to give overconfident answers for test images that are nowhere close to the curated validation examples, such as those under strong adversarial attacks.

Note that a direct calibration objective under adversarial attacks would be the wrong goal, since calibrating an imperfect classifier requires knowing the distribution of the test data, i.e., knowing what the adversary will do. Calibrating machine learning models is a topic that has been explored for decades for traditional classifiers (Platt, 1999; Niculescu-Mizil & Caruana, 2005; Naeini et al., 2015) and has recently gained attention in the context of deep learning models (Guo et al., 2017). These approaches, however, do not specifically target overconfident predictions and are usually not developed in the context of adversarial examples (where we observe an unknown distribution shift at test time).

Moreover, since our proposed framework is based on the representations learned by the deep learning model, it can easily be combined with a multitude of current (and future) defenses (e.g. adversarial training), without incurring the additional cost of retraining the model.

Fig. 2

Images whose representations are close (as measured by the \(L^p\) norm) are connected to form a graph. The choice of the radius has a direct impact on the connectivity of the graph, which influences how we diffuse label information to reduce overconfidence of the predictions

2 Related work

While the field of adversarial examples is full of interesting developments over the past few years, doing a thorough review of the literature is beyond the scope of this work. Instead, we focus on those publications which are more closely related to our work and are more relevant to the discussion we present in this paper.

The study of adversarial examples within the context of deep learning models picked up interest after the work of Szegedy et al. (2013). In the following years, the community entered an arms race of developing stronger attacks and new defenses, which are then broken by even stronger attacks (Carlini & Wagner, 2017a). Despite this constant development, some of the simpler methods, such as Projected Gradient Descent (PGD) (Ma̧dry et al., 2018), are still very effective and remain an important tool for evaluating new defenses and techniques.

While many important improvements have been made in securing deep learning models and making them more robust to adversarial examples, most works in the field focus only on improving accuracy, and few even examine the calibration of the resulting classifier. The topic of calibration has been studied extensively for decades, from the analysis of weather forecasters in the 1980s (DeGroot & Fienberg, 1983) to standard machine learning classifiers such as SVMs, Logistic Regression, and Naïve Bayes (Platt, 1999; Naeini et al., 2015; Niculescu-Mizil & Caruana, 2005), but it was first studied in the context of deep learning by Guo et al. (2017). The Expected Calibration Error (ECE) metric we use was first proposed by Naeini et al. (2015), while the Overconfidence Error (OE)—its counterpart that only considers the miscalibration caused by overconfident predictions—was proposed by Thulasidasan et al. (2019). Improving the calibration (and in particular reducing the overconfidence) of deep learning models under adversarial attacks is an essential step toward a reliable and trustworthy classifier.

Open set recognition and other approaches: a related area of research focuses on the problem setup where unknown (or new) classes are present at test time. See Geng et al. (2020) for a survey of recent advances in this area. Also related are works that not only aim at recognizing that an object belongs to an unknown class, but also try to learn continuously by incorporating new classes in successive rounds of training (Dai et al., 2021). Our work, however, focuses on the adversarial setup, where the set of possible classes is fixed and an adversary modifies an input to (incorrectly) change the model prediction.

Adversarial training and other similar procedures: Perhaps the most common defense against adversarial examples is retraining the classifier with training data augmented with adversarial examples (adversarial training) (Goodfellow et al., 2015; Ma̧dry et al., 2018; Wang et al., 2019). Alongside adversarial training are other strategies that seek to harden the classifier by modifying the training procedure, usually by employing a regularization strategy that encourages some type of smoothness in the final model (preventing small perturbations of the input from producing large changes in the output). Among such methods are Parseval Networks (Cisse et al., 2017), which seek to train networks with low (\(<1\)) Lipschitz constants, and Laplacian Networks (Lassance et al., 2021), which use tools from Graph Signal Processing to enforce smooth variations of the class boundaries. These works are not alternatives to ours: they can be combined with our proposed strategy, since we can use the (better) embeddings produced by such methods. In fact, in our experiments, we use the embeddings of an adversarially trained model, since they provide a considerable improvement over vanilla (undefended) neural networks.

Recently, a trade-off between being accurate and being robust has been shown (Tsipras et al., 2019; Zhang et al., 2019), and Hendrycks et al. (2019) show that pre-trained models can be an important step in improving robustness. A better understanding of the geometry of adversarial examples has also emerged: traditional adversarial examples are often fragile to random perturbations, that is, a small perturbation added to an adversarial example will revert it back to the original class (Roth et al., 2019; Elliott et al., 2021; Hosseini et al., 2019; Hu et al., 2019; Guo et al., 2018). These works can be seen as simple approaches to classifier reliability, since they can output class priors whenever an adversarial example is detected, even if reducing overconfidence or improving calibration is not their main goal. This fragility has also inspired new defenses (Hu et al., 2019; Gopalakrishnan et al., 2018; Xie et al., 2018). Such robustness, however, relies on a weak adversary that can only find fragile adversarial examples, a premise that has recently been put into question with trivial changes to the adversary (Hosseini et al., 2019).

Pinto et al. (2021) propose Mix-MaxEnt, a regularization approach to improve uncertainty quantification. Serban et al. (2021) propose Deep Repulsive Prototypes, a modified training procedure with a distance-based loss that encourages separation of classes and is competitive with adversarial training. However, both approaches require re-training and cannot be applied post-hoc.

Other post-hoc defenses: In the realm of post-hoc defenses, other works have used embeddings to construct alternative classifiers, mostly by applying k-Nearest Neighbors (k-NN) over the pre-trained embeddings. This approach has been shown to improve accuracy in language models (Khandelwal et al., 2020), while for adversarial reliability, k-NNs have inspired defenses using the embeddings of neighbors from the training data (Sitawarin & Wagner, 2019a) and embeddings from multiple distinct datasets (Dubey et al., 2019). Papernot and McDaniel (2018) propose Deep k-NN, which combines k-NN with principles from conformal prediction to improve reliability. Unfortunately, new attacks have been shown to be effective against such methods (Sitawarin & Wagner, 2019b). A nearest neighbor approach to finding related examples, however, can end up considering examples that are very far from each other, since its notion of being related is determined by the number of neighbors rather than by fixing a distance, as we do in our proposed method. Moreover, the existing k-NN defenses do not address the potential Palm distribution bias, and they performed poorly in our experiments, becoming increasingly overconfident when confronted with stronger attacks.

Recent approaches (Liu et al., 2020; van Amersfoort et al., 2020; Mukhoti et al., 2021) propose replacements of the final dense layer and softmax output to promote distance-aware uncertainty quantification, based on Gaussian processes, RBF kernels, and discriminant analysis. In contrast to our work, these methods require modifications of the training procedure (e.g. spectral normalization or gradient penalty regularization) and/or the architecture, and they focus on improving out-of-distribution detection rather than investigating strong adversarial shifts. For example, in the work of Mukhoti et al. (2021), if an image is incorrectly identified as in-distribution, their method will still use the softmax outputs, which may be overconfident.

Recently, Hess et al. (2020) and Wang and Loog (2022) propose post-hoc modifications of the softmax, based on theoretical reinterpretations of the last-layer + softmax output. We provide an analysis of these methods in the Appendix.

Certified defenses, such as Randomized Smoothing (Cohen et al., 2019), have gained attention as they provide theoretical (and practical) guarantees of robustness for perturbations within a given radius around an image. However, recent attacks have been effective against them (Ghiasi et al., 2020) and, in our experiments, this defense displayed poor calibration and strong overconfidence.

Gossip algorithms: The use of gossip algorithms to perform distributed computations has been widely studied in the context of sensor and computer networks (Boyd et al., 2005), where they first emerged. Gossip algorithms have been used to develop a distributed SGD training strategy for neural networks (Blot et al., 2019). GossipNet (Hosang et al., 2017) is a CNN architecture that borrows concepts from gossip algorithms by performing message passing among neighboring sets of pixels in a single image. While all these works are based on gossip algorithms, they are unrelated to our approach: they are not concerned with reducing the overconfidence of the classifier and do not build a graph from the dataset (as our approach does). Instead, they focus on distributed training or on exchanging information within a single image, whereas we use the gossip algorithm to diffuse label information among similar images in the dataset. Our use of the gossip algorithm, in particular as a defense strategy against adversarial examples, is novel.

Graph-based algorithms: Graph-based algorithms have been used before in the context of image processing, but most such works build a graph over pixels or regions of a single image (Felzenszwalb & Huttenlocher, 2004; Shekkizhar & Ortega, 2020), with the goal of denoising or performing other (single-image) processing operations. In our work, instead, we use the embeddings of images to build a graph that encodes how related (similar) the images are to each other. We also note that, while we experimented with a couple of ways of creating such a graph of examples, other methods could easily be incorporated into our proposed gossip framework.

Finally, we want to distinguish our work from those focused on attacking graph-structured data and models, such as GNNs (Dai et al., 2018; Bojchevski & Günnemann, 2019; Liao et al., 2020). In that setting, the graph is the input to the model, which produces embeddings used for downstream tasks. The attacker's goal is to extract information from the neighborhood or to influence the embeddings by modifying the input graph (adding or removing nodes or edges).

In our proposed framework, the input data is not graph-related—we build the graph ourselves based on the mapping produced by the neural network. And our graph is constructed from a curated set of validation examples, which an adversary cannot modify.

3 Classifier calibration and uncertainty quantification preliminaries

A classifier is said to be well-calibrated when its assessment of its own uncertainty is accurate, that is, the confidence of its predictions (e.g. softmax values) matches its accuracy (i.e. among predictions made with 80% confidence, 8 out of 10 are correct). When the classifier makes more mistakes than expected (according to the confidence values), it is overconfident.

As shown in Fig. 1, strong distribution shifts (e.g., adversarial attacks) can cause models to be exceedingly overconfident (even models adversarially trained to defend against such attacks). When a model's accuracy can be compromised by adversarial attacks, it is desirable to at least have a reliable assessment of its uncertainty, by not being overconfident, so that we are not misled into making bad decisions due to wrong confidence estimates.

In this work, we assess the calibration of models with the Expected Calibration Error (ECE) (Naeini et al., 2015; Guo et al., 2017), where we bin predictions based on their confidence values and average the difference between accuracy and confidence of each bin. Another important metric is the Overconfidence Error (OE), derived from ECE, where we only consider the bins for which the confidence is higher than the accuracy. More formally, consider a dataset \({\mathcal {D}}=\{(x_i, y_i)\}_{i=1}^N\) and denote by \(B_1, \dots , B_m\) the m bins of the predictions, based on their confidence. We compute the metrics as:

$$\begin{aligned} ECE&= \sum _{b=1}^m \frac{|B_b|}{N} | \text {conf}(B_b) - \text {acc}(B_b) | \end{aligned}$$
(1)

and

$$\begin{aligned} OE&= \sum _{b=1}^m \frac{|B_b|}{N} \left[ \text {conf}(B_b) \times \text {overconf}(B_b) \right] , \end{aligned}$$
(2)

where \(\text {conf}(B_b)\) and \(\text {acc}(B_b)\) are the average confidence and accuracy of the examples in bin \(B_b\) and \(\text {overconf}(B_b) = \max \left( 0, \text {conf}(B_b) - \text {acc}(B_b)\right) .\)
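For concreteness, the following Python sketch computes both metrics from the predictions' confidence values; the equal-width binning and the function name are illustrative and may differ from the exact evaluation code used in our experiments.

import numpy as np

def ece_and_oe(confidences, correct, n_bins=15):
    """Compute Expected Calibration Error (Eq. 1) and Overconfidence Error (Eq. 2).

    confidences: array of predicted-class probabilities, shape (N,)
    correct:     boolean array, True where the prediction matches the label, shape (N,)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    N = len(confidences)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, oe = 0.0, 0.0
    for b in range(n_bins):
        # assign each prediction to a confidence bin B_b
        in_bin = (confidences > edges[b]) & (confidences <= edges[b + 1])
        if not in_bin.any():
            continue
        conf_b = confidences[in_bin].mean()     # conf(B_b)
        acc_b = correct[in_bin].mean()          # acc(B_b)
        weight = in_bin.sum() / N               # |B_b| / N
        ece += weight * abs(conf_b - acc_b)
        oe += weight * conf_b * max(0.0, conf_b - acc_b)
    return ece, oe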

4 Methodology

Having introduced the necessary concepts in the previous section, we now focus on addressing the following hypothesis: by incorporating concepts from stochastic geometry to build a relational method applied on top of the embeddings produced by a pre-trained neural network, we can reduce the overconfidence of the predictions, when compared to the original fully-connected + softmax predictions, in particular for adversarial attacks stronger than those used in training.

We seek to leverage the representational power of the neural network by extracting embeddings of the set of clean validation images and using them to build a graph-based estimator, under the assumption that similar images will have their embeddings mapped close to each other.

We denote the neural network architecture by \(f(x) = h(z(x))\), where z(x) is the output after all the convolutional layers (a d dimensional vector), and \(h(\cdot )\) is the final dense layer and softmax. We replace \(h(\cdot )\) with a graph-based estimator constructed based on the distance between the embeddings z(x) of validation examples. This graph is then used in a label diffusion algorithm to obtain an estimation of \(P(Y \mid X)\) for validation images, with reduced overconfidence. At test time, our predictions are based on this new estimation of the label distribution for the relevant validation images (i.e. validation images whose embeddings are close to the test image).
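To make the notation concrete, the following PyTorch sketch shows how such embeddings \(z(x)\) can be extracted by dropping the final dense layer \(h(\cdot )\). The torchvision ResNet-18 and the checkpoint path are illustrative assumptions; the CIFAR-10 ResNet-18 used in our experiments may differ in its early layers, and any layer producing a d-dimensional embedding can play the role of \(z(\cdot )\).

import torch
import torch.nn as nn
from torchvision.models import resnet18

# Pre-trained (e.g. adversarially trained) classifier f(x) = h(z(x)).
model = resnet18(num_classes=10)
# model.load_state_dict(torch.load("adv_trained_resnet18.pt"))  # hypothetical checkpoint path
model.eval()

# z(.): everything up to (and including) global average pooling; h(.): the final dense layer.
z = nn.Sequential(*list(model.children())[:-1], nn.Flatten())

@torch.no_grad()
def embed(images):
    """Map a batch of images to their embeddings z(x)."""
    return z(images)

# Example: embeddings for a batch of CIFAR-10-sized images (d = 512 for ResNet-18).
emb = embed(torch.randn(8, 3, 32, 32))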

4.1 Building a graph from embeddings of validation examples

As our method is based on proximity of embeddings of the images, we have found that the smoothness of \(z(x)\) has a direct impact on the accuracy of our defense. See Sect. B.2 for a discussion of limitations of this assumption. For this reason, we extensively employ adversarially trained embeddings in our experiments, and argue that any improvement that can produce more robust embeddings can be combined with our approach to improve its robustness.

We build a graph from the embeddings of (clean) validation examples, by treating each validation image as a node in our graph and connecting the images whose embeddings are within a radius \(r\) of each other (illustrated in Fig. 2). In what follows, we denote by \({\mathcal {D}}{^{(\textrm{vl})}} = \{(x_i, y_i)\}_{i=1}^{N^{(\text {vl})}}\) the set of \(N^{(\text {vl})}\) validation images and their labels. First, for each image \(x_i\), we identify all other neighbor images within the radius r:

$$\begin{aligned} {\mathcal {N}}_r\left( z(x_i)\right) = \{j \in [N^{(\text {vl})}]: \Vert z(x_j) - z(x_i)\Vert \le r\}, \end{aligned}$$
(3)

as shown in Fig. 2a. We then add the edges \((i, j)\) connecting that node with each of its neighbors in \({\mathcal {N}}_r\left( z(x_i)\right)\), as seen in Fig. 2b.

The choice of the radius r impacts the connectivity of the graph we build, which in turn affects the label diffusion we will perform on this graph. As the radius increases, the graph becomes more connected, as seen in Fig. 2c, d. We tried two strategies to pick the radius, based on either ensuring a connected graph or ensuring a minimum edge density (the fraction of all possible edges included in the graph). We observed similar performance across both strategies. For completeness, in Sect. 5.3 we discuss the results obtained with each strategy.
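The construction of Eq. (3) can be sketched as follows in Python; the edge-density rule shown here is one of the two radius-selection strategies described above, the Euclidean norm is used for illustration, and the helper names are ours.

import numpy as np
from scipy.spatial.distance import cdist

def build_radius_graph(Z, r):
    """Adjacency matrix of the graph: connect i and j iff ||z(x_i) - z(x_j)|| <= r (Eq. 3)."""
    D = cdist(Z, Z)                      # pairwise distances between validation embeddings
    A = (D <= r).astype(float)
    np.fill_diagonal(A, 0.0)             # no self-loops
    return A

def radius_for_edge_density(Z, density=0.15):
    """Pick r so that roughly `density` of all possible edges are present."""
    D = cdist(Z, Z)
    upper = D[np.triu_indices(len(Z), k=1)]
    return np.quantile(upper, density)

# Z_val: (N_vl, d) matrix of clean validation embeddings, e.g. produced by embed() above.
Z_val = np.random.randn(1000, 512)       # placeholder embeddings for illustration
r = radius_for_edge_density(Z_val, density=0.15)
A = build_radius_graph(Z_val, r)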

4.2 Diffusion of labels via Gossip algorithm

To obtain a smoother and less overconfident estimation of \(P(Y |X)\), we diffuse the (one-hot encoded) true label signal through the graph, so that, for each example, we obtain a new distribution that incorporates the label information from nearby examples as well. We will denote by \(p(k |x_i; r)\) this new estimation of \(P(Y = k \mid X = x_i)\), for the validation example \(x_i\), based on the graph of radius r.

The final outcome of this label diffusion, however, needs to fulfill two important desirable properties:

  1.

    For each validation example, \(p(\cdot |x_i; r)\) must describe a valid probability distribution over the classes:

    $$\begin{aligned} \sum _{k = 1}^C p(k \mid x_i; r) = 1. \end{aligned}$$
  2.

    The diffusion process must not change the marginal distribution of labels:

    $$\begin{aligned} \frac{1}{N^{(\text {vl})}}\sum _{i=1}^{N^{(\text {vl})}}p(k |x_i; r) = \frac{1}{N^{(\text {vl})}}\sum _{i=1}^{N^{(\text {vl})}} \mathbbm {1}\!\!\left( y_i = k\right) . \end{aligned}$$

To achieve these properties, we took inspiration from Gossip Algorithms (Boyd et al., 2005), a class of iterative distributed algorithms used for computing averages over networks of sensors or ad-hoc and peer-to-peer networks. At the center of a gossip algorithm is the idea that, at a given iteration, each node exchanges information with its neighbors and replaces its current value with an average of its own value and those received from the neighbors. These algorithms generally focus on reaching a point of equilibrium, but we instead focus on the transient phase, performing a finite number of exchanges so that the influence of an image remains local.

The information being exchanged is the estimated label distribution \(p(\cdot \mid x_i; r)\), which we initialize at time \(t=0\) with the one-hot encoding of the true label. Then, at each iteration, the nodes update \(p\), forming the following dynamic process:

$$\begin{aligned} p^{(t)}(k \mid x_i; r)&= \sum _{j \in {\mathcal {N}}_r\left( z(x_i)\right) } \left[ \Pi _r\right] _{ij} p^{(t-1)}(k \mid x_j; r), \end{aligned}$$

with \(p^{(0)}(k \mid x_i; r) = \mathbbm {1}\!\!\left( y_i = k\right)\), and with \(\left[ \Pi _r\right] _{ij}\) denoting the weight that node i gives to the information coming from j. Section 4.2.1 describes how to construct \(\Pi _r \in {\mathbb {R}}^{N^{(\text {vl})}\times N^{(\text {vl})}}\) from the graph structure.

In a more general sense, if we stack the vectors \(p(\cdot \mid x_i; r)\) into a matrix \(p({\mathcal {D}}{^{(\textrm{vl})}}, r)\), we can describe the final values after \(M\) rounds as the following matrix multiplication:

$$\begin{aligned} p^{(M)}({\mathcal {D}}{^{(\textrm{vl})}}, r) = \Pi _r^M {{\varvec{Y}}}_{OH}, \end{aligned}$$
(4)

where \({{\varvec{Y}}}_{OH} \in {\mathbb {R}}^{N^{(\text {vl})} \times C}\) is the matrix built by stacking the one-hot encodings of the labels of each node of our graph. As before, the matrix \(\Pi _r\) describes, at each round, how node i incorporates the information coming from its neighbor j. With the matrix description of Eq. (4), we can implement the diffusion with matrix operations, with time complexity \({\mathcal {O}}(MN^{\text {vl}}C)\), which can be accelerated with a GPU. Moreover, because of its distributed origin, the method scales to large datasets.

If we choose a large enough value of M in Eq. (4), the diffusion converges to the average of the labels in the dataset (the marginal distribution \(P(Y)\)). While this would be perfectly calibrated, having the same (prior) distribution for all validation points is undesirable. Instead, we stop the gossip algorithm before convergence, allowing for some diffusion of the label information through the nearby examples. The number of rounds M, which we treat as a parameter to be tuned, gives us control over how much we want to smooth the label information over the dataset.
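A minimal Python sketch of this truncated diffusion (Eq. (4)) is given below, assuming the diffusion matrix \(\Pi _r\) of Sect. 4.2.1 has already been built; the function name is illustrative.

import numpy as np

def gossip_diffusion(Pi, y, n_classes, M):
    """Truncated gossip diffusion of Eq. (4): p^(M) = Pi^M Y_OH.

    Pi: (N_vl, N_vl) doubly-stochastic diffusion matrix (Sect. 4.2.1)
    y:  (N_vl,) integer labels of the validation examples
    M:  number of gossip rounds (stopped well before convergence)
    """
    P = np.eye(n_classes)[y]   # Y_OH: one-hot encoding, p^(0)(k | x_i) = 1(y_i = k)
    for _ in range(M):
        # each node averages its neighbors' label distributions; this dense product costs
        # O(N_vl^2 * C) per round, while a sparse Pi matches the complexity quoted in the text
        P = Pi @ P
    return P                   # rows are p^(M)(. | x_i; r)

# p_val = gossip_diffusion(Pi_r, y_val, n_classes=10, M=50)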

4.2.1 Construction of the diffusion matrix \(\Pi _r\)

An integral part of the method is the matrix \(\Pi _r\), which controls how the label diffusion is done and how fast it converges. As per Eq. (4), this matrix controls how each node averages the value of \(p\) coming from the other nodes, at each round. One required property of \(\Pi _r\) is for it to be doubly-stochastic, i.e., \(\Pi _r\)’s rows and columns both sum to one. This constraint ensures that the total probability that a given label y is predicted matches the frequency of label y in the validation data. Using a doubly stochastic matrix also guarantees the two desired properties described in the previous section.
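To see why, note that (by induction over the rounds, using \(\sum _{k} p^{(t-1)}(k \mid x_j; r) = 1\)) rows summing to one preserve property 1, while columns summing to one preserve property 2:

$$\begin{aligned} \sum _{k=1}^C p^{(t)}(k \mid x_i; r)&= \sum _{j} \left[ \Pi _r\right] _{ij} \sum _{k=1}^C p^{(t-1)}(k \mid x_j; r) = \sum _{j} \left[ \Pi _r\right] _{ij} = 1, \\ \frac{1}{N^{(\text {vl})}}\sum _{i=1}^{N^{(\text {vl})}} p^{(t)}(k \mid x_i; r)&= \frac{1}{N^{(\text {vl})}}\sum _{j=1}^{N^{(\text {vl})}} \left( \sum _{i=1}^{N^{(\text {vl})}} \left[ \Pi _r\right] _{ij} \right) p^{(t-1)}(k \mid x_j; r) = \frac{1}{N^{(\text {vl})}}\sum _{j=1}^{N^{(\text {vl})}} p^{(t-1)}(k \mid x_j; r). \end{aligned}$$

Starting from the one-hot initialization \(p^{(0)}\), both properties therefore hold at every round t, in particular at \(t = M\).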

A natural choice is to use the adjacency matrix, as it already encodes the relationship between the validation examples. To make it a doubly-stochastic matrix, we apply the Sinkhorn-Knopp algorithm (Sinkhorn & Knopp, 1967), which iteratively normalizes the rows and columns of the matrix. The advantage of this approach is that the exchange of labels happens only between neighbors in the graph. The time complexity of this step is \({\mathcal {O}}(kn^2)\) where k is the number of iterations and n is the size of the graph (usually \(k \ll n\)). Since it is implemented with matrix operations, this step can also be accelerated by using a GPU.

We also experimented with an alternative strategy to construct \(\Pi _r\) using Personalized PageRank, as a way to offload some of the label diffusion to the PageRank computation. In Sect. 5.3 we compare this PageRank approach with the simpler approach of applying Sinkhorn-Knopp to the adjacency matrix. In practice, both performed similarly.

We also note that this construction places excessive emphasis on the information coming from the neighbors, and too little on the original signal of the node itself. To counter this effect, we add a final step of reinforcing the diagonal entries of the produced doubly-stochastic matrix \(\Pi _r^\prime\) by mixing it with the identity matrix: \(\Pi _r = \rho I + (1 - \rho ) \Pi _r^\prime\). The lazy diffusion parameter \(0 \le \rho < 1\) controls how much we trust the original label of the node itself versus how much we rely on the gossip diffusion process, and is tuned as a hyperparameter.
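A compact Python sketch of this construction is shown below; the fixed number of Sinkhorn-Knopp iterations is an illustrative simplification (the result is only approximately doubly stochastic after finitely many iterations), and the mixture with the identity remains doubly stochastic because convex combinations of doubly-stochastic matrices are doubly stochastic.

import numpy as np

def sinkhorn_knopp(A, n_iter=100):
    """Alternately normalize rows and columns of the adjacency matrix so that it becomes
    (approximately) doubly stochastic (Sinkhorn & Knopp, 1967). Assumes no isolated nodes;
    zero entries stay zero, so labels are exchanged only between neighbors."""
    P = A.astype(float)
    for _ in range(n_iter):
        P /= P.sum(axis=1, keepdims=True)   # make rows sum to one
        P /= P.sum(axis=0, keepdims=True)   # make columns sum to one
    return P

def lazy_diffusion_matrix(A, rho=0.5):
    """Final diffusion matrix: Pi_r = rho * I + (1 - rho) * Pi_r' (lazy diffusion)."""
    Pi_prime = sinkhorn_knopp(A)
    return rho * np.eye(len(A)) + (1.0 - rho) * Pi_prime

# Pi_r = lazy_diffusion_matrix(A, rho=0.3)   # A from the radius-graph construction above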

4.3 Making predictions at test time

The graph building and gossip diffusion described above need to be performed only once. After computing \(p(k \mid x_i; r)\) for all the validation examples, the prediction for a test example x is much simpler: first, we identify which validation points would be neighbors of x in the graph, \({\mathcal {N}}_r\left( z(x)\right)\). If this set is empty, we simply output the marginal \(P(Y)\).

When the set of neighbors \({\mathcal {N}}_r\left( z(x)\right)\) is not empty, we output the average of their estimated distributions:

$$\begin{aligned} {\hat{p}}_\text {Avg}(\cdot \mid x; r) = \frac{1}{ \vert {\mathcal {N}}_r\left( z(x)\right) \vert } \sum _{i \in {\mathcal {N}}_r\left( z(x)\right) } p^{(M)}(\cdot \mid x_i; r). \end{aligned}$$

Alternatively, we also investigate correcting a potential finite-sample bias: if the test image x is clean, z(x) is more likely to fall close to high-degree nodes in the graph (since high-degree nodes lie in denser regions of \(P(X)\)). In this case, we apply the Horvitz-Thompson estimator (Horvitz & Thompson, 1952), which corrects for estimator biases known up to a constant:

$$\begin{aligned} {\hat{p}}_\text {HT}(\cdot \mid x; r) = \left( \sum _{j \in {\mathcal {N}}_r\left( z(x)\right) }\frac{1}{d_j} \right) ^{-1} \sum _{i \in {\mathcal {N}}_r\left( z(x)\right) } \frac{p^{(M)}(\cdot \mid x_i; r)}{d_i}, \end{aligned}$$

where \(d_i\) is the degree of node i. This version estimates the label distribution of the region defined by the hyper-sphere of radius r around the test point—the relative ratio of degrees among the validation points in this region (the neighbors of the test point) reflects their sampling probability.

The time complexity of making predictions at test time (given the pre-computed graph and the embeddings of test images) is bounded by the cost of computing the distance to validation points, which is \({\mathcal {O}}(N^{\text {vl}}d)\) and can be done in parallel. In our experiments, making predictions for a test set of 8000 images of size \(3 \times 32 \times 32\) takes less than a second on an NVidia GeForce 1080 Ti GPU, comparable to other defenses we used.
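The test-time procedure can be summarized by the following Python sketch, covering both the simple average and the Horvitz-Thompson variant; as above, the function signature is illustrative.

import numpy as np

def predict(z_test, Z_val, p_diffused, r, marginal, use_ht=False, degrees=None):
    """Prediction for one test embedding z(x) (Sect. 4.3).

    z_test:     (d,) embedding of the test image
    Z_val:      (N_vl, d) validation embeddings
    p_diffused: (N_vl, C) diffused label distributions p^(M)(. | x_i; r)
    marginal:   (C,) marginal label distribution P(Y), returned when no neighbor is found
    degrees:    (N_vl,) node degrees, required for the Horvitz-Thompson variant
    """
    dists = np.linalg.norm(Z_val - z_test, axis=1)
    neighbors = np.where(dists <= r)[0]              # N_r(z(x))
    if len(neighbors) == 0:
        return marginal                              # refuse to be confident far from validation data
    if not use_ht:
        return p_diffused[neighbors].mean(axis=0)    # p_hat_Avg: simple average
    w = 1.0 / degrees[neighbors]                     # inverse-degree weights
    return (w[:, None] * p_diffused[neighbors]).sum(axis=0) / w.sum()   # p_hat_HT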

4.4 Estimator consistency, Gilbert graphs, and Palm calculus

In what follows, we prove the consistency of our estimator in Eq. (4). We also show that the graph built in Sect. 4.1 is a Gilbert graph (Penrose, 2003), whose connectivity properties follow those of continuum percolation, with the main result over a compact set \(A \subset {{\mathcal {X}}}\) given by Baccelli and Błaszczyszyn (2009).

In order to prove this property, we first introduce a few concepts from stochastic geometry (see Daley and Vere-Jones (2007) for a reference). Proposition 1 below proves that our validation image embedding dataset \(\{(z(x_i),y_i)\}_{i=1}^{N^\text {(vl)}}\) can be described as \(N^\text {(vl)}\) samples of a Marked Poisson Point Process (MPPP), where the embedding z(X) has an associated mark Y (the image’s class).

Proposition 1

If the \({\mathcal {D}}^\text {(vl)}\) and \({\mathcal {D}}^\text {(te)}\) datasets are MPPPs, and the embedding function \(Z:{\mathbb {R}}^d \rightarrow {\mathbb {R}}^{d'}\) is injective, then the embedding datasets \({\mathcal {D}}_Z^{(vl)}\) and \({\mathcal {D}}_Z^{(te)}\) are also MPPPs. Finally, if the derivative z(x) exists around the infinitesimal ball \(\lim _{r \rightarrow 0} B_r(z(x))\), the number of neighbors \({\mathcal {N}}_r\left( X\right)\) in the \(B_r(z(X))\) ball, \(X \in {\mathcal {D}}^\text {(vl)} \cup {\mathcal {D}}^\text {(te)}\), is Poisson distributed with intensity proportional to \(\lambda (X)|\det \frac{\partial z}{\partial x}|\), where \(\lambda (X)\) is the spatial distribution density of a clean image X in the validation data.

The proof is in the Appendix. It then follows directly that our graph is a Gilbert graph.

Another important property is the distribution of embeddings in the neighbors \({\mathcal {N}}_r\left( z(x_i)\right)\) of validation image \(x_i\), \(i=1,\ldots ,N^\text {(vl)}\), which is a Palm distribution. A Palm distribution (Palm, 1943) of a point process is the conditional distribution of the point process given one of its points \(z_0\) at the origin, denoted \(P^0_{z_0}\). The proof of Proposition 1 (in Appendix) also shows that the Palm distribution \(P_{z_0}^0\) of the clean image embeddings observed around a validation example embedding \(z_0\) is the true distribution of clean image embeddings as if the validation point were not there.

Finally, we use these results to show that the estimator in Eq. (4) is consistent.

Proposition 2

Let \(p(y \mid z)\) be uniformly continuous in z, that is, \(\forall \epsilon > 0, \; \exists \; \delta > 0\) such that \(\forall z_1, z_2\) with \(\Vert z_1 - z_2\Vert \le \delta\), \(\vert p(y \mid z_1) - p(y \mid z_2) \vert \le \epsilon\). Then, \(p^{(M)}\) in Eq. (4) converges to \(p(Y \mid x_i)\) for a sufficiently large number of validation examples \(N^\text {(vl)} \rightarrow \infty\) and an appropriate choice of radius \(r > 0\) in Eq. (3) and matrix power \(M \ge 1\) in Eq. (4).

The proof is in the Appendix.

Why we cannot construct the graph with training data

The consistency result no longer holds if we build our graph with the training data in lieu of the validation data. The neighboring image embeddings of a training example may be geometrically distorted (i.e., the Palm and true distributions can differ), since the embeddings would no longer form an MPPP (because \(z(\cdot )\) depends on the training data). In summary, we should never use training data in lieu of validation data to build our graph.

4.5 Choosing hyperparameters

For a given graph and corresponding diffusion matrix, we tune the two remaining parameters of our gossip approach: the number of rounds M and the lazy diffusion parameter \(\rho\). We considered values of M in the range [10, 200] and \(\rho\) in (0, 1), in a grid search. We evaluated the configurations under a PGD attack with \(\epsilon ={}^{8}\!\big /\!_{255}\), removed dominated configurations using a Pareto frontier, and selected the best remaining one using a utility function that balances accuracy, calibration, and the cost of performing too many rounds. A more extensive discussion can be found in the Appendix.
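The selection procedure can be sketched as follows in Python; the grid values, the evaluation helper evaluate_defense, and the utility function below are illustrative assumptions (see the Appendix for the full discussion).

import itertools

def pareto_front(points):
    """Indices of configurations not dominated by another
    (higher accuracy, lower ECE, and fewer rounds are all better)."""
    front = []
    for i, (acc_i, ece_i, m_i) in enumerate(points):
        dominated = any(
            acc_j >= acc_i and ece_j <= ece_i and m_j <= m_i
            and (acc_j > acc_i or ece_j < ece_i or m_j < m_i)
            for j, (acc_j, ece_j, m_j) in enumerate(points) if j != i
        )
        if not dominated:
            front.append(i)
    return front

# Grid inside the stated ranges (the exact grid is not specified here).
configs = list(itertools.product([10, 25, 50, 100, 200], [0.1, 0.3, 0.5, 0.7, 0.9]))
# evaluate_defense(M, rho) is a hypothetical helper returning (accuracy, ECE) under PGD eps = 8/255.
points = [(*evaluate_defense(M, rho), M) for M, rho in configs]
best = max(pareto_front(points),
           key=lambda i: points[i][0] - points[i][1] - 1e-3 * points[i][2])  # illustrative utility
M_star, rho_star = configs[best]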

5 Results

5.1 Experimental setup

Dataset and architecture: We follow an experimental setup similar to Ma̧dry et al. (2018), on image classification on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009), using the ResNet-18 (He et al., 2016) architecture, adversarially trained using the procedure from Ma̧dry et al. (2018) with \(\epsilon ={}^{8}\!\big /\!_{255}\), which we denote by AT. We employ a 5-fold cross-validation scheme, described in detail in the Appendix. The models were trained on NVidia GeForce 1080 Ti GPUs with PyTorch (Paszke et al., 2019), with a learning rate of \(10^{-2}\) and a batch size of 64. Our code will be made available after acceptance. More details about the training and the resources used can be found in the Appendix.

Attacks: We evaluate the defenses under a collection of \(L^\infty\) attacks: the Fast Gradient Sign Method (FGSM) (Szegedy et al., 2013), Projected Gradient Descent (PGD) (Ma̧dry et al., 2018), and the method of Carlini and Wagner (2017a), using the implementations from the Adversarial Robustness Toolbox (ART) (Nicolae et al., 2018). For each image, we select the attacked version that was most successful at fooling the defense being evaluated. We vary the maximum allowed perturbation between \({}^{2}\!\big /\!_{255}\) and \({}^{24}\!\big /\!_{255}\) in steps of \({}^{2}\!\big /\!_{255}\). The second type of attack we use is LaVAN (Karmon et al., 2018), an attack where only a small square patch of the image is changed, without any restriction on how much noise is added to it. We varied the side of the square patch from 2 to 16 pixels, in multiples of 2. More details can be found in the Appendix.

Note that, during training, the model is only exposed to a PGD attack with \(\epsilon ={}^{8}\!\big /\!_{255}\), but at evaluation time it is presented not only with stronger versions of the same attack, but also with different types of attack altogether. We aim to prepare our defense for the worst case, reducing its overconfidence even for such stronger and unforeseen attacks.
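For reference, the core \(L^\infty\) PGD update can be sketched as follows; our experiments use the ART implementations, so this minimal PyTorch version is only illustrative.

import torch

def pgd_linf(model, x, y, eps, step, n_steps, loss_fn=torch.nn.functional.cross_entropy):
    """Minimal L_infinity PGD (Ma̧dry et al., 2018) sketch with a random start."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(n_steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()                 # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)           # project back onto the eps-ball
            x_adv = x_adv.clamp(0, 1)                          # keep a valid image
    return x_adv.detach()

# x_adv = pgd_linf(model, images, labels, eps=8/255, step=2/255, n_steps=20)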

Other defenses: Besides the baseline adversarially trained (AT) model (with the original dense layer + softmax), we also compare against Temperature Scaling (Guo et al., 2017), which improves calibration by scaling the logit values with a temperature. We also tried a variation of temperature scaling where we set the temperature arbitrarily high (infinite), which we denote by TS\(_\infty\), to investigate a simple alternative that forces the predicted probabilities to resemble a uniform distribution.
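Temperature scaling amounts to a one-line change of the softmax; the snippet below is a minimal illustration, with TS\(_\infty\) approximated by a very large temperature.

import torch

def temperature_scale(logits, T):
    """Temperature Scaling (Guo et al., 2017): soften the softmax by dividing the logits by T."""
    return torch.softmax(logits / T, dim=-1)

# probs = temperature_scale(model_logits, T=2.0)        # TS with a tuned temperature
# probs_inf = temperature_scale(model_logits, T=1e6)    # approximates the TS_inf baseline (near uniform)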

We also tested two defenses that use the (adversarially trained) embeddings in k-Nearest Neighbor approaches: a simple version that averages the one-hot encoded labels of the neighbors, and Deep k-NN (Papernot & McDaniel, 2018), which combines k-NN with conformal predictions that incorporate the dissimilarity of the test example to validation and training examples. Due to the excessive memory requirements of Deep k-NN when using the outputs of all layers (42 GiB in our setting), we use only the outputs of a single layer, the same one we use as embeddings in our method, similar to the approach of Sitawarin and Wagner (2020). Finally, we also include softRmax (Wang & Loog, 2022), a polynomial version of softmax aimed at reducing overconfidence.

5.2 Discussion of the experiments

Both types of attack we use have a simple knob to change the strength of the attack: the allowed \(L^\infty\) norm of the distortion (\(\epsilon\)) or the size of the patch. We used these knobs to study how the defenses cope with a wide range of attack strengths, in particular strong attacks that the models did not see in training. We present our main results in Figs. 3 and 4, and in Sect. 5.3 we include further results using variations of the graph and diffusion matrix, which perform similarly. We use the ECE and OE metrics (Eqs. (1) and (2)) as well as accuracy to evaluate the defenses, with a focus on overconfidence, as it is the major cause of miscalibration under moderate and severe attacks, and a great cause of concern in many applications.

To guide our analysis, we identify three regimes of attack strength: mild, moderate, and severe. We emphasize that these thresholds are just visual aids to simplify our exposition. We do not claim to have a general definition of attack strength, but rather selected thresholds that help describe the patterns we observe.

Fig. 3

Performance of defenses over varying strength of standard \(L^\infty\) attacks (worst case between PGD, FGSM and CW). Our approach is particularly effective at reducing overconfidence for stronger attacks

Fig. 4

Performance of defenses over varying sizes of patch attack (LaVAN). While other defenses show worse calibration as patches get larger, our approaches display the lowest OE and ECE

5.2.1 Mild attacks

We call mild attacks those with \(\epsilon < {}^{8}\!\big /\!_{255}\) (the value used for adversarial training) or patch sizes smaller than 7, which we mark in the left region of Figs. 3 and 4. We observe that only Randomized Smoothing is significantly overconfident in this range, while most other methods display their best ECE values here. Since our method has the goal of reducing overconfidence (which is confirmed by achieving 0% OE), most of the miscalibration we present in this range is due to underconfidence, as we tend to give more conservative predictions than other methods. This comes at the cost of a drop in accuracy on clean images, but as the attacks get stronger, we see accuracy similar to the other defenses.

The only defenses that reach OE as low as ours are temperature scaling with infinite temperature (AT + TS\(_\infty\) in the plots) and softRmax. However, TS\(_\infty\) gives the most uninformative answers (akin to random guessing), which means it also has the highest miscalibration precisely for that reason: around 80% ECE even for clean images, 40% worse than our approach. softRmax suffers from the same problem (it is too underconfident).

5.2.2 Moderate attacks

In this range, defined by \({}^{8}\!\big /\!_{255} < \epsilon \le {}^{18}\!\big /\!_{255}\) and patch sizes between 7 and 11 pixels, we start seeing attacks stronger than what was used during training, but not so strong as to render the defenses worse than random guessing. For the patch attacks, it is in this range that adversarially trained models (and the temperature scaling variants) start to underperform compared to the others, as we see in Fig. 4, where our accuracy is almost 4 times higher, and in Fig. 3, where OE and ECE are about 30% lower for our approach.

All defenses become more overconfident in this regime, except for our approach, TS\(_\infty\), and softRmax, as seen in Figs. 3 and 4. In particular, Randomized Smoothing is the most overconfident, reaching close to 100% OE for \(L^\infty\) attacks (cf. Fig. 3). Most defenses display between 6 and 8 times worse OE when compared with our gossip approach.

While this increase in overconfidence is reflected in the increase of ECE for most defenses, the progressive strengthening of the attacks starts to justify the reduction in overconfidence achieved by our approach, which is why we reach the lowest ECE values towards the end of this range for \(L^\infty\) attacks (Fig. 3), making our method the best calibrated in the moderate-to-severe attack regime.

5.2.3 Severe attacks

In this last range (\(\epsilon > {}^{18}\!\big /\!_{255}\) and patch sizes larger than 11 pixels), the attacks are so strong that most defenses' accuracy becomes worse than that of a random guess, as seen in Fig. 3. For LaVAN (Fig. 4), we see the original AT defense and its temperature scaling variants drop to worse-than-random accuracy, while all others sustain better accuracy, with Randomized Smoothing the most accurate, but also the most overconfident.

The calibration in this range, however, tells an opposite story. While a true random guess would show 0% ECE, most defenses reach upwards of 70% ECE in this range. The benefits of our approach are even more remarkable here, as we obtain much better ECE and OE than the other defenses. In particular, for \(L^\infty\) attacks (Fig. 3), our approach is up to 50% less overconfident than the competing approaches. Other defenses are most confident precisely when they are making the most mistakes, while the gossip approaches show a significant reduction in overconfidence everywhere (see the Appendix for statistical significance tests).

While for moderate and severe attacks our method's accuracy is very much on par with other defenses, we observe a drop in accuracy for the mildest attacks, which we attribute to our intent to be conservative in our predictions. The parameters of our gossip algorithm can be tuned to improve accuracy at the expense of worse calibration and a smaller reduction of overconfidence. If this trade-off is important for the user, our method provides the knobs to tune it.

We also observe that variants using the Horvitz-Thompson estimator show better accuracy but worse calibration for milder attacks, and similar performance to their averaging counterparts for stronger attacks. Under moderate to severe attacks, all our variants remain considerably better calibrated than the other defenses.

Lastly, we hypothesize that the localized nature of the patch attacks does not cause large shifts in the embedding, which is why all defenses based on the geometric proximity of the embeddings (the k-NN methods and our gossip approaches) cope better than the other defenses. We also note that our reduction in overconfidence is still significant, even when compared to these k-NN-inspired defenses.

5.3 Further evaluation

In this section, we present further results evaluating the performance of the proposed defense under different strategies for choosing the radius in Eq. (3) and for building the diffusion matrix \(\Pi _r\). As noted before, we observe similar results across all configurations, indicating that our proposed method is fairly robust to variations in the graph used in the diffusion process.

For each type of attack (\(L^\infty\) and LaVAN), we tested two strategies for choosing r: (1) choosing r to ensure a minimum edge density (we targeted a density of either 15% or 25% of all possible edges); and (2) ensuring the resulting graph has a Single Connected Component (with and without increasing the computed radius by 10%, which we call "excess radius"). This gives us a total of 4 settings. In each set of plots we include results using both versions of the diffusion matrix (based on the adjacency matrix or on Personalized PageRank), as well as with and without the Horvitz-Thompson estimator.

Figures 5 and 6 display the results for \(L^\infty\) attacks, while Figs. 7 and 8 contain the results for the LaVAN attack. The main behavior we observe is similar across all the tested graphs. Within each strategy, the version with increased connectivity (density of 25% or excess radius of 10%) tends to show slightly lower accuracy for clean images. The increased connectivity makes the diffusion converge faster, leading to lower confidence levels, which could explain the lower accuracy for clean images. This drop, however, is not observed for stronger attacks, and the improvements in calibration are equally remarkable, no matter which graph was used. Similarly, the versions with the PPR diffusion matrix tend to show lower accuracy under mild attacks but also achieve lower calibration error for stronger attacks.

Fig. 5

Results for \(L^\infty\) attacks, with graph constructed using Single Connected Component strategy. Top is without excess radius, bottom is with 10% excess radius

Fig. 6

Results for \(L^\infty\) attacks, with graph constructed using Edge Density strategy. Top is with target density of 15%, bottom is 25% density

Fig. 7

Results for LaVAN attack, with graph constructed using Single Connected Component strategy. Top is without excess radius, bottom is with 10% excess radius

Fig. 8

Results for LaVAN attack, with graph constructed using Edge Density strategy. Top is with target density of 15%, bottom is 25% density

6 Conclusions

In this work, we showed that existing defenses against adversarial examples, when challenged by progressively stronger attacks, become more confident about their predictions as they make more mistakes (overconfidence). We then developed a new graph-based framework that replaces the last fully-connected layer + softmax with a new estimator, whose predictions rely on a gossip graph algorithm that diffuses the labels of clean validation examples (nodes) to other such nodes in the graph, collectively reducing overconfidence while preserving the correct marginal distribution of labels after the diffusion.

We showed the effectiveness of our framework in drastically reducing overconfidence when compared against existing defenses, particularly under stronger attacks. Our proposed framework can be easily combined with any existing (and future) adversarial defenses, as long as these produce embeddings which we can use, and we also have access to validation data not used to learn the embedding function. We believe the use of collective graph-based predictions to defend against adversarial examples, and its understanding through stochastic geometry, brings new tools to tackle key challenges in adversarial machine learning.

Limitations: Given an embedding, the diffusion process over the Gilbert graph improves upon the existing softmax, temperature scaling, and k-NN outputs. However, the approach proposed in this work depends on how well existing embeddings can cluster images of the same class under adversarial attacks.