1 Introduction

Nowadays, countless applications rely on some form of anomaly detection, for instance: protection against cyber-attacks, fraud detection, fault prognosis in advanced machinery, identification of artifacts in medical imaging, prevention of critical situations in healthcare, or the control of cyber-physical and industrial systems.

The meaning of “anomaly” may vary from one application to another, but it usually covers three fundamental semantic senses: abnormality (implying rareness), outlier (or noise), and/or novelty (Ruff et al. 2021). Among the multiple ways to address the challenge of detecting anomalies, unsupervised learning algorithms are often used, more specifically those belonging to the family of outlier detection. In their elementary operation, these algorithms aim at labeling instances as either “normal” or “anomalous”, usually by evaluating their similarity to other instances or to a collection of past instances, which are commonly processed as data points. Even if the final result is expressed as a dichotomy, most algorithms internally estimate a score or probability regarding the outlierness of each data point, as well as a ranking that establishes which data point is the most outlying and which one is the least. In many cases, it is the application itself that requires such a complete outcome; that is, not only knowing whether a data point is an outlier, but to what extent it is an outlier.

However, in some cases it would not even be appropriate to refer to the internal scores as “outlierness scores”, since their dynamic characteristics often cannot be correctly interpreted in terms of “outlier quality” (Kriegel et al. 2011). To give an example: if the score of point i is twice that of point j, that rarely means that its outlierness is twice as large. Moreover, it seems logical to assume that if two algorithms output equivalent binary solutions (or even the same set of ranks), it is because their scores have similar dynamics. Our study refutes this hypothesis.

We can anticipate these intuitions with the simple example of Fig. 1, in which ten points are clustered around the origin, while two outliers appear at (1,1) and (2,2). Here, different outlier detection algorithms (introduced in Sect. 2.3) provide different scores. The outlier at (2,2) gets the maximum score from all algorithms, which translates to 1 if normalized. However, the difference to the normalized score of the outlier at (1,1) ranges from 0 (ABOD, HBOS) to 0.51 (SDO, K-NN). Some of the algorithms give scores that are proportional to the distance to the data bulk (K-NN, OCSVM, SDO), others suggest estimations closer to probabilities (iForest, LOF, GLOSH, HBOS), and others neither of the two (ABOD).

Fig. 1 Simple example of a dataset with ten inliers and two outliers. The table shows the linearly normalized outlier scores of eight different algorithms (introduced in Sect. 2.3). Note the differences in the scores, particularly for the outliers at (1,1) and (2,2)
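
A minimal sketch of this toy experiment is given below, assuming the PyOD library (which provides implementations of several of the detectors introduced in Sect. 2.3); parameter choices are illustrative and the exact normalized values depend on parameters and library versions, so the table in Fig. 1 remains the reference.

```python
# A minimal sketch of the Fig. 1 experiment: ten inliers near the origin plus
# outliers at (1,1) and (2,2), scored by two detectors from PyOD (assumed
# installed). Exact values depend on parameters and library versions.
import numpy as np
from pyod.models.knn import KNN
from pyod.models.abod import ABOD

rng = np.random.default_rng(0)
inliers = rng.normal(scale=0.05, size=(10, 2))          # bulk around the origin
X = np.vstack([inliers, [[1.0, 1.0], [2.0, 2.0]]])      # add the two outliers

for det in (KNN(n_neighbors=3), ABOD(n_neighbors=5)):
    s = det.fit(X).decision_scores_                      # raw outlierness scores
    s = (s - s.min()) / (s.max() - s.min())              # linear min-max normalization
    print(type(det).__name__, np.round(s[-2:], 2))       # scores of (1,1) and (2,2)
```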

Understanding the dynamic characteristics of the scores means realizing how their distributions depend on both the data and the algorithm used. This can help to infer how the difference between the scores of two points translates into the feature space.

The dynamic characteristics of the scores used by outlier detection algorithms have not received much attention in the literature, even though they have a key impact on the final results, and knowing them in relation to the nature of the data determines the success and adequacy of a method for the final application. In other words, in addition to facilitating a better interpretation of the scores per se, understanding score dynamics allows us to select the most appropriate algorithm for a particular real-life scenario, refine the criteria for setting thresholds between inliers and outliers, and anticipate how changes in data geometries will affect the scores. In this paper, we study this topic and explore whether algorithms exhibit dynamics of their own, independent of the data, and how variations in specific data aspects affect such dynamics. Among the different types of anomalies categorized in the literature (Chandola et al. 2009), we focus on the most common type, i.e., point anomalies, either global or local. Our study uncovers particular algorithm idiosyncrasies and shows how each algorithm is sensitive to a different pattern of perturbation. The main contributions of this work are:

  • We provide an extensive and critical overview of methods for evaluating outlier detection algorithms beyond accuracy.

  • We propose a set of novel measures to abstract and capture the dynamic behavior of outlier scores. These measures serve to gain a deeper understanding of algorithms and, at the same time, provide information to understand and diagnose the data space as a whole.

  • We study and reveal through sensitivity analysis how established anomaly detection algorithms are perturbed and change their behavior when facing categorized variations in data geometries.

  • We analyze the effect of Gaussian normalization on score dynamics. Gaussian normalization is established in the field as an agnostic method for generating probabilistically interpretable scores (Kriegel et al. 2011).

  • Our experiments disclose knowledge about the studied algorithms that facilitates the selection of the most appropriate alternative according to the data and feature space characteristics.

The rest of the paper is organized as follows: Sect. 2 brings up relevant works in the field of outlier detection calibration and introduces the studied algorithms. Section 3 discusses traditional evaluations in terms of accuracy and presents new methods for assessing the dynamic characteristics of performances. Section 4 conducts different sensitivity analyses and discusses the algorithms’ responses. Conclusions are presented in Sect. 5.

2 Anomaly detection

2.1 Comparison of algorithms

Numerous studies have discussed and compared existing methods for anomaly detection. They commonly align with two main trends: either exploring the particularities of the problem, its cases, extent and details in different fields of application (e.g., Chandola et al. 2009; Zimek and Filzmoser 2018; Ruff et al. 2021; Boukerche et al. 2020; Blázquez-García et al. 2021), or conducting exhaustive comparisons between a wide variety of algorithms and datasets (e.g., Campos et al. 2016; Falcão et al. 2019; Domingues et al. 2018; Steinbuss and Böhm 2021; Han et al. 2022).

Some of these surveys also focus on specific fields, for instance, network traffic (Ahmed et al. 2016), urban traffic (Djenouri and Zimek 2018), or medical applications (Fernando et al. 2021). However, in the related literature, evaluations and comparisons are made almost exclusively in terms of accuracy, i.e., to what extent the algorithms under test are able to match the ideal partition expressed by a certain Ground Truth. Observations on why some algorithms fail in certain circumstances, what their weaknesses are, or what properties the data must have in order for the algorithms to work properly are rather given in papers presenting or describing new methods. Hence, our work is framed within studies that delve deeper into space analysis and aim to reveal the bidirectional effect and interdependency between algorithms and data. Beyond acknowledging that the performance of outlier detection methods depends on the dataset characteristics and preprocessing steps (Kandanaarachchi et al. 2020), we show how data peculiarities differently affect the dynamic behavior of analysis methods.

2.2 Interpretation of anomaly scores

The dynamic characteristics of algorithms are scarcely discussed. Here, the work by Kriegel et al. (2011) is a notable exception, where it is observed that “(s)tate-of-the-art outlier scores are not standardized and often hard to interpret. Scores of objects from different data sets and even scores of objects from the same data set cannot be compared”. To mitigate this inconvenience, Kriegel et al. design a method to regularize and normalize different scores “aiming (i) at an increased contrast between outlier scores vs. inlier scores and (ii) at deriving a rough probability value for being an outlier or not”. This initiative aligns with the work of Gao and Tan (2006), who propose transforming scores into probability estimates in order to facilitate the construction of ensembles. As empirical proof of this concept, see, e.g., the implementation of ensembles using scores transformed into probabilities, as well as the evaluation of their suitability (Schubert et al. 2012; Bauder and Khoshgoftaar 2017). Regularized and normalized scores are also used to determine the weight or importance of an object for internal evaluation measures (Marques et al. 2020, 2022).

However, rather than measuring the dynamic behavior of the scores, these studies attempt to overcome their variability or arbitrariness. Only recently have measures to evaluate the stability of outlierness rankings against perturbations been proposed (Perini et al. 2020a), thus providing a quantifiable indicator of algorithms’ stability. The same research group, using the transformation-into-probabilities approach, also proposes a method to estimate the confidence of algorithms in their outlierness scores (Perini et al. 2020b). The concept of generating a confidence score for an instance being anomalous is termed calibrated anomaly detection (Menon and Williamson 2018). Averaging confidence scores is then a way to provide global assessments of the algorithm’s confidence when processing a dataset. However, although very useful, these emerging methods are still limited when it comes to explaining the dynamics of outlier scoring algorithms.

2.3 Algorithms for outlier scoring

Algorithms perform different strategies to assign outlierness scores to data points. Behind these strategies lie the principles that shape their dynamic behavior. We can highlight some decisive aspects.

  • Problem space The most traditional and straightforward approach to scoring outliers is by calculating point distances in the input space, therefore deriving an outlierness score that is based on how close or far a given data point is with respect to other data points. It seems logical to think that measurements in the input space—usually Euclidean distances—would be the natural way to face the problem; however, not all feature spaces are consistent with a geometrical space interpretation, in addition to the complications inherent to distance measurements in high-dimensional spaces, i.e., the curse of dimensionality (Zimek et al. 2012; Thirey and Hickman 2015). Hence, some methods transform, project or simplify the input space into a different space that presents some kind of advantage or avoids some of the mentioned disadvantages.

  • Core properties Mixed data sets, hierarchically structured data or an increase in dimensionality are reasons why point distances may be perceived as a suboptimal property on which to solely base outlierness; alternatives are estimates of density, angles between points, similarity measurements, ad hoc functions or properties intrinsic to the space transformation employed. On the other hand, the advantage of distance-based measurements is their easy interpretability and proportionality when mapped to outlierness.

  • Perspective Another key aspect when deciding if a data point is an outlier is to determine: compared to what? Most methods take an overall perspective and consider the whole dataset as a reference, whereas a few algorithms focus on the immediate proximity of the data point instead. The second option assumes that the anomaly is in its environment, while the first assumes that it is in the global picture. Both nuances are valid and, in practical cases, not always easy to distinguish (Schubert et al. 2014). Here, it is worth mentioning contextual (or conditional) anomalies, i.e., instances that are anomalous with respect to their usual context, although detecting them requires the existence of features defined as “contextual” (Li and van Leeuwen 2023) or data with a temporal dimension (Hartl et al. 2024). In our work we focus on (local and global) point anomalies. However, some of the algorithms studied are also capable of capturing other types of anomalies.

  • Use of models The last aspect to mention is whether the algorithm builds a model to perform the estimations or, on the contrary, computes scores directly over the data points themselves. In the second case, using all points is often restricted to the theoretical description of the algorithm, while practical implementations tend to reduce operations to the subset of the k-nearest neighbors (usually, but not only, for computational reasons). Using models aims to speed up implementation phases, but also to obtain an abstraction that can help explain the data and extract further knowledge.

To investigate the dynamic characteristics of outlier detection scores, we select eight algorithms according to their popularity or novelty, but also looking for operating principles with distinct bases. The four aspects exposed above serve as guidance to gather representatives (briefly contrasted in Table 1): ABOD, HBOS, iForest, K-NN, LOF, OCSVM, SDO, and GLOSH. In the following we provide an intuitive introduction, while referring the interested reader to the original sources for further details (a minimal instantiation sketch with off-the-shelf libraries is given after the list):

Table 1 Key features of the selected algorithms
  • ABOD (angle-based outlier detection) (Kriegel et al. 2008) establishes outlierness scores according to the variance in the set of angles between a given target data point and any other pair of data points in the dataset. ABOD is intended to overcome the limitations of distance-based methods when dealing with high-dimensional spaces. Given that the computation of the set of angles between each data point and every pair of data points exhibits a computational complexity of \(O(n^3)\), in practical implementations, as well as in our experiments, Fast ABOD (FABOD) is employed, which computes angles solely over the k nearest neighbors.

  • HBOS (histogram-based outlier detection) (Goldstein and Dengel 2012) assumes independence among features. This allows the use of histograms to evaluate the contribution of each feature of a data point to the final outlierness score. HBOS accepts the possible errors and imprecision derived from presuming feature independence, thus sacrificing accuracy to achieve linear runtimes.

  • iForest (isolation forest) (Liu et al. 2008) works by recursively splitting the feature space until leaving alone (i.e., isolating) a given data point. Outlierness scores are established based on the number of splits required. Compared to inliers, outliers are naturally isolated and, therefore, they are usually found after only a few splits. In iForest, data points are processed in a tree structure based on randomly selected features, thus data are projected into the problem space drawn by the tree. In this random subspace, the tree size needed for isolation can be seen as an estimate of the density.

  • K-NN (kth-nearest neighbors) (Ramaswamy et al. 2000) measures the score of a given point as the distance to its kth-nearest neighbor. This is perhaps the most elementary and straightforward representative option of distance-based algorithms. In Euclidean space, the kth-nearest neighbor distance can also be seen as a density estimate (Zimek and Filzmoser 2018).

  • LOF (local outlier factor) (Breunig et al. 2000) is designed to score outliers according to the isolation of points in their immediate proximity (i.e., locality), in contrast to most methods that take a global view to establish outlierness. LOF operates by comparing the density around a point relative to the density of the points closest to it.

  • OCSVM (one class support vector machine) (Schölkopf et al. 2001) can be seen as a regular SVM that tries to find the hyperplane that better separates data from the origin, therefore adjusting the location of the hyperplane to optimally enclose the data bulk. The outlierness score is evaluated based on the distance and location of a given data point with regard to the hyperplane.

  • SDO (sparse data observers) (Iglesias et al. 2018) is a distance-based method that uses low-density data models. Such models retain normality to speed up the processing and to perform a flexible and application-dependent definition of outlierness, for instance, allowing the detection of anomalous clusters. Outlierness is computed as the average distance to the x-closest points in the model.

  • GLOSH (global–local outlier scores from hierarchies) (Campello et al. 2015) summarizes the input space by means of a density-based hierarchical clustering structure; later, it uses the closest cluster of a given point and the referential density of such cluster to estimate point outlierness. This way, GLOSH is able to overcome the global–local dichotomy and take both approaches into account at the same time.
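
As a non-authoritative illustration of how such a heterogeneous selection can be assembled in practice, the sketch below maps the algorithms to common Python implementations: PyOD for ABOD, HBOS, iForest, K-NN, LOF and OCSVM, and the hdbscan library (whose outlier_scores_ attribute implements GLOSH); SDO is only referenced, since we rely on the authors’ implementation. Parameters are placeholders, not the settings of Table 3.

```python
# Hypothetical mapping of the selected algorithms to common Python libraries.
# PyOD and hdbscan are assumed installed; SDO is left as a placeholder.
import hdbscan
from pyod.models.abod import ABOD
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from pyod.models.ocsvm import OCSVM

def outlier_scores(X):
    """Return a dict of raw outlierness scores (higher = more outlying)."""
    detectors = {
        "ABOD": ABOD(n_neighbors=20),        # fast ABOD: angles over k neighbors
        "HBOS": HBOS(),                      # per-feature histograms
        "iForest": IForest(random_state=0),  # isolation trees
        "K-NN": KNN(n_neighbors=20),         # distance to the kth neighbor
        "LOF": LOF(n_neighbors=20),          # local density ratio
        "OCSVM": OCSVM(),                    # hyperplane separating data from the origin
    }
    scores = {name: det.fit(X).decision_scores_ for name, det in detectors.items()}
    # GLOSH scores come from the density-based hierarchy built by HDBSCAN.
    scores["GLOSH"] = hdbscan.HDBSCAN(min_cluster_size=15).fit(X).outlier_scores_
    # SDO (average distance to the x-closest observers of a low-density model)
    # relies on the authors' implementation and is not shown here.
    return scores
```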

In this selection, the reader may miss methods based on deep learning, e.g., Mignone et al. (2024). Note that deep anomaly detection has important limitations regarding the amount of data needed for adequate training, interpretability issues and computational resources. Likewise, in purely unsupervised environments such methods are usually outperformed by shallow approaches (Ruff et al. 2021), being almost exclusively recommended for large datasets where feature extraction is highly complex. For these reasons, we leave the evaluation of methods based on deep learning for future studies. On the other hand, outlier detection ensembles (Zimek et al. 2014) have not been included either, since our objective is to study the behavior of algorithms in depth and independently (before and after regularized normalization).

3 Evaluation methods

In this section, we first describe the indices normally used to evaluate the performance of algorithms for the detection of anomalies or outliers. We also show internal evaluation alternatives for cases in which a Ground Truth partition is not available. We later present recent approaches for quantifying stability and confidence. In the last part, we propose a series of measurements to assess the dynamic behavior of algorithms and models. Before introducing all these concepts, we establish a general notation that is congruent with the formulations shown in the rest of the paper.

3.1 General notation

We speak generically of a dataset D of N data points. By processing D, a given algorithm creates an internal model H:

$$\begin{aligned} H = {\text{model}}(D) \end{aligned}$$
(1)

where “model” is a function intrinsic to each algorithm. We abuse the term to also include algorithms that do not use models; thus, H identifies the algorithm response when tackling the specific dataset D (henceforth, “model”, “algorithm” and “H” are used interchangeably). Hence, H outputs a set of outlierness scores, which we term S after sorting them in ascending order and applying a scaling or normalization that renders them into the [0,1] value range. Such preprocessing will be useful for defining and constraining S-curves and other dynamic indices (Sect. 3.5) and for making them comparable to each other. Thus, if \(S = \{s_1,\ldots , s_i,\ldots , s_N\}\):

$$\begin{aligned} \begin{aligned} s_1&= 0, \quad s_N = 1 \\ s_j&\ge s_i \quad \forall \, j>i \end{aligned} \end{aligned}$$
(2)

The arguments of S are, in turn, the outlierness ranks of D with the order reversed. Therefore, for a random point x in D:

$$\begin{aligned} {\text{if score}}(x)=s_i \Longrightarrow {\text{rank}}(x) = N-i+1 \end{aligned}$$
(3)

where “score” and “rank” are functions of H. While ranks are unique for each point, note that several points may have the same score. In such cases, ranks are set arbitrarily between the points involved to break the tie.

By using ranks and scores, D can finally be split into inliers and outliers by fixing a threshold on S or a contamination factor that selects the top-n outliers (with n being an external parameter). When a Ground Truth is available, an external partition offers a benchmark separation of D into a set of outliers O and a set of inliers I, \(D = O \cup I\), also creating \(S_O\) and \(S_I\), score subsets of S.
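
For concreteness, the notation above can be reproduced in a few lines of NumPy; this is only a sketch under the assumption that higher raw scores mean “more outlying”, not the implementation used in our experiments.

```python
# A minimal sketch of the notation in Sect. 3.1, assuming raw scores where
# higher values mean "more outlying".
import numpy as np

def sorted_normalized_scores(raw):
    """S: scores sorted in ascending order and min-max scaled to [0, 1] (Eq. 2)."""
    s = np.sort(np.asarray(raw, dtype=float))
    return (s - s[0]) / (s[-1] - s[0])

def ranks(raw):
    """Outlierness ranks: the highest raw score gets rank 1 (cf. Eq. 3).
    Ties are broken arbitrarily by the underlying argsort."""
    order = np.argsort(-np.asarray(raw))        # indices in descending score order
    r = np.empty(len(raw), dtype=int)
    r[order] = np.arange(1, len(raw) + 1)
    return r

def top_n_outliers(raw, n):
    """Binary split via a contamination factor that selects the top-n outliers."""
    return ranks(raw) <= n
```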

3.2 Accuracy indices

Some indices are commonly used in the literature to assess algorithm performance in terms of accuracy; a computational sketch is given at the end of this subsection. The most popular are:

  • Precision at n (P@n) (Craswell and Robertson 2009) is the proportion of the top-n ranked data points that are actual outliers. It can be expressed as follows:

    $$\begin{aligned} P@n = \frac{|\{x \in O | {\text{rank}}(x) \le n\}|}{n} \end{aligned}$$
    (4)

    Most often, the value of n is set equal to the number of outliers in the Ground Truth. P@n is a straightforward evaluation method, but limited by its strong dependence on n.

  • Average Precision (AP) (Zhang and Zhang 2009) computes Precision for different values of n, from 1 to |O|, giving the mean as the final result:

    $$\begin{aligned} {\text{AP}} = \frac{1}{|O|} \sum _{x \in O} P@ {\text{rank}}(x) \end{aligned}$$
    (5)

    Unlike P@n, AP is a more sophisticated measure that controls the consistency of scores in the top n-ranks. This means that, while P@n does not differentiate between two solutions that place an inlier x in rank\((x)=1\) or in rank\((x)=n\), AP will favor the option with the inlier in the lower position, i.e., rank\((x)=n\).

    However, rather than assessing the congruence of S values globally, both P@n and AP evaluate whether S gives the highest scores to a (usually small) set of outstanding targeted data points. Thus, both coefficients can be unsatisfactory in cases with few outliers and some salient inliers that get low ranks, severely penalizing the evaluation and even making it useless or unrepresentative.

  • The Area under the ROC curve (ROC-AUC) (Hanley and McNeil 1982) is perhaps the best known method to evaluate the performance of classifiers. The ROC curve is obtained by plotting the rate of outliers among the top-n ranks vs. the rate of inliers among the top-n ranks for all possible values of n. The ROC-AUC value can be interpreted as the probability of a random pair formed by an outlier and an inlier being properly ordered in S. Therefore:

    $$\begin{aligned} \begin{aligned} {\text{ROC-AUC}} = \underset{a \in S_O,\, b \in S_I}{{\text{mean}}}&{\left\{ \begin{array}{ll} 1 & \quad {if} \; a > b \\ 0.5 & \quad {if} \; a = b \\ 0 & \quad {if} \; a < b \end{array}\right. } \end{aligned} \end{aligned}$$
    (6)

    While they are all valid for estimating the performance of the algorithms, P@n, AP and ROC-AUC present notable differences in the interpretation of the results that can lead to considerable discrepancies in the evaluations. The example in Table 2 highlights this fact.

  • Adjusted indices Although P@n, AP and ROC-AUC have in common that they take values in the interval [0,1], the interpretation of P@n and AP is strongly conditioned by the rate of outliers. In order to enable the comparison of solutions with different class proportions and performances across different datasets, Hubert and Arabie propose the adjustment for chance (Hubert and Arabie 1985), an agnostic method that can be applied to both P@n and AP (Campos et al. 2016). For instance,

    $$\begin{aligned} {\text{Adjusted AP}} = \frac{{\text{AP}} - |O|/N}{1 - |O|/N} \end{aligned}$$
    (7)

    If \(n \le |O|\), the Adjusted P@n is obtained in the same way, replacing AP with P@n in Eq. 7 (if \(n > |O|\), the ‘1’ must also be replaced by the maximum achievable value |O|/n). Adjusted indices should be used to report absolute evaluations detached from dataset peculiarities. Since the ROC-AUC already offers a probabilistic interpretation, it does not require adjustment.

Table 2 Examples of the differences in common measures for the evaluation of anomaly detection
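
The following sketch illustrates how the indices above can be computed from raw outlierness scores and a binary Ground Truth; ROC-AUC is delegated to scikit-learn, while P@n, AP and the Adjusted AP follow Eqs. 4, 5 and 7 directly. It is a reference sketch, not the evaluation code of our experiments.

```python
# A sketch of the accuracy indices in Sect. 3.2, assuming `raw` are outlierness
# scores (higher = more outlying) and `y` is a binary Ground Truth (1 = outlier).
import numpy as np
from sklearn.metrics import roc_auc_score

def precision_at_n(raw, y, n):
    """Eq. 4: fraction of actual outliers among the top-n ranked points."""
    top_n = np.argsort(-np.asarray(raw))[:n]
    return np.sum(np.asarray(y)[top_n]) / n

def average_precision(raw, y):
    """Eq. 5: mean of P@rank(x) over all Ground-Truth outliers x."""
    raw, y = np.asarray(raw), np.asarray(y)
    out_ranks = np.where(y[np.argsort(-raw)] == 1)[0] + 1   # ranks of the outliers
    return np.mean([precision_at_n(raw, y, r) for r in out_ranks])

def adjusted_ap(raw, y):
    """Eq. 7: AP adjusted for chance."""
    rate = np.mean(y)                                       # |O| / N
    return (average_precision(raw, y) - rate) / (1 - rate)

def roc_auc(raw, y):
    """ROC-AUC, probabilistically interpretable (Eq. 6)."""
    return roc_auc_score(y, raw)
```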

3.3 Internal validation

Compared to the other major branch of unsupervised learning, i.e., clustering, where a wide variety of internal validation methods exist [see, for example, the overview and comparison of Liu et al. (2010)], the intrinsic characteristics of anomaly scoring and ranking make internal validation difficult, and it may seem that validation—if Ground Truth is not available—is only achievable by consensus among algorithms.

There are, however, a few proposals. IREOS (Marques et al. 2020) (Internal, Relative Evaluation of Outlier Solutions) is designed to compare different outlier detection solutions and rank them in terms of quality. Intuitively, IREOS works by calculating a point separability by means of a nonlinear maximum margin classifier. Hence, the classifier will establish separation boundaries between inliers and outliers. Algorithm scores are transformed into probabilities and multiplied by point-separability as weights. To the extent that algorithm scores and point-separability match, the solution will get a higher evaluation than if they disagree.

Goix (2016) proposes two indices for comparing anomaly detection performances, which are based on Excess-Mass and Mass-Volume curves. And, more recently, SIREOS (Marques et al. 2022) has been proposed as an alternative to IREOS in which the separability estimation is replaced by a similarity measure that can vary depending on the nature of the data and the application scenario.

We refer the reader to the original papers for further discussion on internal validation. Since the experiments in Sect. 4 use synthetic data and submit algorithms to specific and highly controlled perturbations, external indices (Sect. 3.2) are here a more straightforward way to reflect accuracy.

3.4 Stability and confidence

As mentioned in Sect. 2.2, a first approach to explaining algorithms’ dynamics is outlined by Perini et al. in terms of model stability (Perini et al. 2020a) and score confidence (Perini et al. 2020b); a scaffold for the model-wise aggregations used in our experiments is sketched at the end of this subsection.

  • Stability (\(\mathcal{T}\)) The idea behind this measure is that, when processing a certain test dataset, a stable model should show minimal variation if retrained with different samples of the training data. Note that this measure implies a separation between training and test data, as well as the existence of models. In algorithms that do not use models, the model is assumed to be contained in the samples that are used as a reference during retraining (see Sect. 3.1).

    Therefore, for calculating \(\mathcal{T}\), D is separated into \(D_{ train }\) and \(D_{ test }\) splits. Later, R subsets are drawn from \(D_{ train }\) without replacement: \(\{D_{1},\ldots ,D_i,\ldots ,D_{R}\}, D_{i} \subseteq D_{ train }\). By retraining the algorithm with these subsets, a set of models is obtained: \(\{H_{1},\ldots ,H_i,\ldots ,H_{R}\}\). From here:

    $$\begin{aligned} \begin{aligned} \mathcal{T}&= \frac{1}{|D_{ test }|} \displaystyle \sum _{x \in D_{ test }} \mathcal{T}_{x} \\ \mathcal{T}_{x}&= f\left( \{H_{1},\ldots ,H_{R}\}, \underset{\{H_{1},\ldots ,H_{R}\}}{{\text{rank}}}(x) \right) \end{aligned} \end{aligned}$$
    (8)

    This means that each data point x in \(D_{ test }\) is evaluated by R different models. \(\mathcal{T}_{x}\) (i.e., the Stability of point x) is set according to how its rank varies depending on the model used; in Eq. 8 this is represented by f(.). We refer the reader to the original source for further details about this calculation (Perini et al. 2020a). Finally, the Stability of the overall model is calculated as the average of the data point Stabilities in \(D_{ test }\). Implicit in this idea of stability is accepting:

    $$\begin{aligned} {D}_{ train } \sim D_i \quad \forall D_{i} \subseteq D_{ train } \end{aligned}$$
    (9)
  • Confidence (\(\mathcal{C}\)) Perini et al. define element-wise Confidence as the probability of a data point x to be outlier or inlier given its score s, the dataset size N, the number of expected outliers |O|, and the probability of such scoring \(p_s\). Therefore:

    $$\begin{aligned} \begin{aligned} \mathcal{C}_{x}&= {\left\{ \begin{array}{ll} \mathbb {P}(x \in O | s, N, |O|, p_s) & \quad {if} \; x \in O \\ 1 - \mathbb {P}(x \in O | s, N, |O|, p_s) & \quad {if} \; x \in I \\ \end{array}\right. } \end{aligned} \end{aligned}$$
    (10)

    where \(p_s\) is derived by using Bayes’ theorem to estimate the distribution of outlierness scores. Knowing \(p_s\), \(\mathcal{C}_{x}\) is calculated as the probability (or 1 minus the probability) of obtaining exactly |O| successes in N independent Bernoulli trials. Again, we refer the reader to the original source for further details about the measurement (Perini et al. 2020b).

    Assessing the Confidence level of a point being an outlier or an inlier has obvious practical applications. Since most points are expected to show high Confidence, in our experiments we estimate Confidence model-wise by setting a reference at the 0.01 quantile of the complete set of element-wise Confidences. Therefore, with \(Q_p(\cdot)\) being the quantile function at value p:

    $$\begin{aligned} \mathcal{C} = Q_{0.01}\left( \{\mathcal{C}_{1},\ldots ,\mathcal{C}_{x},\ldots ,\mathcal{C}_{N}\} \right) \end{aligned}$$
    (11)
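
The scaffold below illustrates how these measures are aggregated in our setting: it collects the rank matrix that feeds Eq. 8 and applies the quantile aggregation of Eq. 11. The element-wise computations (f(.) in Eq. 8 and the confidence estimation behind Eq. 10) follow Perini et al. (2020a, b) and are deliberately not reproduced; `fit_score` and `elementwise_confidences` are hypothetical, user-supplied inputs.

```python
# A scaffold for the model-wise aggregations in Sect. 3.4; the element-wise
# computations follow Perini et al. (2020a, b) and must be supplied by the user.
import numpy as np

def rank_matrix(fit_score, D_train, D_test, R=10, subsample=0.8, seed=0):
    """Collect, for every test point, its outlierness rank under R models
    retrained on subsets of D_train drawn without replacement (cf. Eq. 8)."""
    rng = np.random.default_rng(seed)
    n_sub = int(subsample * len(D_train))
    ranks = np.empty((R, len(D_test)), dtype=int)
    for i in range(R):
        idx = rng.choice(len(D_train), size=n_sub, replace=False)
        scores = fit_score(D_train[idx], D_test)          # user-supplied: fit, then score D_test
        ranks[i] = np.argsort(np.argsort(-scores)) + 1    # rank 1 = most outlying
    return ranks                                           # input to f(.) in Eq. 8

def model_confidence(elementwise_confidences, p=0.01):
    """Eq. 11: model-wise Confidence as the p-quantile of element-wise values."""
    return np.quantile(elementwise_confidences, p)
```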

3.5 Estimating algorithm dynamics

We propose the following complementary measurements to assess algorithm dynamics (a reference code sketch is given at the end of this subsection):

  • S-curves A way of estimating the dynamics of algorithms is by comparing plots of normalized and sorted scores. Either by comparing the response of different algorithms on the same dataset, or of the same algorithm across different datasets, it is possible to notice algorithms’ dynamic trends. We show examples of such plots in the (a) and (c) subplots of Figs. 4, 5, 6, 8, 9, 11 and 12. In such figures, when plotting scores, we use an x-axis defined on the [0, 1] interval and project the i-index onto the x-axis as \(x_i = i/N\).

  • Discriminant power (DP) Empirical distributions of outlierness scores are commonly unimodal, showing a higher concentration of values for inliers, whereas outliers are less numerous and take values spread over a wider range. DP estimates the tendency of S to show outliers and their tendency to assume high values (thus, S is discriminant when DP is high). A proper statistical measure here is the kurtosis (\(\kappa \)), as clarified by Westfall (2014). For an easier interpretation, we define DP by projecting \(\kappa \) with the following transformation:

    $$\begin{aligned} {\text{DP}}= \log (1+\kappa _S) \end{aligned}$$
    (12)

    Following Moors’ interpretation of kurtosis (Moors 1986), DP will tend to be high when most data points are around the mean and some values are far from the mean (i.e., extreme scores), or when many data points are concentrated in the tails of the distribution (i.e., “a measure of dispersion around the two values \(\mu \pm \sigma \)”). On the other hand, DP will tend to 0 for platykurtic distributions (e.g., uniform).

  • Robust coefficients of variation (RCVO and RCVI) The RCV has a similar meaning to the common coefficient of variation (\({\text{CV}} = \frac{\sigma }{\mu }\)), but is more robust to outliers (Arachchige et al. 2022). Robust statistics are more reliable here since we expect skewed distributions and we want to capture representative mass estimations while avoiding perturbations due to outliers among inliers and outliers among outliers. We define:

    $$\begin{aligned} {\text{RCV}}_{I} = \frac{\text{MAD}(S_{I})}{\text{Median}(S_{I})}, \qquad \text{RCV}_{O} = \frac{\text{MAD}(S_{O})}{\text{Median}(S_{O})} \end{aligned}$$
    (13)

    where \(\text{MAD}\) is the median absolute deviation. Unlike DP, RCVO and RCVI are external coefficients since they require the Ground Truth to separate \(S_O\) and \(S_I\). Beyond the dependence of these coefficients on the nature of the data, their comparison across different algorithms conveys a sense of the dispersion in score values (wider or narrower ranges) that each algorithm assigns to inliers and outliers.

  • Coherence (\(\gamma \)) This coefficient estimates the overlap between m different zones defined by the Ground Truth (\(m=2\) for binary classification). \(\gamma = 1\) indicates no overlap, while \(\gamma = 0\) denotes maximal overlap in scores that are expected to be separated. Therefore, if \(S = \bigcup _{j=1}^m S_j\):

    $$\begin{aligned} \begin{aligned} a_j&= \min \left( \mu _{S_{j}} + 2\sigma _{S_{j}}, \mu _{S_{j+1}} + 2\sigma _{S_{j+1}}\right) \\&\quad - \max \left( \mu _{S_{j}} - 2\sigma _{S_{j}}, \mu _{S_{j+1}} - 2\sigma _{S_{j+1}}\right) \\ b_j&= \max \left( \mu _{S_{j}} + 2\sigma _{S_{j}}, \mu _{S_{j+1}} + 2\sigma _{S_{j+1}}\right) \\&\quad - \min \left( \mu _{S_{j}} - 2\sigma _{S_{j}}, \mu _{S_{j+1}} - 2\sigma _{S_{j+1}}\right) \\ c_j&= {\left\{ \begin{array}{ll} a_j/b_j & \quad \forall a_j > 0\\ 0 & \quad \forall a_j \le 0 \end{array}\right. } \\ \gamma&= 1 - \frac{1}{m-1} \sum _{j=1}^{m-1} c_j \end{aligned} \end{aligned}$$
    (14)

    \(\mu \) and \(\sigma \) stand for the mean and standard deviation, respectively; \(a_j\), \(b_j\) and \(c_j\) are ad hoc terms for a clearer reading of Eq. 14. Using \(\mu \pm 2\sigma \) conforms to Chebyshev’s inequality, which ensures that at least 75% of the elements fall within the interval for most distributions (Saw et al. 1984). Whereas we used robust statistics when defining the RCVs to avoid the perturbation of outliers, in this case we aim to include their effect when assessing the potential overlap.

  • Bias (\(\beta \)) shows if the algorithm has a natural tendency to score low or high. We simply estimate the bias as the median of S, therefore:

    $$\begin{aligned} \beta = \text{Median}(S) \end{aligned}$$
    (15)
  • Robustness (\(\varphi \)) Unlike the other coefficients defined in this section, which output a value that evaluates a single set of scores, \(\varphi \) evaluates a set of sets. Therefore, given a set of sorted and normalized sets of scores \(\textbf{S} = \{A,B,\ldots ,Z\}\), we collect q points from each set by using a sampling period of \(\frac{N_J}{q}\), where \(N_J\) is the size of set J. This guarantees that each i-th collected point occupies an equivalent relative position in its respective set. Grouping equivalent points, we obtain q new sets \(X_1,\ldots , X_q\) of length \(|\textbf{S}|\). From here:

    $$\begin{aligned} \begin{aligned} \varphi = 1 - \sqrt{\frac{1}{q} \sum _{i=1}^q \frac{\sigma _{X_i+1}^2}{ \max ({X_i+1}) - \min ({X_i+1}) }} \end{aligned} \end{aligned}$$
    (16)

    Intuitively, \(\varphi \) is the mean similarity obtained by comparing each S-curve in \(\textbf{S}\) with the average S-curve of \(\textbf{S}\). Distances between S-curves are calculated with the NRMSE (normalized root mean square error). A \(+1\) offset is added to \(X_i\) to avoid undesired distortions when \(\mu _{X_i} \sim 0\). If S-curves are identical to each other, \(\varphi = 1\), whereas a set with dissimilar S-curves will decrease \(\varphi \).

    The difference between \(\varphi \) and \(\mathcal{T}\) is that, while \(\mathcal{T}\) evaluates the consistency of the rankings obtained by different models on a dataset never seen (but assumed equivalent to those used to train the models), \(\varphi \) does not look at the rankings nor at the scores of specific points, it only evaluates if the dynamics obtained by the model—i.e., the S-curve—varies among datasets.
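
A reference sketch of the proposed indices is given below for the binary (inlier/outlier) case; it assumes S is the sorted, [0,1]-normalized score vector, adopts the Pearson (non-excess) kurtosis convention for Eq. 12, and omits Robustness (Eq. 16), which operates over sets of S-curves.

```python
# A sketch of the proposed dynamic indices (Eqs. 12-15). S is assumed sorted and
# normalized to [0, 1]; S_I and S_O are the Ground-Truth inlier/outlier subsets.
import numpy as np
from scipy.stats import kurtosis

def discriminant_power(S):
    """Eq. 12: DP = log(1 + kurtosis of S). Pearson kurtosis assumed here."""
    return np.log(1 + kurtosis(S, fisher=False))

def rcv(scores):
    """Eq. 13: robust coefficient of variation, MAD / median (RCV_I or RCV_O)."""
    return np.median(np.abs(scores - np.median(scores))) / np.median(scores)

def bias(S):
    """Eq. 15: Bias as the median of S."""
    return np.median(S)

def coherence(S_I, S_O):
    """Eq. 14 with m = 2: 1 minus the relative overlap of the mu +/- 2*sigma bands."""
    lows = [np.mean(s) - 2 * np.std(s) for s in (S_I, S_O)]
    highs = [np.mean(s) + 2 * np.std(s) for s in (S_I, S_O)]
    overlap = min(highs) - max(lows)        # positive only if the two bands intersect
    span = max(highs) - min(lows)
    c = overlap / span if overlap > 0 else 0.0
    return 1 - c
```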

4 Sensitivity analysis

The objective of the sensitivity analyses is to show how algorithms respond differently and undergo changes in their dynamics when exposed to different types of controlled variations. In some cases, the induced changes should theoretically not imply significant drifts in the value of the scores.

We first describe the experimental setting, then we discuss the dynamic behavior of the algorithms from a global perspective and, finally, we analyze the sensitivity analyses on a case-by-case basis.

Besides this carefully controlled study with synthetic data, in “Appendix 2” we use the same measurements to explore and describe four cases of real data.

4.1 Experimental setup

We test the algorithms introduced in Sect. 2.3. They are set with proper contamination factors and parameters, as shown in Table 3. Algorithms are adjusted to address a broad range of cases (given the experimental conditions), but particularly to solve the baseline dataset. All experiments are available for reuse and reproducibility in a GitHub repository (Iglesias 2023), and through a DOI-citable repository with data and results already generated (Iglesias 2024).

The baseline dataset is a globular two-dimensional set of 1000 data points: 970 inliers and 30 outliers. Inliers are enclosed in a sphere of \(r_{in}=0.1\) (r stands for radius). Outliers are spread between the spheres \(r_{out_l}=0.1\) and \(r_{out_h}=0.4\). We pursue homogeneous outlierness within both the inlier and outlier zones and an abrupt transition between them. To do that, underlying point distributions are uniform with regard to the center of the sphere, as implemented in MDCGen (Iglesias et al. 2019) for creating radial clusters, with the correction by Ojdanić (2019), which first creates unit vectors from Gaussian (instead of uniform) distributions and later multiplies them by module values drawn from uniform distributions (a generation sketch is given after Table 4). The baseline dataset is shown in Fig. 2. Taking this dataset as a reference, we build the following sets by varying a specific property each time (Table 4 summarizes all configurations):

  1. a.

    Number of data points The number of data points increases (up to 82,000), but the outlier ratio remains the same.

  2. b.

    Number of dimensions The number of dimensions increases (up to 83).

  3. c.

    Proportion of outliers The percentage of outliers increases (from 1 to 19%). The sphere that contains outliers is also proportionally expanded to keep a low density.

  4. d.

    Relative outlier/inlier densities Always with the 10% outliers rate, the sphere of inliers is expanded, whereas the sphere of outliers is reduced. Thus, the difference of density between both zones is also reduced.

  5. e.

    Number of density layers Data points are spread in layers with a marked density difference. The first sphere (\(r=0.1\)) encloses inliers (52.6% of data points), while remaining data points occupy a larger sphere (\(r=1.1\)) distributed in a variable number of layers. In these datasets the total number of data points is 10,000 to allow up to 11 layers.

  6. f.

    Number of clusters The generation of the baseline dataset is repeated up to 10 times, each time placing the generated cluster in a distant location of the space so that overlap with other clusters is avoided. The total number of data points increases accordingly.

  7. g.

    Local outliers The sphere that contains outliers disappears and, instead, some inliers are transformed into local outliers by removing their 15 closest neighbors and relocating them in denser areas. Local outliers account from 1 to 19%.

Table 3 Most relevant configuration parameters used
Fig. 2 Baseline dataset used in the experiments. Outliers are shown in orange and inliers in blue

Table 4 Description of dataset configurations for the sensitivity analysis. \(i \in \{0,1,2,3,4,5,6,7,8,9\}\)
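
For reference, the baseline dataset can be approximated with the following sketch (Gaussian unit vectors scaled by uniformly distributed radii); it mimics the radial-cluster generation described above but is not the exact MDCGen/Ojdanić code used for the experiments.

```python
# A sketch of the baseline dataset in Sect. 4.1: 970 inliers inside a sphere of
# radius 0.1 and 30 outliers spread between radii 0.1 and 0.4.
import numpy as np

def radial_cloud(n, r_low, r_high, dim=2, rng=None):
    """Points with Gaussian directions and radii uniform in [r_low, r_high]."""
    rng = rng or np.random.default_rng(0)
    v = rng.normal(size=(n, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)          # Gaussian unit vectors
    radii = rng.uniform(r_low, r_high, size=(n, 1))        # uniform modules
    return v * radii

rng = np.random.default_rng(42)
inliers = radial_cloud(970, 0.0, 0.1, rng=rng)
outliers = radial_cloud(30, 0.1, 0.4, rng=rng)
X = np.vstack([inliers, outliers])
y = np.r_[np.zeros(970, dtype=int), np.ones(30, dtype=int)]   # Ground Truth labels
```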

Figure 3 plots random examples of the 7 variations. In order to evaluate the effects of the normalization on the dynamics, experiments are repeated according to two different setups (a code sketch follows the list):

  1. a.

    By performing linear normalization on the scores, with \(s_i|_{l-norm}\) being the linearly normalized version of the score \(s_i\):

    $$\begin{aligned} s_i|_{l-norm} = \frac{s_i - \min (S)}{\max (S) - \min (S)} \quad \forall s_i \in S \end{aligned}$$
    (17)
  2. b.

    By performing Gaussian normalization, and thus transforming scores into probability estimates. As proposed in Kriegel et al. (2011), this can be achieved by means of the Levenberg-Marquardt method, which is easily calculated with the mean and standard deviation of the scores (\(\mu _S\) and \(\sigma _S\) respectively) and the Gaussian error function erf(). Therefore, with \(s_i|_{g-norm}\) being the Gaussian-normalized version of the score \(s_i\):

    $$\begin{aligned} s_i|_{g-norm} = \max \left\{ 0, erf\left( \frac{s_i - \mu _S}{\sigma _S \sqrt{2}} \right) \right\} \quad \forall s_i \in S \end{aligned}$$
    (18)

    Exceptionally, as recommended in Kriegel et al. (2011), ABOD scores are previously regularized for a better contrast.

    $$\begin{aligned} s_i|_{ABOD-reg} = - \log \left( s_i/\text{max }(S)\right) \quad \forall s_i \in S \end{aligned}$$
    (19)
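
Both setups reduce to a few lines of code; the sketch below follows Eqs. 17-19, with the Gaussian error function taken from SciPy. It is a reference implementation of the formulas above, not necessarily the exact code used in the experiments.

```python
# A sketch of the two normalization setups (Eqs. 17-19), with `raw` a vector of
# raw outlierness scores.
import numpy as np
from scipy.special import erf

def linear_norm(raw):
    """Eq. 17: min-max scaling to [0, 1]."""
    raw = np.asarray(raw, dtype=float)
    return (raw - raw.min()) / (raw.max() - raw.min())

def gaussian_norm(raw):
    """Eq. 18: Gaussian scaling into probability estimates (Kriegel et al. 2011)."""
    raw = np.asarray(raw, dtype=float)
    return np.maximum(0.0, erf((raw - raw.mean()) / (raw.std() * np.sqrt(2))))

def abod_regularize(raw):
    """Eq. 19: logarithmic inversion applied to ABOD scores before Eq. 18."""
    raw = np.asarray(raw, dtype=float)
    return -np.log(raw / raw.max())
```
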
Fig. 3 Examples of datasets in the sensitivity analysis. Except for case (e), outliers are shown in orange and inliers in blue

4.2 Overall performances

Figures 4, 5, 6, 7, 8, 9, 10, 11 and 12 show a collection of S-curves for the studied algorithms, as well as scatter plots with the dynamic indices described in Sect. 3.5, plus the Adjusted AP (AAP) and the ROC-AUC introduced in Sect. 3.2, and the Stability (\(\mathcal{T}\)) and Confidence (\(\mathcal{C}\)) from Sect. 3.4. Additionally, in “Appendix 1” we show tables with the maximum and minimum values corresponding to these figures (Tables 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 and 19).

Fig. 4 Sensitivity analysis on the number of data points. Colors map to i values in Table 4

Fig. 5 Sensitivity analysis on the number of dimensions. Colors map to i values in Table 4

Fig. 6 Sensitivity analysis on the percentage of outliers. Colors map to i values in Table 4

Fig. 7 Outliers (orange) and inliers (blue) in the Effect of Higher Outlier Ratios scenario, \(i=8\) (11% outliers). GLOSH performance compared with Ground Truth

Fig. 8 Sensitivity analysis on the density difference between inliers and outliers. Colors map to i values in Table 4

Fig. 9 Sensitivity analysis on the number of density layers. Colors map to i values in Table 4

Fig. 10 Outlierness scores for OCSVM in the Sensitivity to Multiple Density Layers scenario, \(i=9\) (11 density layers). Comparison between linear and Gaussian normalization

Fig. 11 Sensitivity analysis on the number of clusters. Colors map to i values in Table 4

Fig. 12 Sensitivity analysis on the percentage of local outliers. Colors map to i values in Table 4

Before analyzing the effect of the different tests, it is worthwhile to disclose overall insights and identify characteristic S-curves. We conducted the experiments twice, once for each type of normalization defined (linear and Gaussian), in both cases fitting scores to the [0,1] value range. Note that scores, when linearly normalized, show the pure dynamics of the algorithms. With Gaussian normalization, however, scores are modified to allow a probabilistic interpretation, and the intrinsic dynamics are altered by this transformation. Regardless of these two situations, a first noticeable finding is the correlation between the Robustness index (\(\varphi \)) and the variability of the curves, confirming \(\varphi \) as a suitable method for automatically assessing changes in algorithms’ dynamics in the face of data perturbations.

4.2.1 Dynamics of linearly normalized scores

Based on the characteristic shapes observed in linearly normalized S-curves, we can split algorithms into four groups:

  1. (a)

    Backward L-shape, which draws abrupt transitions (almost right angles) between inliers (majority) and outliers (minority). In this group we find: K-NN, LOF, OCSVM and SDO.

  2. (b)

    J-shape, in which transitions between outliers and inliers scores are softer, describing a rounded curve reminiscent of an exponential function. This is characteristic of GLOSH.

  3. (c)

    Raised J-shape, in which transitions are even softer and inlier scores grow linearly with a considerable slope. Here we find HBOS and iForest.

  4. (d)

    Upside down L-shape, in which the transition is also very abrupt, but high scores are the majority; therefore, there is no overall connection between the difference in scores and the difference in outlier quality of two points taken at random. This behaviour is only shown by ABOD.

Whether the S-curve shows a steeper or smoother evolution correlates with the Discriminant Power and Bias of the algorithm. Algorithms with abrupt dynamics (i.e., prone to L-shapes) are more discriminant (DP) and biased (\(\beta \)) towards extreme values (either 0 or 1). However, the characteristic S-curve is not related to the accuracy performance of the algorithm (AAP, ROC-AUC), nor to the Stability (\(\mathcal{T}\)) and Confidence (\(\mathcal{C}\)), which are more related to the data scenario (yet some algorithms are more prone to take lower \(\mathcal{T}\) and \(\mathcal{C}\)). An S-curve moving away from its characteristic shape does not necessarily imply a deterioration in accuracy (AAP, ROC-AUC), although it is usually indicative of it. Nevertheless, in some cases GLOSH seems to benefit when it breaks away from its expected dynamics. On the other hand, an algorithm exhibiting its natural S-curve does not necessarily imply satisfactory accuracy.

The type of characteristic S-curve does not influence the algorithm’s Coherence (\(\gamma \)), but moving away from it usually implies lower \(\gamma \). The robust coefficients of variation (\(\text{RCV}_{I}\) and \(\text{RCV}_{O}\)) are not related to the type of characteristic S-curve, but are intrinsic to the algorithm; for instance, GLOSH shows the highest variance when scoring inliers (\(\text{RCV}_{I}\)), followed by OCSVM, and later HBOS and iForest, while the lowest \(\text{RCV}_{I}\) is shown by ABOD, followed by LOF, K-NN and SDO. The variance in outlier scores (\(\text{RCV}_{O}\)) is more irregular and scenario dependent, but it is usually high in OCSVM and SDO, and also in LOF (particularly if Gaussian normalized). Whether high or low robust coefficients of variation are desirable is not easy to establish. While high coefficients may indicate a tendency towards lower Coherence, Stability or Confidence, they also indicate, provided accuracy is high, a greater richness and granularity in the scores. Therefore, \(\text{RCV}_{I}\) and \(\text{RCV}_{O}\) should be evaluated alongside other indices.

S-curves of ABOD are the most striking since they show an opposite or inverse form compared to those of other algorithms. While other algorithms are naturally biased towards low scores (\(\beta \approx 0\)), ABOD shows Bias towards high values (\(\beta \approx 1\)). The S-curve of ABOD is also very sharp (square angle) and shows little variation regardless of the dataset. The peculiar shape and characteristics of ABOD are explained by the fact that the extreme values do not correspond to outliers but to inliers. ABOD does not only depend on angles, but weighs the contribution of angles by dividing them by the distance between points. A few inliers located centrally with respect to their k-nearest neighbors, with very close or nearly overlapping neighbors, generate extremely low outlierness values. Such cases appear in all the datasets examined, always causing a similar alteration in the S distribution. Note the repeated constellation of indices that perfectly explains the dynamic behaviour of ABOD when linearly normalized, namely: \(\beta \approx 1\), \(\varphi \approx 1\), RCVO \( \approx 0\), RCVI \(\approx 0\), \(\gamma \approx 0\), AAP \(\approx 1\), ROC-AUC \(\approx 1\). In addition to the Bias towards one (\(\beta \approx 1\)), the high Robustness (\(\varphi \approx 1\)) indicates the habitual presence of such disturbing extreme inliers, which are strongly discriminated (highest DPs). Both low RCVO and RCVI express that most inliers and outliers move in small ranges of values. High ROC-AUC and AAP indicate that ABOD ranks are nevertheless accurate and reliable, but the slight decrease in Stability (\(\mathcal{T}\)) discloses that the strong scores of the extreme inliers tend to be artifacts (or artificially inflated), since they may disappear due to subsampling, hence lowering \(\mathcal{T}\).

Of the algorithms in the backward L-shape group, S-curves and dynamic indices identify K-NN and SDO as the most similar. This can be justified from the perspective that they both work with point distances directly in the input space, the difference being that while K-NN takes the k-nearest neighbors as a reference, SDO uses a model. The profile of methods that calculate distances directly in the input space is properly grasped by the defined indices: generally high Coherence (\(\gamma \)), intermediate-high DP, and medium, similar RCVO and RCVI. These indices indicate smooth dynamics, sufficiently resolute and with progressive transitions between inliers and outliers. The main difference between them is that, while K-NN is more Coherent, SDO is more Discriminant. OCSVM might also be deemed similar to K-NN and SDO, but with larger ranges of values in RCVI and particularly in RCVO, and with different dynamics in the S-curves as a function of the induced changes. The higher RCVO and RCVI capture an increased ability or resolution of OCSVM to perceive subcategories within groups of inliers and outliers.

Together with GLOSH, LOF is the algorithm that shows the greatest sensitivity in the different tests performed, as well as the most affected in terms of accuracy. It is important to clarify here that, except for one sensitivity analysis, outliers in our experimental data are generated to be global and not local. In any case, LOF and GLOSH manifest sharp drops in ROC-AUC and AAP, which can be inferred from the significant drops in Coherence and Stability (\(\gamma \) and \(\mathcal{T}\), respectively). That is, low Coherence together with low Stability is a symptom of low accuracy.

Finally, S-curves from the raised J-shape group show less abrupt growth and less defined bends. This is also captured by the Bias (\(\beta \)) slightly above 0, the lower DP, and the indicative RCVO \( \gg \) RCVI, meaning little-emphasized outliers and increased granularity in inliers.

4.2.2 Dynamics of Gaussian normalized scores

Gaussian normalization induces remarkable variations in the scores. One of the most immediate effects is to bring the Bias (\(\beta \)) to 0 (or close to 0) for all algorithms regardless of the scenario. Possible high Bias values may identify a degradation of accuracy (HBOS, LOF), but not necessarily (ABOD, GLOSH). The equalizing effect of Gaussian normalization tends to reduce the variations in the scores, i.e., it makes scores more similar to each other. This effect can also be observed in the RCVI, which is mostly undefined in all cases due to the large population of inliers now scoring 0 (denominator of Eq. 13, left). The variance reduction is also remarkable in the normalized outlier scores (RCVO). Not surprisingly, in the process of interpreting scores as probabilities, Gaussian normalization forces 0 in all or a large majority of inliers, as well as 1 in the most obvious outliers, leaving the range of intermediate values for a reduced number of data points. Consequently, the Coherence (\(\gamma \)) also becomes more stable and closer to 1 as a general trend. Here, a decrease in Coherence commonly suggests low values of accuracy. Gaussian normalization also affects the Discriminant Power (DP), which drops similarly for all algorithms. In ABOD, the DP drop is severe due to the removal of the impact of the extreme inliers; thus, the interpretation of DP becomes consistent with the rest of the algorithms and places ABOD in the group with low DP. Due to its natural S-curve, ABOD yields far fewer strictly-0 inliers than other algorithms when Gaussian normalization is used, which causes low Confidence (\(\mathcal{C}\)). On the other hand, Gaussian normalization usually has no (or a negligible) effect on the Stability (\(\mathcal{T}\)) and Confidence (\(\mathcal{C}\)) measurements compared to linear normalization.

When focusing on S-curves, we can abstract a generic profile by recognizing the following parts: a first segment of scores with value equal to 0 (strong inliers), a second part where scores increase linearly (or in almost linear curves), and a last part with scores equal to 1 preceded by an abrupt jump (strong outliers). From here we can distinguish two groups:

  1. (a)

    Early rise, which accounts for ABOD, HBOS and iForest.

  2. (b)

    Late rise, which accounts for K-NN, LOF, OCSVM and SDO. GLOSH is also included in this group, but it is slightly different as it does not show a last abrupt jump before the 1-plateau of strongest outliers.

The main difference in these groups is that the algorithms with early rise generally have lower DP than those with late rise.

In short, Gaussian normalization “democratizes” algorithms by making their scores and dynamics considerably more similar and comparable (less dependent on the algorithm used), although at the cost of a notable reduction in the variance of scores and of suppressing dynamic nuances that might be used for diagnosing the data context and the algorithm performance (see “Appendix 2”). In fact, transforming raw scores into probability estimates does not guarantee that these new probability scores are fully well defined or reliable either (Röchner et al. 2024). Therefore, assuming the loss of information and a possible distortion due to the statistical modeling, Gaussian normalization is advisable in most practical cases, since it largely decouples the score from the algorithm. The user should keep in mind that, compared to “late rise” algorithms, “early rise” algorithms are more prone to assign higher scores to uncertain or intermediate points, thus making scores less discriminant overall.

4.3 Effect of larger datasets

The first sensitivity analysis consists of increasing the number of data points while keeping the percentage of outliers and the zones of the input space where inliers and outliers appear invariant. Among the cases studied, this is the one with the smallest effect on score dynamics.

Figure 4 shows S-curves and plots with comparisons of the studied indices and algorithms. All S-curves remain largely stable when the number of data points increases. Considering all sensitivity analyses together, the model-based algorithms (OCSVM, SDO, iForest) and the algorithms involving sampling (iForest, ABOD) are those that show the largest Robustness (\(\varphi \)) and, therefore, visibly less variable S-curves. Since the distribution of the dataset does not change, the model produced by the algorithms should be similar regardless of the increase in the dataset size. For example, the boundaries produced by OCSVM should be placed in similar positions, consequently producing similar S-curves. Similarly, the algorithms involving sampling should produce similar S-curves within similar computational time, since these algorithms operate over the same sample size.

The algorithms most affected by the increase in dataset size are those that do not produce models: K-NN, HBOS, LOF, and GLOSH. The global algorithms among them, however, are less affected in terms of accuracy: while K-NN shows no effect on accuracy, HBOS presents an increase in AAP as the dataset size grows.

The local algorithms (LOF and GLOSH) are those most affected by the increase in dataset size. This is clearly perceived in the decrease of the ROC-AUC, AAP and Confidence (\(\mathcal{C}\)) measurements. The deterioration observed in LOF and GLOSH as the dataset size increases can be attributed to the increase in the number of outliers within the dataset. Both algorithms operate by defining a reference set for each data point, and the outlier scoring is given by contrasting the given data point with the others within its reference set. As the number of outliers in the dataset grows, so does the number of outliers in the reference sets of other outliers. In the extreme case, where all the data points in the reference set of an outlier are also outliers, the outliers vary, relative to the data points in their reference set, as much as the inliers do. In this case, the points detected as local outliers are simply data points that acquire a slight local isolation due to the random generation (which is not consistent with the Ground Truth). This explains the values of ROC-AUC close to 0.5 and AAP close to 0, which is expected for random solutions.

It is particularly interesting to observe that larger data affects the Stability (\(\mathcal{T}\)) of LOF, but not that of GLOSH, while it affects the Coherence (\(\gamma \)) of GLOSH, but not that of LOF when linearly normalized. Stability (\(\mathcal{T}\)) in LOF is affected because these slight local differences vary in each model generated by the \(\mathcal{T}\) sampling. This is not the case in GLOSH, where the outlierness is computed with reference to a clustering that is minimally affected by the \(\mathcal{T}\) sampling. As the dataset size increases, possible irregular distances among points become less extreme in absolute values and frequency. Hence, since GLOSH also maintains a global perspective, outliers are predicted to be less outlying, and the differences in scores between outliers and inliers are reduced. This is detected by the Coherence (\(\gamma \)) drop. \(\gamma \) drops in LOF only when Gaussian normalized, since LOF is not affected by the global perspective and therefore keeps the overlap between outlier and inlier scores lower if linearly normalized.

It is also interesting to note the general trend towards higher Confidence (\(\mathcal{C}\)) as the data size grows. This is statistically consistent and indicates that algorithms tend to be more doubtful for smaller datasets. We might also be seeing an effect of weaker contrast in density variations when the sample size is larger, as discussed in Zimek et al. (2013).

4.4 Effect of dimensionality

This sensitivity analysis consists of increasing the number of dimensions. Figure 5 shows S-curves and plots with comparisons of the studied indices and algorithms.

Due to the effects associated with the curse of dimensionality, the distance- and density-based algorithms are the ones most affected, while the impact on others such as iForest and ABOD is slight, and almost imperceptible in OCSVM. HBOS, however, increases the number of histograms used (one per feature) and, therefore, the granularity of its scores, which also makes the S-curve smoother and more stable. Although accuracy is not affected, the S-curves of K-NN and SDO clearly display the effect of data points becoming sparse and the distances between them becoming more similar to each other: the elbow angle of the S-curves progressively opens beyond a right angle and evolves more gradually. This also causes the loss of DP. Finally, increasing the dimensions has a dramatic impact on LOF, which is disturbed by the leveling of local densities in an increasingly sparse space.
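The flattening of the K-NN and SDO S-curves is essentially the well-known distance-concentration phenomenon, which can be checked in a few lines (an illustrative measurement on i.i.d. Gaussian data, not the paper's datasets).

```python
# Minimal sketch: as dimensionality grows, the nearest and farthest neighbours of
# a point end up at similar distances, so distance-based scores lose contrast.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
for d in (2, 10, 50, 200, 1000):
    X = rng.normal(size=(500, d))
    D = cdist(X, X)
    np.fill_diagonal(D, np.nan)                       # ignore self-distances
    dmin = np.nanmin(D, axis=1)
    dmax = np.nanmax(D, axis=1)
    contrast = ((dmax - dmin) / dmin).mean()          # relative contrast per point
    print(f"d={d:4d}  mean relative contrast={contrast:.2f}")
```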

Particularly interesting is the effect on GLOSH, which, as the dimensions increase, tends to shift the Bias (\(\beta \)) towards high values; this means that more data points are seen as outliers (note also the change in the S-curve and \({\text{RCV}}_I\)). This minimally affects the ranking, since AAP and ROC-AUC remain high.

4.5 Effect of higher outlier ratios

Increasing the proportion of outliers has the expected effect of advancing the start of the abrupt rise of S-curves when linearly normalized. In most cases it also lowers the level of the plateau. Figure 6 shows S-curves and plots with comparisons of the studied indices and algorithms. Recall that in this group of datasets \(i=0\) does not generate the baseline dataset (the most similar is generated with \(i=2\)); instead, a dataset with 1% outliers is obtained. Increasing the outlier rate usually has the effect of decreasing DP. This is because a larger number of outliers spread over a similar space in turn decreases the highest scores; in short, there are more outliers, but they are less outlying. Having more outliers also inflates the \(\mu _S\) used in the Gaussian normalization, resulting in a higher number of data points with a probability equal to 0. By setting the probability of many data points to 0, Gaussian normalization alters the naturally occurring changes, and the S-curves come to show an opposite effect. The inflation of \(\mu _S\) also results in a decrease of the ROC-AUC for LOF and OCSVM when compared to the linear normalization, as the inflated \(\mu _S\) makes some of the outliers receive a probability equal to 0 too. Note also how a higher proportion of outliers usually implies lower Confidence (\(\mathcal{C}\)); that is, outlier scores tend to be more dubious.
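To see why an inflated \(\mu _S\) zeroes out weak outliers, consider the sketch below. It uses the erf-based Gaussian scaling of Kriegel et al. (2011) as a stand-in for the normalization referred to in the text (the exact variant used in the experiments may differ), alongside plain linear min-max scaling; with the synthetic scores shown, every score below \(\mu _S\) collapses to probability 0, and more outliers push \(\mu _S\) and the zero count up.

```python
# Minimal sketch of the two normalizations discussed. The Gaussian scaling follows
# the erf-based formulation of Kriegel et al. (2011): scores below mu_S map to 0,
# so inflating mu_S silences more points.
import numpy as np
from scipy.special import erf

def linear_norm(s):
    s = np.asarray(s, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

def gaussian_norm(s):
    s = np.asarray(s, dtype=float)
    mu, sigma = s.mean(), s.std()
    return np.maximum(0.0, erf((s - mu) / (sigma * np.sqrt(2.0))))

rng = np.random.default_rng(3)
# synthetic raw scores: many moderate inliers plus a growing block of outliers
for n_out in (5, 50):
    scores = np.r_[rng.normal(1.0, 0.2, 950), rng.normal(4.0, 1.0, n_out)]
    print(f"{n_out} outliers: mu_S={scores.mean():.2f}, "
          f"zeros (Gaussian)={int((gaussian_norm(scores) == 0).sum())}, "
          f"zeros (linear)={int((linear_norm(scores) == 0).sum())}")
```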

In general, algorithms adapt well to the increase in the outlier rate without suffering major issues in either accuracy or dynamics. Notable is the effect on the \({\text{RCV}}_I\) of HBOS. Again, the accuracy metrics of LOF and GLOSH (ROC-AUC and AAP) drop when the outlier rate increases. This is again explained by the larger number of outliers in the datasets, which makes some outliers have only other outliers in their reference set, so that they vary with respect to their reference sets as much as inliers do. Note that, although the percentage of outliers in this scenario is higher, their absolute number is smaller: here it goes up to 190 outliers, whereas in the scenario with increasing dataset size it goes up to 2460 outliers; therefore, the effect there is more severe.

4.6 Effect of variable outlier/inlier density differences

Figure 8 shows S-curves and plots with comparisons of the studied indices and algorithms for the reduction of the density difference between the inlier zone and the outlier zone. In general, the effect on the S-curves is slight and smoother than in Sect. 4.5. In contrast, the plateau rises (and/or slopes) as the density difference between the two zones becomes smaller. Reducing the density difference between the inlier zone and the outlier zone tends to reduce DP, especially for the global algorithms. Note that there is a clear correlation between the decrease in density difference and the DP for HBOS, iForest, K-NN, OCSVM and SDO, whereas for the local algorithms (LOF and GLOSH) this correlation is not clear. This is a logical consequence of reducing the dissimilarity between inliers and outliers in terms of outlierness. OCSVM is the algorithm that shows the greatest sensitivity in its dynamics when facing this type of change. Apart from ABOD when linearly normalized, which remains almost unchanged in all analyses due to the effect of extreme inliers, the algorithm that proves to be more robust is LOF. On the other hand, the smaller difference in densities has in HBOS the most negative impact in terms of accuracy, due to the reduced ability of its internal histograms to establish adequate thresholds. Such a disturbance is difficult to detect in the dynamic measurements, although when the Coefficients of Variation \({\text{RCV}}_O\) and \({\text{RCV}}_I\) tend to approximate each other under linear normalization, this usually indicates lower accuracy.

4.7 Sensitivity to multiple density layers

This group of tests always considers a core of high-density central points that comprises 52.6% of the data, while the remaining 47.4% is distributed between 1 (\(i=0\)) and 10 (\(i=9\)) concentric layers of lower density. Figure 9 shows S-curves and plots with comparisons of the studied indices and algorithms. For most algorithms, smoothing the distribution of points in their radial expansion has the effect of increasing the Discrimination Power (DP) and reducing the Coherence (\(\gamma \)), leading in turn to lower accuracy (AAP and ROC-AUC); i.e., the outlierness of the points becomes more difficult to differentiate. The reverse is true for GLOSH and LOF: smoothing the distribution gradient means that the variance in density among points within the reference set of outliers is likely to be greater than that within the reference set of inliers, thus improving accuracy as the number of density layers for the outliers increases. It is worth remembering that all algorithms always use the parameters defined for the baseline case (Fig. 2); that is, the parameters of GLOSH and LOF are adjusted for a dataset with ten times fewer points. In Sect. 4.3 we showed that the number of data points has a strong impact on GLOSH and LOF, both requiring hyperparameter adjustment.
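For reference, this kind of layered geometry can be sketched as follows (radii, counts and the per-layer split are illustrative assumptions, not the exact generator released with the paper).

```python
# Minimal sketch: a dense central core holding 52.6% of the points, with the rest
# spread over concentric rings of decreasing density (equal counts, growing area).
import numpy as np

rng = np.random.default_rng(4)

def layered_dataset(n=1000, n_layers=4, core_frac=0.526):
    n_core = int(n * core_frac)
    core = rng.normal(0.0, 1.0, (n_core, 2))               # high-density core
    per_layer = (n - n_core) // n_layers
    rings = []
    for k in range(1, n_layers + 1):
        r = rng.uniform(2.0 * k, 2.0 * (k + 1), per_layer)  # ring k: radius in [2k, 2k+2)
        a = rng.uniform(0.0, 2.0 * np.pi, per_layer)
        rings.append(np.c_[r * np.cos(a), r * np.sin(a)])
    return np.vstack([core] + rings)

X = layered_dataset(n_layers=6)
print(X.shape)
```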

On the other hand, the algorithm that best copes with the existence of multiple density layers is OCSVM when linearly normalized, being the only one that maintains both high accuracy scores and high Coherence (\(\gamma \)). This means that its scores are congruent and correlate with the different densities, also presenting a high \({\text{RCV}}_O\). Note that in this sensitivity experiment the type of normalization affects the accuracy metrics ROC-AUC and AAP. This can be explained by the same phenomenon discussed in Sect. 4.5: an elevated number of outliers inflates the parameter \(\mu _S\) employed in Gaussian normalization, which leads to an increased prevalence of data points with a probability of 0. It also severely affects Confidence (\(\mathcal{C}\)) due to the reduction in dynamic nuances caused by Gaussian normalization (an example is shown in Fig. 10). We should remark here that an application where a large proportion of points (47.4%) are spread in layers of lower density, although possible, is challenging and at the borderline of potential interpretations of “anomaly” and “outlier”, especially when final binary classifications are to be performed.

Lastly, we highlight that this type of data perturbation particularly affects the dynamics of K-NN and SDO, noticeable in S-curves and Robustness (\(\varphi \)) values.

4.8 Effect of zonification

The appearance of clusters in different zones of the input space has no impact on the dynamics of algorithms such as ABOD, K-NN, LOF or SDO, or a negligible impact beyond what is caused by increasing the size of the dataset. It does, however, affect iForest and HBOS, in both cases even in accuracy. Figure 11 shows S-curves and plots with comparisons of the studied indices and algorithms. The proliferation of clusters poses a challenge to the way outlierness is estimated in both iForest and HBOS. It can benefit or harm HBOS depending on the relative position of the clusters, which might remain poorly captured during the estimation of outlierness scores since features are processed independently (this problem tends to disappear as the number of dimensions increases). For iForest, the existence of clusters reduces the isolation of the points globally, so it tends to increase the Bias and reduce the DP. Finally, an algorithm directly invalidated by the existence of clusters is OCSVM, which is penalized by the non-homogeneity of the inliers: the appearance of new clusters only makes it more difficult to find enclosing hyperplanes. It is noteworthy that for OCSVM the optimal performance is observed exclusively for the dataset with a single cluster (\(i=0\)). For datasets exhibiting multiple clusters, the algorithm performs akin to random, as all data points are likely encompassed by a single hypersphere. In this scenario, the OCSVM kernel hyperparameter should be properly configured to transform the space into a linearly separable problem.
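A quick way to see this failure mode and the role of the kernel configuration is the sketch below (clusters, outlier placement and the two gamma values are illustrative assumptions). With an overly wide RBF kernel, points in the gap between the clusters can look at least as normal as cluster members, degrading the ranking; a tighter kernel separates the clusters and restores it.

```python
# Minimal sketch (illustrative data and gamma values): with an overly wide RBF
# kernel, OCSVM wraps both clusters and the gap between them in one boundary;
# a tighter kernel encloses each cluster separately.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
c1 = rng.normal([0.0, 0.0], 0.5, (480, 2))            # first inlier cluster
c2 = rng.normal([10.0, 10.0], 0.5, (480, 2))          # second inlier cluster
out = rng.uniform(-5.0, 15.0, (40, 2))                # scattered outliers
X = np.vstack([c1, c2, out])
y = np.r_[np.zeros(960), np.ones(40)]

for gamma in (0.001, 1.0):
    scores = -OneClassSVM(kernel="rbf", gamma=gamma, nu=0.05).fit(X).decision_function(X)
    print(f"gamma={gamma}: ROC-AUC={roc_auc_score(y, scores):.3f}")
```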

4.9 Local outliers

Local outliers are generated in this set of experiments by carving holes in the inlier area of the baseline dataset and then adding outlier-labeled points in the center of the holes. These types of scenarios are complicated to solve even for algorithms of a more local nature; moreover, the generation process is not free of noisy labeling. Figure 12 shows S-curves and plots with comparisons of the studied indices and algorithms. The dynamics of the S-curves are relatively stable and improve significantly if Gaussian normalization is applied, which also maximizes Robustness (\(\varphi \)). As a general rule, accuracy (ROC-AUC) and Stability (\(\mathcal{T}\)) tend to decrease as the number of local outliers increases, while the variability in the outlier scores (\({\text{RCV}}_O\)) increases. Scenarios with exclusively local outliers tend to show higher Bias (\(\beta \)) and lower Coherence (\(\gamma \)), although ABOD, LOF and SDO maintain relatively high \(\gamma \) when the rate of outliers is low. Discrimination Power (DP) is also particularly low, except for algorithms designed to have a local character (LOF and GLOSH), since all other algorithms assign only slight outlierness to local outliers.
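The hole-based generation idea can be sketched in a few lines (hole count, radius and inlier distribution are assumptions for illustration, not the released generator).

```python
# Minimal sketch: carve holes in the inlier bulk and place one outlier-labeled
# point at the center of each hole, producing purely local outliers.
import numpy as np

rng = np.random.default_rng(6)
inliers = rng.normal(0.0, 1.0, (2000, 2))              # baseline inlier bulk

hole_centers = rng.normal(0.0, 1.0, (5, 2))            # holes inside the bulk
hole_radius = 0.25
for c in hole_centers:
    inliers = inliers[np.linalg.norm(inliers - c, axis=1) > hole_radius]

X = np.vstack([inliers, hole_centers])                 # one local outlier per hole
y = np.r_[np.zeros(len(inliers)), np.ones(len(hole_centers))]
print(X.shape, int(y.sum()), "local outliers")
```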

4.10 Interpretation and interdependence of indices

A first key remark disclosed by the previous analyses is that the dynamics of the scores are altered by changes in the data in different ways depending on the algorithm, but we also noted that this dependence on the algorithm is considerably reduced if Gaussian normalization is applied (which, as a counterpart, implies a certain loss of information in the scores). A second insight is that, to obtain a consistent interpretation of what the outlierness scores are expressing in a particular case, it is necessary to study the indices together.

In any case, it is worthwhile to globally assess the interdependence of the indices, as well as to reveal some clues that they suggest when considered individually. Figure 13 shows the correlation of all indices in the experiments performed, separated into two groups according to the normalization. Comparing both cases, we find dependencies that hold in both, as well as remarkable differences induced by the type of normalization. Some striking insights are:

  • As expected, the ROC and AAP accuracy indices show a high positive correlation.

  • The Coherence (\(\gamma \)) also shows a strong positive correlation with ROC and AAP, suggesting that a smaller overlap between the ranges of inlier and outlier scores is indicative of a greater ability of these scores to discriminate outliers from inliers correctly.

  • \({\text{RCV}}_I\) is the index most affected by the change of normalization. Recall that Gaussian normalization forces a large majority of inliers to take a value of strictly 0; hence \({\text{RCV}}_I\) is not defined in many cases (division by 0) and, therefore, its correlation measures in Fig. 13 are not representative.

  • Bias (\(\beta \)) and \({\text{RCV}}_I\) show a strong inverse correlation. This indicates that a higher variance in inliers’ scores implies a higher average outlier score in the whole dataset.

  • A higher DP is associated with a lower variation in the scores of the inliers (\({\text{RCV}}_I\)) when the normalization is linear. This is a consequence of the crushing effect that extreme outliers exert over the rest of the scores, especially on inliers, which take values close to 0 with reduced variance. The DP, which is linked to the strength of outliers, also tends to show a positive correlation with \(\gamma \), AAP and ROC.

  • Perini’s Stability (\(\mathcal{T}\)) and Confidence (\(\mathcal{C}\)) do not show strong correlations with other indices. This indicates that, as defined, the indices are not redundant; in other words, it is not possible to make general predictions about Stability and Confidence based on the dynamics and accuracy of outlierness scores.

Fig. 13
figure 13

Pearson correlation between indices in the experiments performed
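The computation behind Fig. 13 reduces to a Pearson correlation matrix over the per-experiment index values. A minimal sketch follows; the values are random placeholders only meant to show the call (the actual index tables are produced by the experiments in the released repository), and the column names are shorthand for the indices discussed above.

```python
# Minimal sketch of the correlation analysis: one row per experiment, one column
# per index; the numbers below are random placeholders, not real results.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
indices = ["ROC", "AAP", "Coherence", "Bias", "DP",
           "RCV_I", "RCV_O", "Stability", "Confidence"]
df = pd.DataFrame(rng.random((60, len(indices))), columns=indices)

corr = df.corr(method="pearson")     # pairwise Pearson correlation between indices
print(corr.round(2))
```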

Finally, Table 5 shows some rough estimations of what the proposed new indices suggest when considered independently.

Table 5 General hints to interpret the proposed indices when taken individually

5 Conclusions

In this paper we have presented novel ways to visualize, evaluate and explain the dynamic response of anomaly detection algorithms. By dynamic response we mean not only how the accuracy is affected, but also how the algorithm changes its way of interpreting and judging the outlierness of a data point. Thus, our proposals measure performance in terms of characteristic shapes, Coherence, Robustness, Coefficients of Variation, Bias and Discrimination Power.

With these methods we have studied eight main anomaly detection algorithms: ABOD, HBOS, iForest, K-NN, LOF, OCSVM, SDO and GLOSH, under linear and Gaussian normalization. We have conducted sensitivity analyses focused on a set of specific variations, namely: (a) increase in the number of data points, (b) increase in the number of dimensions, (c) increase in the outlier rate, (d) variations in the density differences between inliers and outliers, (e) increase of areas with disparate densities, (f) clustering of the input space, and (g) proliferation of local outliers. Additionally, four cases of real application data are described with the proposed measurements in Appendix 2.

We have seen that algorithms have ways of establishing outlierness (S-curves) that are intrinsic to the algorithm and not so much to the data. We have also observed how each algorithm responds differently to distinct perturbations: the same change in the data can severely affect one algorithm while not altering the behavior of another. In this regard, Gaussian normalization reduces the effect of the selected algorithm on score dynamics, but does not avoid or resolve accuracy perturbations caused by data peculiarities that affect the specific algorithm. Gaussian normalization is therefore advisable for its benefits in score interpretation, as long as the associated loss of information is assumed and can be tolerated in the specific use case. When outlier scoring or detection is not the last step of the processing pipeline, raw (or linearly normalized) scores, and even more so the combination of raw scores from different algorithms, may provide richer information about the input space and even help detect algorithm malfunction or inability to deal with the input space.

In a nutshell, studying the dynamic response of algorithms and visualizing how they understand and establish outlierness is key for the correct interpretation of the scores as well as for deciding which algorithm is best suited to the requirements of real applications.

6 Supplementary information

All experiments are available for reproducibility and replication in the Github repository: https://github.com/CN-TU/py-outlier-detection-dynamics, and through a DOI-citable repository with data and results already generated: Iglesias Vazquez, F. (2023). Key Characteristics of Algorithms’ Dynamics Beyond Accuracy - Evaluation Tests (v2). TU Wien. https://doi.org/10.48436/qrj8h-v7816.