What do anomaly scores actually mean? Dynamic characteristics beyond accuracy

Félix Iglesias Vázquez¹,
Henrique O. Marques²,
Arthur Zimek² &
…
Tanja Zseby¹

386 Accesses
1 Altmetric
Explore all metrics

Abstract

Anomaly detection has become pervasive in modern technology, covering applications from cybersecurity, to medicine or system failure detection. Before outputting a binary outcome (i.e., anomalous or non-anomalous), most algorithms evaluate instances with outlierness scores. But what does a score of 0.8 mean? Or what is the practical difference compared to a score of 1.2? Score ranges are assumed non-linear and relative, their meaning established by weighting the whole dataset (or a dataset model). While this is perfectly true, algorithms also impose dynamics that decisively affect the meaning of outlierness scores. In this work, we aim to gain a better understanding of the effect that both algorithms and specific data particularities have on the meaning of scores. To this end, we compare established outlier detection algorithms and analyze them beyond common metrics related to accuracy. We disclose trends in their dynamics and study the evolution of their scores when facing changes that should render them invariant. For this purpose we abstract characteristic S-curves and propose indices related to discriminant power, bias, variance, coherence and robustness. We discovered that each studied algorithm shows biases and idiosyncrasies, which habitually persist regardless of the dataset used. We provide methods and descriptions that facilitate and extend a deeper understanding of how the discussed algorithms operate in practice. This information is key to decide which one to use, thus enabling a more effective and conscious incorporation of unsupervised learning in real environments.

On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study

Article 16 January 2016

Outlier Detection Techniques: A Comparative Study

A Survey on Anomaly Detection Strategies

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Nowadays countless applications use some kind of anomaly detection, for instance: protection against cyber-attacks, fraud detection, fault prognosis in advanced machinery, identification of artifacts in medical imaging, prevention of critical situations in healthcare, control of cyber-physical and industrial systems, etc.

The meaning of “anomaly” may vary from one application to another, but it usually covers three fundamental semantic senses: abnormality (implying rareness), outlier (or noise), and/or novelty (Ruff et al. 2021). Among the multiple ways to address the challenge of detecting anomalies, in many cases unsupervised learning algorithms are used, more specifically those belonging to the family of outlier detection. In their elementary performance, these algorithms aim at labeling instances as either “normal” or “anomalous”, usually by evaluating their similarity with other instances or with a collection of past instances, which are commonly processed as data points. Even if the final result is expressed as a dichotomy, most algorithms internally estimate a score or probability regarding the quality of outlierness of each data point, as well as a ranking that establishes which data point is the most oulier and which one is the least outlier. In many cases, it is the application itself that requires such a complete outcome; that is, not only knowing if a data point is an outlier, but to what extent it is an outlier.

However, in some cases, it would not even be appropriate to refer the internal scores as “outlierness scores”, since their dynamic characteristics many times cannot be correctly interpreted in terms of “outlier quality” (Kriegel et al. 2011). To give an example: if the score of point i is twice that of point j, that rarely means that its outlierness is twice as large. Moreover, it seems logical to assume that if two algorithms output equivalent binary solutions—or even the same set of ranks—it is because their scores have similar dynamics. Our study refutes this hypothesis.

We can anticipate these intuitions with the simple example of Fig. 1, in which ten points are clustered around the origin, while two outliers appear at (1,1) and (2,2). Here, different outlier detection algorithms (introduced in Sect. 2.3) provide different scores. The outlier (2,2) gets the maximum score by all algorithms, which translates to 1 if normalized. However, the difference to the normalized score of outlier (1,1) ranges from 0 (ABOD, HBOS) to 0.51 (SDO, K-NN). Some of the algorithms give scores that are proportional to the distance to the data bulk (K-NN, OCSVM, SDO), others might suggest estimations closer to probabilities (iForest, LOF, GLOSH, HBOS), and others neither of both options (ABOD).

Understanding the dynamic characteristics of the scores means realizing how their distributions depend on both the data and the algorithm used. This can help to infer how the difference between the scores of two points translates into the feature space.

The dynamic characteristics of the scores used by outlier detection algorithms have not received much attention in the literature, even though they have a key impact on the final results and their knowledge in contrast to the nature of the data determines the success and adequacy to the final application. In other words, in addition to facilitating a better interpretation of the scores per se, understanding score dynamics allows us to select the most appropriate algorithm for a particular real-life scenario, refine the criteria for setting thresholds between inliers and outliers, and anticipate how changes in data geometries will affect the scores. In this paper, we study the topic and explore whether algorithms exhibit dynamics of their own independent of the data and how variations in specific data aspects affect such dynamics. Among the different types of anomalies categorized (Chandola et al. 2009), we focus on the most common type, i.e., points anomalies, either global or local. Our study uncovers particular algorithm idiosyncrasies and how each of them shows sensitivity to a different pattern of perturbation. Therefore, the main contributions of this work are:

We provide an extensive and critical overview of methods for evaluating outlier detection algorithms beyond accuracy.
We propose a set of novel measures to abstract and capture the dynamic behavior of outlier scores. These measures serve to gain a deeper understanding of algorithms and, at the same time, provide information to understand and diagnose the data space as a whole.
We study and reveal through sensitivity analysis how established anomaly detection algorithms are perturbed and change their behavior when facing categorized variations in data geometries.
We analyze the effect of Gaussian normalization on score dynamics. Gaussian normalization is established in the field as an agnostic method for generating probabilistically interpretable scores (Kriegel et al. 2011).
Our experiments disclose knowledge about the studied algorithms that facilitates the selection of the most appropriate alternative according to the data and feature space characteristics.

The rest of the paper is organized as follows: Sect. 2 brings up some relevant works in the field of outlier detection calibration and introduces the studied algorithms. Section 3 discusses traditional evaluations in terms of accuracy and presents new methods of assessing the dynamic characteristics of performances. Section 4 conducts different sensitivity analyses and discusses algorithms’ responses. Conclusions are exposed in Sect. 5.

2 Anomaly detection

2.1 Comparison of algorithms

Numerous studies have discussed and compared existing methods for anomaly detection. They commonly align with two main trends: either exploring the particularities of the problem, its cases, extent and details in different fields of application (e.g., Chandola et al. 2009; Zimek and Filzmoser 2018; Ruff et al. 2021; Boukerche et al. 2020; Blázquez-García et al. 2021); or by conducting exhaustive comparisons between a wide variety of algorithms and datasets (e.g., Campos et al. 2016; Falcão et al. 2019; Domingues et al. 2018; Steinbuss and Böhm 2021; Han et al. 2022).

Some of these surveys also commonly focus on specific fields, for instance, network traffic (Ahmed et al. 2016), urban traffic (Djenouri and Zimek 2018), or medical applications (Fernando et al. 2021). However, in the related literature, evaluations and comparisons are made almost exclusively in terms of accuracy, i.e., to what extent algorithms under test are able to match the ideal partition expressed by a certain Ground Truth. Observations on why some algorithms fail in certain circumstances, or what their weaknesses are, or what properties the data must have in order for the algorithms to work properly, are rather given in papers presenting or describing new methods. Hence, our work is framed within studies that delve deeper into space analysis and aim to reveal the bidirectional effect and interdependency between algorithms and data. Beyond acknowledging that the performance of outlier detection methods depends on the dataset characteristics and preprocessing steps (Kandanaarachchi et al. 2020), we show how data peculiarities differently affects the dynamical behavior of analysis methods.

2.2 Interpretation of anomaly scores

The dynamic characteristics of algorithms are scarcely discussed. Here, the work by Kriegel et al. (2011) is a notable exception, where it is observed that “(s)tate-of-the-art outlier scores are not standardized and often hard to interpret. Scores of objects from different data sets and even scores of objects from the same data set cannot be compared”. To mitigate this inconvenience, Kriegel et al. design a method to regularize and normalize different scores “aiming (i) at an increased contrast between outlier scores vs. inlier scores and (ii) at deriving a rough probability value for being an outlier or not”. This initiative aligns with the work of Gao and Tan (2006), who propose transforming scores into probability estimates in order to facilitate the construction of ensembles. As empirical proof of this concept we could see, e.g., the implementation of an ensemble using scores transformed into probabilities, as well as the evaluation of its suitability (Schubert et al. 2012; Bauder and Khoshgoftaar 2017). Regularized and normalized scores are also used to determine the weight or importance of an object for internal evaluation measures (Marques et al. 2020, 2022).

However, rather than measuring the dynamic behavior of the scores, these studies attempt to overcome their variability or arbitrariness. It is only recently that measures to evaluate the stability of outlierness rankings against perturbations have been proposed by Perini et al. (2020a), thus providing a quantifiable indicator of algorithms’ stability. The same research group, by using the transformation into probabilities approach, also propose a method to estimate the confidence of algorithms in their outlierness scores (Perini et al. 2020b). The concept of generating a confidence score for an instance being anomalous is termed calibrated anomaly detection (Menon and Williamson 2018). Averaging confidence scores is then a way to provide global assessments of the algorithm’s confidence when processing a dataset. However, although very useful, these emerging methods are still limited for explaining the dynamics of outlier scoring algorithms.

2.3 Algorithms for outlier scoring

Algorithms perform different strategies to assign outlierness scores to data points. Behind these strategies lie the principles that shape their dynamic behavior. We can highlight some decisive aspects.

Problem space The most traditional and straightforward approach to scoring outliers is by calculating point distances in the input space, therefore deriving an outlierness score that is based on how close or far a given data point is with respect to other data points. It seems logical to think that measurements in the input space—usually Euclidean distances—would be the natural way to face the problem; however, not all feature spaces are consistent with a geometrical space interpretation, in addition to the complications inherent to distance measurements in high-dimensional spaces, i.e., the curse of dimensionality (Zimek et al. 2012; Thirey and Hickman 2015). Hence, some methods transform, project or simplify the input space into a different space that presents some kind of advantage or avoids some of the mentioned disadvantages.
Core properties Mixed data sets, hierarchically structured data or an increase in dimensionality are reasons that point distances may be perceived as a suboptimal property on which to solely base outlierness, opting instead for estimates of density, angles between points, similarity measurements, ad hoc functions or properties intrinsic to the space transformation employed. On the other hand, the advantage of distance-based measurements is their easy interpretability and proportionality when mapped to outlierness.
Perspective Another key aspect when deciding if a data point is an outlier is to determine compared to what? Most methods take an overall perspective and consider the whole dataset as a reference, whereas a few algorithms focus on the immediate proximity of the data point instead. The second option assumes that the anomaly is in its environment, while the first assumes that it is in the global picture. Both nuances are valid and, in practical cases, not always easy to distinguish (Schubert et al. 2014). Here, it is worth mentioning contextual (or conditional) anomalies, i.e., when they occur outside their usual context, although this requires the existence of features defined as “contextual” (Li and van Leeuwen 2023) or data with a temporal dimension (Hartl et al. 2024). In our work we focus on (local and global) point anomalies. However, some of the algorithms studied are also capable of capturing other types of anomalies.
Use of models The last aspect to mention is whether the algorithm uses a model to perform the estimations or, on the contrary, the computation is performed over normal data points. In the second case, all points are basically used only in the theoretical description of the algorithm, while practical implementations tend to reduce operations to the subset of the k-nearest neighbors (usually for computational reasons, but not only). Using models aims to speed up implementation phases, but also to obtain an abstraction that can help explain the data and extract further knowledge.

To investigate the dynamic characteristics of outlier detection scores, we select eight algorithms according to their popularity or novelty, but also looking for operation principles with a distinct basis. The four aspects exposed above serve as guidance to gather representatives (briefly contrasted in Table 1): ABOD, HBOS, iForest, K-NN, LOF, OCSVM, SDO, and GLOSH. In the following we provide an intuitive introduction while referring the interested reader to the original sources for further details:

Table 1 Key features of the selected algorithms

What do anomaly scores actually mean? Dynamic characteristics beyond accuracy

Abstract

Similar content being viewed by others

On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study

Outlier Detection Techniques: A Comparative Study

A Survey on Anomaly Detection Strategies

Explore related subjects

1 Introduction

2 Anomaly detection

2.1 Comparison of algorithms

2.2 Interpretation of anomaly scores

2.3 Algorithms for outlier scoring

3 Evaluation methods

3.1 General notation

3.2 Accuracy indices

3.3 Internal validation

3.4 Stability and confidence

3.5 Estimating algorithm dynamics

4 Sensitivity analysis

4.1 Experimental setup

4.2 Overall performances

4.2.1 Dynamics of linearly normalized scores

4.2.2 Dynamics of Gaussian normalized scores

4.3 Effect of larger datasets

4.4 Effect of dimensionality

4.5 Effect of higher outlier ratios

4.6 Effect of variable outlier/inlier density differences

4.7 Sensitivity to multiple density layers

4.8 Effect of zonification

4.9 Local outliers

4.10 Interpretation and interdependence of indices

5 Conclusions

6 Supplementary information

Notes

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Appendices

Appendix 1: Performance tables

Appendix 2: Evaluation on real datasets

1.1 Appendix 2.1: S-curves (linear normalization)

1.2 Appendix 2.2: S-curves (Gaussian normalization)

1.3 Appendix 2.3: Performance measurements (linear normalization)

1.4 Appendix 2.4: Performance measurements (Gaussian normalization)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation