Abstract.
The outlier detection problem has important applications in the field of fraud detection, network robustness analysis, and intrusion detection. Most such applications are most important for high-dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms have been proposed for outlier detection that use several concepts of proximity in order to find the outliers based on their relationship to the other points in the data. However, in high-dimensional space, the data are sparse and concepts using the notion of proximity fail to retain their effectiveness. In fact, the sparsity of high-dimensional data can be understood in a different way so as to imply that every point is an equally good outlier from the perspective of distance-based definitions. Consequently, for high-dimensional data, the notion of finding meaningful outliers becomes substantially more complex and nonobvious. In this paper, we discuss new techniques for outlier detection that find the outliers by studying the behavior of projections from the data set.
Similar content being viewed by others
References
Aggarwal CC (2001) Re-designing distance functions and distance based applications for high dimensional data. ACM SIGMOD Rec 30(1):13-18
Aggarwal CC(1999) Fast algorithms for projected clustering. In: Proceedings of ACM SIGMOD, pp 61-72
Aggarwal CC, Yu P (2000) Finding generalized projected clusters in high dimensional spaces. In: Proceedings of ACM SIGMOD, pp 70-81
Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional space. In: Proceedings of ICDT, pp 420-434
Aggarwal CC, Orlin JB, Tai RP (1997) Optimized crossover for the independent set problem. Operat Res 45(2):226-234
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of ACM SIGMOD, pp 94-105
Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of ACM SIGMOD, pp 207-216
Arning A, Agrawal R, Raghavan P (1996) A linear method for deviation detection in large databases. In: Proceedings of KDD, pp 164-169
Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is nearest neighbors meaningful? In: Proceedings of ICDT, pp 217-235
Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of ACM SIGMOD, pp 93-104
Chakrabarti K, Mehrotra S (2000) Local dimensionality reduction: a new approach to indexing high dimensional spaces. In: Proceedings of the VLDB conference, pp 89-104
Darwin C (1859) The origin of species by natural selection. Available at: http://www.literature.org/authors/darwin-charles/the-origin-of-species/
De Jong KA (1975) Analysis of the behaviour of a class of genetic adaptive systems. PhD dissertation, University of Michigan, Ann Arbor, MI
Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD, pp 226-231
Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading, MA
Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Proceedings of ACM SIGMOD, pp 73-84
Hinneburg A, Aggarwal CC, Keim DA (2000) What is the nearest neighbor in high dimensional spaces? In: Proceedings of the VLDB conference, pp 506-515
Holland JH (1975) Adaptation in natural and artificial systems. University of Michigan Press, Ann Arbor, MI
Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4589):671-680
Knorr E, Ng R (1998) Algorithms for mining distance-based outliers in large data sets. In: Proceedings of the VLDB conference, pp 392-403
Knorr E, Ng R (1999) Finding intensional knowledge of distance-based outliers. In: Proceedings of the VLDB conference, pp 211-222
Parthasarathy, S Aggarwal CC (2003) On the use of conceptual reconstruction for mining massively incomplete data sets. IEEE Trans Knowl Data Eng 15(6):1512-1531
Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: Proceedings of ACM SIGMOD, pp 427-438
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Proceedings of ACM SIGMOD, pp 103-114
Author information
Authors and Affiliations
Corresponding author
Additional information
Received: 19 November 2002, Accepted: 6 February 2004, Published online: 19 August 2004
Edited by: R. Ng.
Rights and permissions
About this article
Cite this article
Aggarwal, C.C., Yu, P.S. An effective and efficient algorithm for high-dimensional outlier detection. The VLDB Journal 14, 211–221 (2005). https://doi.org/10.1007/s00778-004-0125-5
Issue Date:
DOI: https://doi.org/10.1007/s00778-004-0125-5