Abstract
Scientific data are often high-dimensional. In such data, finding outliers is challenging because they are often hidden in subspaces, i.e., lower-dimensional projections of the data. Recent approaches to outlier mining decouple the actual detection of outliers from the search for subspaces likely to contain outliers. However, finding a set of subspaces that contains most or even all outliers of a given data set remains an open problem. While previous proposals use per-subspace measures such as correlation to quantify the quality of subspaces, we explicitly take the relationship between subspaces into account and propose a dimension-based measure of that quality. Based on it, we formalize the notion of an optimal set of subspaces and propose the Greedy Maximum Deviation heuristic to approximate this set. Experiments on comprehensive benchmark data show that our concept is more effective in determining the relevant set of subspaces than approaches that use per-subspace measures.
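To make the decoupling of subspace search and outlier detection concrete, the following minimal sketch scores every object in each subspace of a given set and averages the scores. It is not the Greedy Maximum Deviation heuristic and does not implement the dimension-based quality measure. The k-nearest-neighbour distance is an assumed stand-in for a detector, and the names `knn_score`, `ensemble_scores`, and the toy `data` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]
Subspace = Tuple[int, ...]  # indices of the attributes that span the subspace


def knn_score(data: List[Point], subspace: Subspace, k: int = 1) -> List[float]:
    """Distance to the k-th nearest neighbour, using only the attributes in `subspace`."""
    projected = [[p[d] for d in subspace] for p in data]
    scores = []
    for i, p in enumerate(projected):
        dists = sorted(math.dist(p, q) for j, q in enumerate(projected) if j != i)
        scores.append(dists[k - 1])
    return scores


def ensemble_scores(data: List[Point], subspaces: List[Subspace], k: int = 1) -> List[float]:
    """Average the per-subspace scores; the subspace set is an input, not searched here."""
    total = [0.0] * len(data)
    for subspace in subspaces:
        for i, score in enumerate(knn_score(data, subspace, k)):
            total[i] += score
    return [t / len(subspaces) for t in total]


# Toy data (assumption): five objects on the diagonal and one object (index 5)
# whose single attribute values are ordinary but whose combination is not.
data = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0), (5.0, 5.0), (1.0, 5.0)]

only_relevant = ensemble_scores(data, [(0, 1)])
with_irrelevant = ensemble_scores(data, [(0,), (1,), (0, 1)])

top_relevant = max(range(len(data)), key=lambda i: only_relevant[i])
top_mixed = max(range(len(data)), key=lambda i: with_irrelevant[i])
```

In this toy setting, object 5 is invisible in either one-dimensional projection and stands out only in the two-dimensional subspace. Averaging in the two one-dimensional subspaces dilutes its score, which illustrates why the choice of the subspace set matters for the detection step.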
Acknowledgements
This work was supported by the German Research Foundation (DFG) as part of the Research Training Group GRK 2153: Energy Status Data – Informatics Methods for its Collection, Analysis and Exploitation.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
About this article
Cite this article
Trittenbach, H., Böhm, K. Dimension-based subspace search for outlier detection. Int J Data Sci Anal 7, 87–101 (2019). https://doi.org/10.1007/s41060-018-0137-7
DOI: https://doi.org/10.1007/s41060-018-0137-7