Abstract
Detecting outliers in high-dimensional data is a challenging problem. In such data, the outlying behaviour of a data point can only be detected in the locally relevant subsets of the data dimensions. These subsets of dimensions are called subspaces, and their number grows exponentially with the dimensionality of the data. A data point that is an outlier in one subspace can appear normal in another. To characterise an outlier, it is therefore important to measure its outlying behaviour by the number of subspaces in which it shows up as an outlier. This additional detail can help a data analyst decide whether to remove an outlier, fix it, or keep it unchanged in the dataset. In this paper, we propose an effective outlier detection algorithm for high-dimensional data, based on a recent density-based clustering algorithm called SUBSCALE. We also rank the outliers by the strength of their outlying behaviour. Our outlier detection and ranking algorithm makes no assumptions about the underlying data distribution and can adapt to different density parameter settings. We experimented with different datasets, and the top-ranked outliers were predicted with more than 82% precision and recall.
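The idea of ranking a point by the number of subspaces in which it is outlying can be illustrated with a small brute-force sketch. This is not the SUBSCALE-based algorithm of the paper: it enumerates every subspace explicitly, which is only feasible for a handful of dimensions. The density criterion is an assumption made for illustration (a point counts as outlying in a subspace when fewer than `min_pts` other points lie within `eps` of it on every dimension of that subspace), and the names `count_subspaces`, `outlier_scores`, `rank_outliers`, `eps`, `min_pts` and `max_dim` are hypothetical.

```python
from itertools import combinations


def count_subspaces(d):
    """Number of non-empty subsets of d dimensions: 2^d - 1."""
    return 2 ** d - 1


def outlier_scores(points, eps, min_pts, max_dim):
    """For each point, count the subspaces in which it is sparse.

    Assumed criterion (illustrative only): a point is outlying in a
    subspace when fewer than min_pts other points lie within eps of it
    on every dimension of that subspace.
    """
    n = len(points)
    d = len(points[0])
    scores = [0] * n
    # Enumerate all subspaces with 1..max_dim dimensions (exponential in d).
    for k in range(1, max_dim + 1):
        for dims in combinations(range(d), k):
            for i in range(n):
                neighbours = sum(
                    1
                    for j in range(n)
                    if j != i
                    and all(abs(points[i][a] - points[j][a]) <= eps for a in dims)
                )
                if neighbours < min_pts:
                    scores[i] += 1
    return scores


def rank_outliers(scores):
    """Point indices ordered from strongest to weakest outlying behaviour."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy data: points 0-2 form a dense group, point 3 deviates only on the
# third dimension, and point 4 deviates on all three dimensions.
points = [
    (1.0, 1.0, 1.0),
    (1.1, 0.9, 1.0),
    (0.9, 1.1, 1.1),
    (1.0, 1.0, 5.0),
    (5.0, 5.0, 5.0),
]

scores = outlier_scores(points, eps=0.3, min_pts=2, max_dim=3)
ranking = rank_outliers(scores)
print(scores)               # [0, 0, 0, 4, 7]
print(ranking)              # [4, 3, 0, 1, 2]
print(count_subspaces(20))  # 1048575
```

In this toy example, point 3 looks normal in the subspaces built from the first two dimensions and is flagged only in the four subspaces that include the third dimension, while point 4 is flagged in all seven. The last line shows why exhaustive enumeration does not scale: 20 dimensions already yield more than a million subspaces.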
Cite this article
Kaur, A., Datta, A. Detecting and ranking outliers in high-dimensional data. Int J Adv Eng Sci Appl Math 11, 75–87 (2019). https://doi.org/10.1007/s12572-018-0240-y
DOI: https://doi.org/10.1007/s12572-018-0240-y