Abstract

Detecting outliers in high-dimensional data is a challenging problem. In such data, the outlying behaviour of a data point can only be detected in the locally relevant subsets of the data dimensions. These subsets of dimensions are called subspaces, and their number grows exponentially with the dimensionality of the data. A data point that is an outlier in one subspace can appear normal in another. To characterise an outlier, it is therefore important to measure its outlying behaviour by the number of subspaces in which it shows up as an outlier. This additional detail can help a data analyst decide whether to remove an outlier, fix it, or keep it unchanged in the dataset. In this paper, we propose an effective outlier detection algorithm for high-dimensional data, based on a recent density-based clustering algorithm called SUBSCALE. We also rank the outliers by the strength of their outlying behaviour. Our outlier detection and ranking algorithm makes no assumptions about the underlying data distribution and can adapt to different density parameter settings. We experimented with different datasets, and the top-ranked outliers were predicted with precision and recall both above 82%.
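
The idea of scoring a point by the number of subspaces in which it is an outlier can be illustrated with a small sketch. This is not the SUBSCALE-based algorithm proposed in the paper: it enumerates every subspace up to a chosen size by brute force, which is only feasible for a handful of dimensions. The function names, the Chebyshev neighbourhood, and the parameters `epsilon`, `min_pts` and `max_dim` are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def subspace_outlier_scores(data, epsilon, min_pts, max_dim):
    """Naive illustration (hypothetical helper, not SUBSCALE).

    For every subspace of 1..max_dim dimensions, a point counts as an
    outlier if fewer than min_pts other points lie within epsilon of it
    (Chebyshev distance) in that subspace. The score of a point is the
    number of subspaces in which it is an outlier.
    """
    n_points, n_dims = data.shape
    scores = np.zeros(n_points, dtype=int)
    n_subspaces = 0
    for k in range(1, max_dim + 1):
        for dims in combinations(range(n_dims), k):
            proj = data[:, list(dims)]
            # Pairwise Chebyshev distances within this subspace.
            dist = np.abs(proj[:, None, :] - proj[None, :, :]).max(axis=2)
            # Neighbour count per point, excluding the point itself.
            neighbours = (dist <= epsilon).sum(axis=1) - 1
            scores += (neighbours < min_pts).astype(int)
            n_subspaces += 1
    return scores, n_subspaces


def rank_outliers(scores):
    """Indices of points with a non-zero score, strongest outliers first."""
    order = np.argsort(-scores, kind="stable")
    return [int(i) for i in order if scores[i] > 0]


# Toy data: 20 points in a tight cluster plus two planted outliers.
cluster = np.array([[0.01 * i, 0.01 * i, 0.01 * i] for i in range(20)])
planted = np.array([
    [5.0, 5.0, 5.0],  # far from the cluster in every dimension
    [5.0, 0.1, 0.1],  # far from the cluster only in dimension 0
])
data = np.vstack([cluster, planted])

scores, n_subspaces = subspace_outlier_scores(
    data, epsilon=0.5, min_pts=3, max_dim=3
)
ranking = rank_outliers(scores)
print(n_subspaces, scores[20], scores[21], ranking)
```

In this toy setting the first planted point is an outlier in all seven subspaces, while the second is an outlier only in the four subspaces that include dimension 0, so the first ranks higher. The exponential growth in the number of subspaces is what makes this brute-force enumeration impractical for genuinely high-dimensional data.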

Author information

Corresponding author

Correspondence to Amitava Datta.

About this article

Cite this article

Kaur, A., Datta, A. Detecting and ranking outliers in high-dimensional data. Int J Adv Eng Sci Appl Math 11, 75–87 (2019). https://doi.org/10.1007/s12572-018-0240-y

  • DOI: https://doi.org/10.1007/s12572-018-0240-y
