Abstract
Outlier detection is an important data mining task that has attracted substantial attention across diverse research communities and application areas. Many techniques have been developed to detect outliers, but most existing research focuses on numerical data. These techniques cannot be applied directly to categorical data because it is difficult to define a meaningful similarity measure for such data. In this paper, we first give a weighted density definition that simultaneously takes into account the density and the uncertainty of objects on every attribute. Based on this weighted density, we then propose a simple and effective outlier detection algorithm for categorical data and analyze its time complexity. Experimental results on real and synthetic data sets demonstrate the effectiveness and efficiency of the proposed algorithm.
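The abstract does not state the weighted density formula, so the following Python sketch is only one plausible reading of the idea, not the authors' definition. It assumes that the density of an object on an attribute is the share of objects with the same value, that the uncertainty of an attribute is measured by the complement entropy sum(p * (1 - p)), and that attributes with less uncertainty receive larger weights. The function names `weighted_density_scores` and `top_outliers` are hypothetical.

```python
from collections import Counter


def weighted_density_scores(rows):
    """Return one score per row; lower scores suggest outlier candidates."""
    n = len(rows)
    columns = list(zip(*rows))  # one tuple of values per attribute
    counts = [Counter(col) for col in columns]

    # Assumed uncertainty measure: complement entropy sum(p * (1 - p)).
    # One minus that entropy equals sum(p ** 2), which is always positive.
    certainty = [sum((c / n) ** 2 for c in cnt.values()) for cnt in counts]
    total = sum(certainty)
    weights = [c / total for c in certainty]  # assumed weights, sum to 1

    # Density of a row on an attribute: share of rows with the same value.
    return [
        sum(w * cnt[value] / n for w, cnt, value in zip(weights, counts, row))
        for row in rows
    ]


def top_outliers(rows, k):
    """Return the indices of the k rows with the lowest weighted density."""
    scores = weighted_density_scores(rows)
    return sorted(range(len(rows)), key=lambda i: scores[i])[:k]


# Toy data: the last row disagrees with the others on both attributes.
data = [("a", "x"), ("a", "x"), ("a", "x"), ("b", "y")]
print(top_outliers(data, 1))
```

Under these assumptions, each attribute needs one counting pass over the data, so scoring is linear in the number of objects times the number of attributes, and only the final ranking requires a sort.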
Acknowledgements
The authors are very grateful to the anonymous reviewers and editor. Their many helpful and constructive comments and suggestions helped us significantly improve this work. This work was supported by the National Natural Science Foundation of China (No. 71031006), the Foundation of Doctoral Program Research of Ministry of Education of China (No. 20101401110002), the Construction Project of the Science and Technology Basic Condition Platform of Shanxi Province (No. 2012091002-0101) and Shanxi Scholarship Council of China (No. 2013-101).
Cite this article
Zhao, X., Liang, J. & Cao, F. A simple and effective outlier detection algorithm for categorical data. Int. J. Mach. Learn. & Cyber. 5, 469–477 (2014). https://doi.org/10.1007/s13042-013-0202-4
DOI: https://doi.org/10.1007/s13042-013-0202-4