Abstract
Outlier detection, also known as anomaly detection, is a common data mining task in identifying data points that are outside expected patterns in a given dataset. It has useful applications such as network intrusion, system faults, and fraudulent activity. In addition, real world data are uncertain in nature and they may be represented as uncertain data. In this paper, we propose an improved parallel algorithm for outlier detection on uncertain data using density sampling and develop an implementation running on both GPUs and multi-core CPUs, using the OpenCL framework. Our main focus is on GPUs, as they are a cost effective massively parallel floating point processor that is suitable for many data mining applications. Our implementation exploits some key features in GPUs, and is significantly different from a traditional CPU implementation. We first present an improved uncertain outlier detection algorithm. Then, we demonstrate two parallel micro-clustering implementations. The performance and detection quality comparisons demonstrate the benefits of the improved algorithm and parallel implementation on GPUs.
Similar content being viewed by others
References
Acklam, P.J.: An algorithm for computing the inverse normal cumulative distribution function. Tech. Rep. (2003)
Advanced Micro Devices Inc: AMD accelerated parallel processing opencl programming guide
Aggarwal, C.C. (ed.): Managing and Mining Uncertain Data. Springer, New York, NY (2009)
Aggarwal, C.C., Yu, P.S.: Outlier detection with uncertain data. In: Proceedings of the SIAM International Conference on Data Mining 2008 (2008)
Aggarwal, C.C., Yu, P.S.: A survey of uncertain data algorithms and applications. IEEE TKDE 21(5), 609–623 (2009)
Alshawabkeh, M., Jang, B., Kaeli, D.: Accelerating the local outlier factor algorithm on a GPU for intrusion detection systems. In: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (2010)
Azmandian, F., Yilmazer, A., Dy, J.G., Aslam, J.A., Kaeli, D.R.: GPU-accelerated feature selection for outlier detection using the local kernel density ratio. In: Proceedings of the 12th IEEE ICDM (2012)
Bastke, S., Deml, M., Schmidt, S.: Combining statistical network data, probabilistic neural networks and the computational power of GPUs for anomaly detection in computer networks. In: 1st Workshop on Intelligent Security (Security and Artificial Intelligence) (2009)
Bolton, R.J., Hand, D.J.: Statistical fraud detection: a review. Stat. Sci. 17(3), 235–255 (2002)
Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: Proceedings of SIGMOD 2000 (2000)
Chau, M., Cheng, R., Kao, B., Ng, J.: Uncertain data mining: an example in clustering location data. In: Proceedings of the 10th PAKDD (2006)
Denoeux, T.: Maximum likelihood estimation from uncertain data in the belief function framework. IEEE TKDE 25(1), 119–130 (2013)
Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (1996)
Angiulli, F., Basta, S., Pizzuti, C.: Distance-based detection of outliers. IEEE TKDE 18(2), 145–160 (2006)
Fang, W., Lau, K.K., Lu, M., Xiao, X., Lam, C.K., Yang, P.Y., He, B., Luo, Q., Sander, P.V., Yang, K.: Parallel data mining on graphics processors. Tech. Rep., Hong Kong University of Science and Technology (2008)
He, B., Govindaraju, N.K., Luo, Q., Smith, B.: Efficient gather and scatter operations on graphics processors. In: Proceedings of the ACM/IEEE Conference on Supercomputing (2007)
Hung, E., Cheung, D.W.: Parallel mining of outliers in large database. Distrib. Parallel Databases 12(1), 5–26 (2002)
Kao, B., Lee, S.D., Cheung, D.W., Ho, W.S., Chan, K.F.: Clustering uncertain data using voronoi diagrams. In: Proceedings of the 8th IEEE ICDM (2008)
Khronos Group: OpenCL. http://www.khronos.org/opencl/ (2011). Accessed 9 October 2012
Knorr, E.M., Ng, R.T.: Algorithms for mining distance-based outliers in large datasets. In: Proceedings of VLDB 1998 (1998)
Knorr, E.M., Ng, R.T.: Finding intensional knowledge of distance-based outliers. In: Proceedings of VLDB 1999, pp. 211–222 (1999)
Kriegel, H.P., Pfeifle, M.: Density-based clustering of uncertain data. In: Proceedings of the 11th ACM SIGKDD (2005)
Kriegel, H.P., Pfeifle, M.: Heirarchical density-based clustering of uncertain data. In: Proceedings of the 5th IEEE ICDM (2005)
Krulis, M., Skopal, T., Lokoc, J., Beecks, C.: Combining CPU and GPU architectures for fast similarity search. Distrib. Parallel Databases 30, 179–207 (2012)
Lan, Z., Zheng, Z., Li, Y.: Toward automated anomaly identification in large-scale systems. IEEE TPDS 21(2), 174–187 (2010)
Lozano, E., Acuna, E.: Parallel algorithms for distance-based and density-based outliers. In: Proceedings of the 5th IEEE ICDM (2005)
Lu, M., Tan, Y., Bai, G., Luo, Q.: High-performance short sequence alignment with GPU acceleration. Distrib. Parallel Databases 30, 385–399 (2012)
Marsaglia, G.: Xorshift RNGs. J. Stat. Softw. 8(14), 1–6 (2003)
Matsumoto, T., Hung, E.: Accelerating outlier detection with uncertain data using graphics processors. In: Advances in Knowledge Discovery and Data Mining, vol. LNCS 7302, pp. 169–180 (2012)
Micikevicius, P.: Analysis-driven optimization. In: GPU Technology Conference 2010 (2010)
Murakami, T., Kasahara, R., Saito, T.: An implementation and its evaluation of password cracking tool parallelized on GPGPU. In: Proceedings of the 2010 International Symposium on Communications and Information Technologies (2010)
Ngai, W.K., Kao, B., Chui, C.K., Cheng, R., Chau, M., Yip, K.Y.: Efficient clustering of uncertain data. In: Proceedings of the 6th IEEE ICDM (2006)
NVIDIA Corporation: CUDA. http://www.nvidia.com/object/cuda_home_new.html (2011). Accessed 9 October 2012
Reif, M., Goldstein, M., Stahl, A.: Anomaly detection by combining decision trees and parametric densities. In: 19th International Conference on Pattern Recognition 2008 (2008)
Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: Proceedings of the 2000 ACM SIGMOD (2000)
Sequeria, K., Zaki, M.: ADMIT: Anomaly-based data mining for intrusions. In: Proceedings of the 8th ACM SIGKDD (2002)
Tang, J., Chen, Z., Fu, A.W., Cheung, D.W.: Capabilities of outlier detection schemes in large datasets, framework and methodologies. Knowl. Inf. Syst. 11(1), 45–84 (2006)
Tarabalka, Y., Haavardsholm, T.V., Kaasen, I., Skauli, T.: Real-time anomaly detection in hyperspectral images using multivariate normal mixture models and GPU processing. J. Real-Time Image Process. 4(3), 287–300 (2009)
Wang, L., Cheung, D.W.L., Cheng, R., Lee, S.D., Yang, X.S.: Efficient mining of frequent item sets on large uncertain databases. IEEE TKDE 24(12), 2170–2183 (2012)
Zhanchun, G., Yuying, L.: Improving the collaborative filtering recommender system by using GPU. In: Proceedings of 2012 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (2012)
Zhang, Y., Lin, X., Tao, Y., Zhang, W., Wang, H.: Efficient computation of range aggregates against uncertain location-based queries. IEEE TKDE 24(7), 1244–1258 (2012)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Matsumoto, T., Hung, E. & Yiu, M.L. Parallel outlier detection on uncertain data for GPUs. Distrib Parallel Databases 33, 417–447 (2015). https://doi.org/10.1007/s10619-014-7155-9
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10619-014-7155-9