Abstract
Heart disease, also known as cardiovascular disease, has been the leading cause of death worldwide over the past few decades. A data-driven prediction model that considers the associated risk factors can play a significant role in early diagnosis in the healthcare domain. However, building an effective model with machine learning techniques depends on the quality of the data, in particular on data that are free of "anomalies" or outliers. This research investigates anomaly detection in the healthcare domain to predict heart disease effectively using the unsupervised K-means clustering algorithm. Our proposed model first determines an optimal value of K using the Silhouette method and forms the clusters used to find the anomalies. We then eliminate the identified anomalies from the data and employ five popular machine learning classification techniques, namely K-nearest neighbor, random forest, support vector machine, naive Bayes, and logistic regression, to build the resultant prediction model. The efficacy of the proposed methodology is demonstrated on a standard heart disease dataset. In our experimental analysis, we also plot the data to check the accuracy of the anomaly detection.
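The anomaly-detection stage described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the rule that turns clusters into anomalies, so the sketch assumes a point is anomalous when its distance to its own centroid exceeds `factor` times the median distance in that cluster. The function names, the `factor` parameter, and the deterministic farthest-first seeding of K-means are all hypothetical choices made for the example.

```python
import math
import statistics


def kmeans(points, k, iters=100):
    """Lloyd's K-means with deterministic farthest-first seeding (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous centroid.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Rousseeuw, 1987)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0  # undefined for one cluster; 0.0 is a neutral placeholder
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_clustering(points, k_values):
    """Return (k, centroids, labels) for the K with the highest mean silhouette."""
    best = None
    for k in k_values:
        centroids, labels = kmeans(points, k)
        score = silhouette(points, labels)
        if best is None or score > best[0]:
            best = (score, k, centroids, labels)
    return best[1:]


def flag_anomalies(points, centroids, labels, factor=2.0):
    """Assumed rule: flag points farther than factor * cluster median distance."""
    dists = [math.dist(p, centroids[lab]) for p, lab in zip(points, labels)]
    flagged = []
    for j in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == j]
        limit = factor * statistics.median(dists[i] for i in idx)
        flagged.extend(i for i in idx if dists[i] > limit)
    return sorted(flagged)
```

In a full pipeline, the rows whose indices are returned by `flag_anomalies` would be dropped before training the five classifiers. Note that under this assumed rule a singleton cluster is never flagged, so a different criterion (for example, treating very small clusters as anomalous) may be preferable when outliers are extreme.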
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
About this article
Cite this article
Ripan, R.C., Sarker, I.H., Hossain, S.M.M. et al. A Data-Driven Heart Disease Prediction Model Through K-Means Clustering-Based Anomaly Detection. SN COMPUT. SCI. 2, 112 (2021). https://doi.org/10.1007/s42979-021-00518-7
DOI: https://doi.org/10.1007/s42979-021-00518-7