COVID-19 Prediction Applying Supervised Machine Learning Algorithms with Comparative Analysis Using WEKA
Figure 1. The five phases of this study: data collection, preprocessing, modelling, comparative analysis, and selection of the best model for determining COVID-19 presence.
Figure 2. The main page of WEKA, displaying its five modules: Explorer, Experimenter, Knowledge Flow, Workbench and Simple CLI.
Figure 3. The Preprocess tab of WEKA Explorer, showing the composition of the imported dataset and some visualizations. Users can click the Visualize All button to become more familiar with the dataset.
Figure 4. The Preprocess tab of WEKA Explorer, showing the result of balancing the dataset through SMOTE. The bar graph represents the number of instances per class.
Figure 5. The Preprocess tab of WEKA Explorer, showing the result of balancing the dataset through Spread Subsample, which undersamples the majority class to make it equal in size to the minority class.
Figure 6. The Classify tab of WEKA Explorer, in which the user can choose different algorithms to apply to the dataset. The details of the developed model's performance are displayed in the classifier output section.
Figure 7. The design of the process in the WEKA Knowledge Flow module, from loading the dataset, through training and testing with machine learning algorithms, to evaluating classification performance.
Figure 8. Bar chart of the major accuracy measures of the model developed with each algorithm.
Figure 9. Bar graph of the number of correctly classified instances for each developed model. The blue bars represent correctly classified instances and the red bars represent misclassified instances.
Figure 10. Bar charts of the kappa statistic (a), mean absolute error (b) and time taken to build the model (c).
Abstract
1. Introduction
2. Materials and Methods
2.1. Data Collection
2.2. Data Processing
2.3. Modelling
- J48 Decision Tree
- Random Forest
- Naïve Bayes
- Support Vector Machine
- k-Nearest Neighbors
2.4. Comparative Analysis
- Accuracy
- Correctly and Incorrectly Classified Instances
- Kappa Statistic
- Mean Absolute Error (MAE)
- Time taken to build the model
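As a rough illustration of two of the measures listed above, the sketch below computes Cohen's kappa from a confusion matrix and a mean absolute error over predicted probabilities. The function names and the toy inputs are hypothetical and are not taken from the study; the MAE helper assumes the error is measured between the predicted positive-class probability and the 0/1 label, which is one common reading of the measure.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows = actual, columns = predicted)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column marginals, summed over classes.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected)


def mean_absolute_error(predicted_probs, actual_labels):
    """Average absolute gap between predicted probabilities and 0/1 labels."""
    errors = [abs(p - y) for p, y in zip(predicted_probs, actual_labels)]
    return sum(errors) / len(errors)


# Toy inputs for illustration only (not results from the study).
toy_matrix = [[45, 5], [10, 40]]
kappa = cohen_kappa(toy_matrix)
mae = mean_absolute_error([0.9, 0.2, 0.6, 0.1], [1, 0, 1, 0])
print(f"kappa = {kappa:.3f}, MAE = {mae:.3f}")
```

With perfectly balanced actual classes, as in the toy matrix, chance agreement is 0.5, so kappa reduces to twice the accuracy minus one.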
2.5. Finding the Best Model
- Highest accuracy, precision, recall and F-measure;
- Highest correctly classified instances;
- Lowest incorrectly classified instances;
- Highest kappa statistic score;
- Lowest mean absolute error;
- Lowest time taken to build the model.
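One way to apply these criteria mechanically is to count, for each algorithm, how many measures it wins or ties. The sketch below does this with the figures reported in the article's two comparative-analysis tables; the scoring rule itself (one point per measure, ties shared) is an assumption made for illustration, not a procedure stated by the authors.

```python
# Figures copied from the comparative-analysis tables in this article.
RESULTS = {
    "J48 DT": {"accuracy": 98.57, "precision": 0.986, "recall": 0.986, "f_measure": 0.986,
               "correct": 4144, "incorrect": 60, "kappa": 0.972, "mae": 0.024, "time_s": 0.03},
    "RF":     {"accuracy": 98.81, "precision": 0.988, "recall": 0.988, "f_measure": 0.988,
               "correct": 4154, "incorrect": 50, "kappa": 0.976, "mae": 0.023, "time_s": 0.18},
    "SVM":    {"accuracy": 98.81, "precision": 0.988, "recall": 0.988, "f_measure": 0.988,
               "correct": 4154, "incorrect": 50, "kappa": 0.976, "mae": 0.012, "time_s": 3.12},
    "k-NN":   {"accuracy": 98.69, "precision": 0.987, "recall": 0.987, "f_measure": 0.987,
               "correct": 4149, "incorrect": 55, "kappa": 0.973, "mae": 0.022, "time_s": 0.01},
    "NB":     {"accuracy": 93.98, "precision": 0.940, "recall": 0.940, "f_measure": 0.940,
               "correct": 3951, "incorrect": 253, "kappa": 0.880, "mae": 0.080, "time_s": 0.01},
}

HIGHER_IS_BETTER = ["accuracy", "precision", "recall", "f_measure", "correct", "kappa"]
LOWER_IS_BETTER = ["incorrect", "mae", "time_s"]


def count_wins(results):
    """Give each algorithm one point per measure on which it is best (ties share the point)."""
    wins = {name: 0 for name in results}
    for metric in HIGHER_IS_BETTER + LOWER_IS_BETTER:
        pick = max if metric in HIGHER_IS_BETTER else min
        best_value = pick(row[metric] for row in results.values())
        for name, row in results.items():
            if row[metric] == best_value:
                wins[name] += 1
    return wins


wins = count_wins(RESULTS)
best_model = max(wins, key=wins.get)
print(wins, "->", best_model)
```

Under this assumed scoring, the support vector machine collects the most points, which agrees with the best algorithm reported for the proposed method; its one clear disadvantage is build time.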
3. Results
3.1. The COVID-19 Symptoms and Presence Dataset
3.2. Modelling
3.2.1. Hyperparameter Optimization
3.2.2. Results for Comparative Analysis
4. Discussion
5. Conclusions
- Individuals can easily check the possibility of acquiring COVID-19 based on symptoms;
- This study can be used as a preliminary patient assessment for medical practitioners;
- Businesses can restrict physical contact with customers who may have COVID-19;
- This study can serve as an additional self-management tool for quarantine facilities to monitor whether a person has developed COVID-19 symptoms while in isolation;
- The community and government can use this study as a tool to reduce the spread of the virus through early detection of COVID-19.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- World Health Organization (WHO). Coronavirus 2021. Available online: https://www.who.int/health-topics/coronavirus (accessed on 23 May 2021).
- Temgoua, M.N.; Endomba, F.T.; Nkeck, J.R.; Kenfack, G.U.; Tochie, J.N.; Essouma, M. Coronavirus Disease 2019 (COVID-19) as a Multi-Systemic Disease and its Impact in Low- and Middle-Income Countries (LMICs). SN Compr. Clin. Med. 2020, 2, 1377–1387.
- Ames, H. How Long Does Coronavirus Last in the Body, Air, and in Food? Available online: https://www.medicalnewstoday.com/articles/how-long-does-coronavirus-last (accessed on 11 June 2020).
- Worldometer. COVID Live Update, 29 June 2021. Available online: https://www.worldometers.info/coronavirus/ (accessed on 29 June 2021).
- Centers for Disease Control and Prevention (CDC). SARS-Cov-2 Variant Classifications and Definitions, 17 May 2021. Available online: https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/variant-surveillance/variant-info.html (accessed on 23 May 2021).
- Wynants, L.; Van Calster, B.; Collins, G.S.; Riley, R.D.; Heinze, G.; Schuit, E.; Bonten, M.M.J.; Dahly, D.L.; Damen, J.A.; Debray, T.P.A.; et al. Prediction models for diagnosis and prognosis of COVID-19: Systematic review and critical appraisal. BMJ 2020, 369, m1328.
- Supervised vs. Unsupervised Learning: Key Differences. Available online: https://www.guru99.com/supervised-vs-unsupervised-learning.html (accessed on 27 May 2021).
- Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement Learning: A Survey. J. Artif. Intell. Res. 1996, 4, 237–285.
- Abdar, M.; Ksiazek, W.; Acharya, U.R.; Tan, R.-S.; Makarenkov, V.; Plawiak, P. A new machine learning technique for an accurate diagnosis of coronary artery disease. Comput. Methods Programs Biomed. 2019, 179, 104992.
- Jinny, V.; Priya, R.L. Prediction Model for Respiratory Diseases Using Machine Learning Algorithms. Int. J. Adv. Sci. Technol. 2020, 29, 10083–10092.
- Asri, H.; Mousannif, H.; Al Moatassime, H.; Noel, T. Using Machine Learning Algorithms for Breast Cancer Risk Prediction and Diagnosis. Procedia Comput. Sci. 2016, 83, 1064–1069.
- Sisodia, D.; Sisodia, D.S. Prediction of Diabetes using Classification Algorithms. Procedia Comput. Sci. 2018, 132, 1578–1585.
- Bansal, D.; Chhikara, R.; Khanna, K.; Goopta, P. Comparative Analysis of Various Machine Learning Algorithms for Detecting Dementia. Procedia Comput. Sci. 2018, 132, 1497–1502.
- Rahman, A.S.; Shamrat, F.J.M.; Tasnim, Z.; Roy, J.; Hosain, S.A. A Comparative Study On Liver Disease Prediction Using Supervised Machine Learning Algorithms. Int. J. Sci. Technol. Res. 2019, 8, 419–422.
- Turabieh, H.; Karaa, W.B.A. Predicting the existence of COVID-19 using machine learning based on laboratory findings. In Proceedings of the 2021 International Conference of Women in Data Science at Taif University, Taif, Saudi Arabia, 30–31 March 2021.
- Luo, J.; Zhou, L.; Feng, Y.; Bo, L.; Guo, S. The selection of indicators from initial blood routine test results to improve the accuracy of early prediction of COVID-19 severity. PLoS ONE 2021, 16, e0253329.
- Rangarajan, A.; Krishnaswamy, R.; Krishnan, H. A preliminary analysis of AI based smartphone application for diagnosis of COVID-19 using chest X-ray images. Expert Syst. Appl. 2021, 183, 1–11.
- Yan, L.; Zhang, H.-T.; Goncalves, J.; Xiao, Y.; Wang, M.; Guo, Y.; Sun, C.; Tang, X.; Jing, L.; Zhang, M.; et al. An interpretable mortality prediction model for COVID-19 patients. Nat. Mach. Intell. 2020, 2, 283–288.
- Khalilpourazari, S.; Doulabi, H.H. Robust modelling and prediction of the COVID-19 pandemic in Canada. Int. J. Prod. Res. 2021, 1–17.
- Majumder, P. Chapter 10-Daily confirmed cases and deaths prediction of novel coronavirus in Asian continent Polynomial Neural Network. In Biomedical Engineering Tools for Management for Patients with COVID-19; Academic Press: Cambridge, MA, USA, 2021; pp. 163–172.
- Sanchez-Caballero, S.; Selles, M.A.; Peydro, M.A.; Perez-Bernabeu, E. An Efficient COVID-19 Prediction Model Validated with the Cases of China, Italy and Spain: Total or Partial Lockdowns? J. Clin. Med. 2020, 9, 1547.
- Weka 3-Data Mining with Open Source Machine Learning Software in Java. Available online: https://www.cs.waikato.ac.nz/ml/weka/ (accessed on 27 May 2021).
- Kaggle. Symptoms and COVID Presence, 18 August 2020. Available online: https://www.kaggle.com/hemanthhari/symptoms-and-covid-presence/metadata (accessed on 27 May 2021).
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- D’Angela, A. Why Weight? The Importance of Training on Balanced Datasets. Available online: https://towardsdatascience.com/why-weight-the-importance-of-training-on-balanced-datasets-f1e54688e7df (accessed on 18 June 2021).
- Brownlee, J. How to Use Classification Machine Learning Algorithms in Weka. Available online: https://machinelearningmastery.com/use-classification-machine-learning-algorithms-weka/ (accessed on 18 June 2021).
- Quinlan, J.R. Learning decision tree classifiers. ACM Comput. Surv. 1996, 28, 71–72.
- Kumar, N.; Khatri, S. Implementing WEKA for medical data classification and early disease prediction. In Proceedings of the 2017 3rd International Conference on Computational Intelligence & Communication Technology (CICT), Ghaziabad, India, 9–10 February 2017; Institute of Electrical and Electronics Engineers (IEEE): New York, NY, USA, 2017; pp. 1–6.
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
- Breiman, L. Bagging Predictors. Mach. Learn. 1996, 24, 123–140.
- Shah, C.; Jivani, A. Comparison of Data Mining Classification Algorithms for Breast Cancer Prediction. In Proceedings of the 4th ICCCNT 2013, Tiruchengode, India, 4–6 July 2013.
- Delizo, J.P.D.; Abisado, M.B.; Trinos, M.I.D. Philippine Twitter Sentiments during Covid-19 Pandemic using Multinomial Naïve-Bayes. Int. J. Adv. Trends Comput. Sci. Eng. 2020, 9, 408–412.
- Routledge, R. Bayes’s Theorem, 17 February 2018. Available online: https://www.britannica.com/topic/Bayess-theorem (accessed on 13 June 2021).
- Villavicencio, C.N.; Macrohon, J.J.E.; Inbaraj, X.; Jeng, J.-H.; Hsieh, J.-G. Twitter Sentiment Analysis towards COVID-19 Vaccines using Naive Bayes. Information 2021, 12, 204.
- Chapelle, O.; Haffner, P.; Vapnik, V.N. Support Vector Machines for Histogram-Based Image Classification. IEEE Trans. Neural Netw. 1999, 10, 1055–1064.
- Vapnik, V. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995.
- Sun, S.; Huang, R. An Adaptive k-Nearest Neighbor Algorithm. In Proceedings of the Seventh International Conference on Fuzzy Systems and Knowledge Discovery, Yantai, China, 10–12 August 2010.
- Raschka, S. What Is Euclidean Distance in Terms of Machine Learning? 2021. Available online: https://sebastianraschka.com/faq/docs/euclidean-distance.html (accessed on 28 May 2021).
- Ghoneim, S. Accuracy, Recall, Precision, F-Score & Specificity, Which to Optimize on? 2 April 2019. Available online: https://towardsdatascience.com/accuracy-recall-precision-f-score-specificity-which-to-optimize-on-867d3f11124 (accessed on 25 May 2021).
- Shah, M. IS Accuracy and Correctly Classified Instances Are Same. If Same Then Their Formulas Will Also Be Same Using Weka? 30 March 2017. Available online: https://www.researchgate.net/post/IS-accuracy-and-correctly-classified-instances-are-same-if-same-then-therir-formulas-will-also-be-same-using-weka (accessed on 25 May 2021).
- Pykes, K. Cohen’s Kappa, 27 February 2020. Available online: https://towardsdatascience.com/cohens-kappa-9786ceceab58 (accessed on 25 May 2021).
- Glen, S. Absolute Error & Mean Absolute Error (MAE), 25 October 2016. Available online: https://www.statisticshowto.com/absolute-error/ (accessed on 25 May 2021).
- Brownlee, J. Classification Accuracy Is Not Enough: More Performance Measures You Can Use, Machine Learning Mastery, 20 June 2019. Available online: https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/ (accessed on 27 May 2021).
- What Is the Influence of C in SVMs with Linear Kernel? Stack Exchange, 23 June 2012. Available online: https://stats.stackexchange.com/questions/31066/what-is-the-influence-of-c-in-svms-with-linear-kernel (accessed on 18 June 2021).
Reference | Disease Prediction | Machine Learning Algorithms | Best Algorithm | Accuracy Obtained |
---|---|---|---|---|
[9] | Coronary Artery Disease | LR, DT, RF, SVM, k-NN, ANN | ANN | 93.03% |
[10] | Respiratory Disease | RF, DT, MLP | DT | 99.41% |
[11] | Breast Cancer | SVM, DT, NB, and k-NN | SVM | 97.13% |
[12] | Diabetes | DT, SVM, and NB | NB | 76.30% |
[13] | Dementia | J48 DT, NB, RF, and MLP | J48 DT | 99.23% |
[14] | Fatty Liver Disease | LR, k-NN, DT, SVM, NB, and RF | LR | 75.00% |
Proposed Method | COVID-19 | J48 DT, NB, SVM, k-NN, and RF | SVM | 98.81% |
Attribute Name | Type | Percentage Level | Description |
---|---|---|---|
Breathing Problem | Nominal | 10% | The person is experiencing shortness of breath. |
Fever | Nominal | 10% | Temperature is above normal. |
Dry Cough | Nominal | 10% | Continuous coughing without phlegm. |
Sore Throat | Nominal | 10% | The person is experiencing sore throat. |
Running Nose | Nominal | 5% | The person is experiencing a runny nose. |
Asthma | Nominal | 4% | The person has asthma. |
Chronic Lung Disease | Nominal | 6% | The person has lung disease. |
Headache | Nominal | 4% | The person is experiencing headache. |
Heart Disease | Nominal | 2% | The person has cardiovascular disease. |
Diabetes | Nominal | 1% | The person is suffering from or has a history of diabetes. |
Hypertension | Nominal | 1% | Having a high blood pressure. |
Fatigue | Nominal | 2% | The person is experiencing tiredness. |
Gastrointestinal | Nominal | 1% | Having some gastrointestinal problems. |
Abroad Travel | Nominal | 8% | Recently went out of the country. |
Contact with COVID-19 Patient | Nominal | 8% | Had some close contact with people infected with COVID-19. |
Attended Large Gathering | Nominal | 6% | The person or anyone from their family recently attended a mass gathering. |
Visited Public Exposed Places | Nominal | 4% | Recently visited malls, temples, and other public places. |
Family Working in Public Exposed Places | Nominal | 4% | The person or anyone in their family is working in a market, hospital, or another crowded place. |
Wearing Masks | Nominal | 2% | The person is wearing face masks properly. |
Sanitation from Market | Nominal | 2% | Sanitizing products bought from market before use. |
COVID-19 | Nominal | - | The presence of COVID-19. |
Dataset Number | Technique Used | No. of Instances from the Yes Class | No. of Instances from the No Class |
---|---|---|---|
1 | - | 4383 | 1051 |
2 | SMOTE | 4383 | 2102 |
3 | Spread Subsample | 2102 | 2102 |
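The class counts in this table can be reproduced with simple bookkeeping. The sketch below assumes SMOTE was applied at 100% (doubling the minority class) and that Spread Subsample was set to a 1:1 class ratio; both settings are inferred from the counts rather than stated in the table, and the function and parameter names are hypothetical. It tracks counts only and does not synthesize or sample any instances.

```python
def smote_minority_size(minority, percentage=100.0):
    """Minority-class size after adding `percentage`% synthetic instances."""
    return minority + round(minority * percentage / 100.0)


def spread_subsample(counts, max_spread=1.0):
    """Cap every class at max_spread times the size of the rarest class."""
    cap = int(min(counts.values()) * max_spread)
    return {label: min(n, cap) for label, n in counts.items()}


original = {"Yes": 4383, "No": 1051}                               # dataset 1
after_smote = dict(original, No=smote_minority_size(original["No"]))  # dataset 2
balanced = spread_subsample(after_smote)                            # dataset 3

print(original, after_smote, balanced, sep="\n")
```

The balanced dataset holds 4204 instances in total, which matches the sum of correctly and incorrectly classified instances reported for every algorithm in the results.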
Training Number | Confidence Factor | Unpruned | Accuracy |
---|---|---|---|
1 | 0.25 | True | 98.57% |
2 | 0.50 | True | 98.57% |
3 | 0.75 | True | 98.57% |
4 | 0.25 | False | 98.45% |
5 | 0.50 | False | 98.55% |
6 | 0.75 | False | 98.55% |
Training Number | Bag Size | Accuracy |
---|---|---|
1 | 100 | 98.81% |
2 | 75 | 98.81% |
3 | 50 | 98.79% |
Training Number | C | Kernel | Accuracy |
---|---|---|---|
1 | 1 | Poly Kernel | 95.48% |
2 | 2 | Poly Kernel | 95.34% |
3 | 3 | Poly Kernel | 95.22% |
4 | 1 | Normalized Poly Kernel | 94.84% |
5 | 2 | Normalized Poly Kernel | 95.34% |
6 | 3 | Normalized Poly Kernel | 95.39% |
7 | 1 | Pearson VII | 98.81% |
8 | 2 | Pearson VII | 98.81% |
9 | 3 | Pearson VII | 98.81% |
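Reading the best setting off this table can be expressed as a small selection step. The sketch below picks the run with the highest accuracy and, as an assumed tie-break that the table does not specify, prefers the smallest C among equally accurate runs.

```python
# (C, kernel, accuracy in %) rows copied from the SVM tuning table above.
SVM_RUNS = [
    (1, "Poly Kernel", 95.48),
    (2, "Poly Kernel", 95.34),
    (3, "Poly Kernel", 95.22),
    (1, "Normalized Poly Kernel", 94.84),
    (2, "Normalized Poly Kernel", 95.34),
    (3, "Normalized Poly Kernel", 95.39),
    (1, "Pearson VII", 98.81),
    (2, "Pearson VII", 98.81),
    (3, "Pearson VII", 98.81),
]


def pick_best(runs):
    """Highest accuracy first; ties go to the smaller C (assumed tie-break)."""
    return min(runs, key=lambda run: (-run[2], run[0]))


best_c, best_kernel, best_accuracy = pick_best(SVM_RUNS)
print(best_c, best_kernel, best_accuracy)
```

All three Pearson VII runs reach the same accuracy, so on this dataset the kernel choice matters far more than the value of C.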
Training Number | KNN | Cross Validate | Accuracy |
---|---|---|---|
1 | 1 | True | 98.69% |
2 | 3 | True | 98.69% |
3 | 7 | True | 98.69% |
4 | 1 | False | 98.69% |
5 | 3 | False | 97.57% |
6 | 7 | False | 94.53% |
Training Number | Use Kernel Estimator | Supervised Discretization | Accuracy |
---|---|---|---|
1 | False | False | 93.98% |
2 | True | False | 93.98% |
3 | False | True | 93.98% |
Algorithm | Accuracy | Precision | Recall | F-Measure |
---|---|---|---|---|
J48 DT | 98.57% | 0.986 | 0.986 | 0.986 |
RF | 98.81% | 0.988 | 0.988 | 0.988 |
SVM | 98.81% | 0.988 | 0.988 | 0.988 |
k-NN | 98.69% | 0.987 | 0.987 | 0.987 |
NB | 93.98% | 0.940 | 0.940 | 0.940 |
Algorithm | Correctly Classified Instances | Incorrectly Classified Instances | Kappa Statistic | Mean Absolute Error | Time in Seconds |
---|---|---|---|---|---|
J48 DT | 4144 | 60 | 0.972 | 0.024 | 0.03 |
RF | 4154 | 50 | 0.976 | 0.023 | 0.18 |
SVM | 4154 | 50 | 0.976 | 0.012 | 3.12 |
k-NN | 4149 | 55 | 0.973 | 0.022 | 0.01 |
NB | 3951 | 253 | 0.880 | 0.080 | 0.01 |
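The accuracy column of the preceding table follows directly from the instance counts in this one. As a quick consistency check, the sketch below recomputes accuracy as correct divided by total for each algorithm; the helper name is hypothetical.

```python
# (correctly classified, incorrectly classified) copied from the table above.
COUNTS = {
    "J48 DT": (4144, 60),
    "RF": (4154, 50),
    "SVM": (4154, 50),
    "k-NN": (4149, 55),
    "NB": (3951, 253),
}


def accuracy_percent(correct, incorrect):
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100.0 * correct / (correct + incorrect), 2)


recomputed = {name: accuracy_percent(*pair) for name, pair in COUNTS.items()}
for name, value in recomputed.items():
    print(f"{name}: {value}%")
```

Every algorithm was evaluated on the same 4204 instances, so differences in accuracy correspond one-to-one to differences in the number of misclassified instances.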
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Villavicencio, C.N.; Macrohon, J.J.E.; Inbaraj, X.A.; Jeng, J.-H.; Hsieh, J.-G. COVID-19 Prediction Applying Supervised Machine Learning Algorithms with Comparative Analysis Using WEKA. Algorithms 2021, 14, 201. https://doi.org/10.3390/a14070201