Abstract
Most prior research has concluded that machine learning classifiers perform better on cleaned datasets than on dirty ones. In this paper, we evaluate three weak (base) machine learning classifiers, Decision Table, Naive Bayes and k-Nearest Neighbor, to see how they perform on a real-world, noisy and messy clinical trial dataset rather than on a carefully curated one. A clinical trial data scientist guided our data analysis exploration and the evaluation of the performance results. Classifier performance was analysed using accuracy and the Receiver Operating Characteristic (ROC) measure, supported by sensitivity, specificity and precision values, and the results contradict the conclusions drawn by previous research. We applied pre-processing techniques, namely the interquartile range technique to remove outliers and mean imputation to handle missing values; across these treatments, all three classifiers performed better on the dirty dataset than on the imputed and cleaned datasets, showing the highest accuracy and ROC measures. Decision Table turned out to be the best classifier for handling real-world noisy clinical trial data.
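For readers who want to see the preprocessing and evaluation steps named above in concrete form, the sketch below illustrates interquartile-range outlier removal, mean imputation, and 10-fold evaluation by accuracy and ROC. It is a minimal sketch under stated assumptions, not the authors' code: the study was run in WEKA, so scikit-learn stands in here, a shallow decision tree substitutes for WEKA's Decision Table (which has no direct scikit-learn equivalent), and the synthetic DataFrame `df` with binary label column `outcome` is a hypothetical placeholder for the non-public clinical trial data.

```python
# Minimal sketch of the pipeline described in the abstract: IQR outlier
# removal, mean imputation, and 10-fold accuracy / ROC AUC for three weak
# classifiers. scikit-learn stands in for WEKA; a shallow decision tree
# substitutes for WEKA's Decision Table. All data below is synthetic.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def remove_outliers_iqr(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Drop rows outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] on any listed column."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[mask]


def impute_mean(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Replace missing values in each listed column with the column mean."""
    out = df.copy()
    out[cols] = out[cols].fillna(out[cols].mean())
    return out


def evaluate(df: pd.DataFrame, label: str = "outcome") -> None:
    """Report mean 10-fold accuracy and ROC AUC for the three classifiers."""
    X, y = df.drop(columns=label).to_numpy(), df[label].to_numpy()
    models = {
        "Decision Table (tree stand-in)": DecisionTreeClassifier(max_depth=3),
        "Naive Bayes": GaussianNB(),
        "k-NN": KNeighborsClassifier(n_neighbors=5),
    }
    for name, model in models.items():
        acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
        auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
        print(f"{name}: accuracy={acc:.3f}, ROC AUC={auc:.3f}")


if __name__ == "__main__":
    # Tiny synthetic stand-in for the clinical trial data, with some values
    # made missing to mimic a dirty table. Unlike WEKA's classifiers, the
    # scikit-learn estimators reject NaNs, so the "dirty" run uses complete
    # cases rather than rows with missing values.
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
    df["outcome"] = (df["a"] + rng.normal(size=200) > 0).astype(int)
    df.loc[rng.choice(200, 20, replace=False), "b"] = np.nan

    features = list("abcd")
    evaluate(df.dropna())                # "dirty" dataset (complete cases)
    evaluate(impute_mean(df, features))  # imputed dataset
    evaluate(remove_outliers_iqr(impute_mean(df, features), features))  # clean
```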
Acknowledgement
This paper forms part of a Master's dissertation written at the University of Manchester, UK. We would like to thank the data scientists from the Advanced Analytics Centre, AstraZeneca, Alderley Park, Cheshire, UK for their review, support and suggestions on this study.
Copyright information
© 2016 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kamaru-Zaman, E.A., Brass, A., Weatherall, J., Rahman, S.A. (2016). Weak Classifiers Performance Measure in Handling Noisy Clinical Trial Data. In: Berry, M., Hj. Mohamed, A., Yap, B. (eds) Soft Computing in Data Science. SCDS 2016. Communications in Computer and Information Science, vol 652. Springer, Singapore. https://doi.org/10.1007/978-981-10-2777-2_13
DOI: https://doi.org/10.1007/978-981-10-2777-2_13
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2776-5
Online ISBN: 978-981-10-2777-2
eBook Packages: Computer Science (R0)