Abstract
The development of Intrusion Detection Systems using Machine Learning techniques (ML-based IDS) has emerged as an important research topic in the cybersecurity field. However, there is a noticeable absence of systematic studies on the usability of such systems in real-world applications. This paper analyzes the impact of data preprocessing techniques on the performance of ML-based IDS using two public datasets, UNSW-NB15 and CIC-IDS2017. Specifically, we evaluate the effects of data cleaning, encoding, and normalization techniques on the performance of binary and multiclass intrusion detection models. Two questions guide the work: whether data preprocessing techniques affect the performance of ML-based IDS, and how the performance of different ML-based IDS is affected by those techniques. To answer them, we implemented a machine learning pipeline that applies the data preprocessing techniques in different scenarios. The findings, analyzed using the Friedman statistical test and the Nemenyi post-hoc test, revealed significant differences among groups of data preprocessing techniques and among ML-based IDS, according to the evaluation metrics. However, these differences were not observed for data preprocessing techniques in multiclass scenarios. Additionally, ML-based IDS exhibited varying performance in binary and multiclass classification. Our investigation therefore presents insights into the efficacy of different data preprocessing techniques for building robust and accurate intrusion detection models.
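The statistical comparison named above can be illustrated with a minimal sketch. It is not the authors' implementation: the score table is invented purely for illustration (hypothetical F1-scores of three normalization options over six evaluation scenarios), the function names are hypothetical, and the formulas and critical values follow Demšar's formulation of the Friedman test and the Nemenyi critical difference. Ties are ranked by their average, and no tie correction is applied to the statistic.

```python
import math

# Critical values q_alpha (alpha = 0.05) for the Nemenyi test, as tabulated
# by Demsar (2006) for k = 2..5 compared groups.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def average_ranks(scores):
    """Rank each row (1 = highest score), averaging ties,
    and return the mean rank of every column."""
    n_rows, n_cols = len(scores), len(scores[0])
    rank_sums = [0.0] * n_cols
    for row in scores:
        order = sorted(range(n_cols), key=lambda j: -row[j])
        i = 0
        while i < n_cols:
            # Find the run of tied values starting at position i.
            j = i
            while j + 1 < n_cols and row[order[j + 1]] == row[order[i]]:
                j += 1
            tied_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
            for pos in range(i, j + 1):
                rank_sums[order[pos]] += tied_rank
            i = j + 1
    return [s / n_rows for s in rank_sums]


def friedman_statistic(mean_ranks, n_rows):
    """Friedman chi-square statistic computed from the mean ranks."""
    k = len(mean_ranks)
    return 12 * n_rows / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )


def nemenyi_critical_difference(k, n_rows):
    """Mean-rank gap above which two groups differ significantly."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6 * n_rows))


# HYPOTHETICAL data, not results from the paper.
# Rows: evaluation scenarios; columns: normalization options.
techniques = ["none", "min-max", "z-score"]
scores = [
    [0.81, 0.90, 0.88],
    [0.79, 0.86, 0.87],
    [0.84, 0.91, 0.89],
    [0.80, 0.88, 0.85],
    [0.77, 0.85, 0.83],
    [0.82, 0.89, 0.90],
]

mean_ranks = average_ranks(scores)
stat = friedman_statistic(mean_ranks, len(scores))
cd = nemenyi_critical_difference(len(techniques), len(scores))

print("mean ranks:", dict(zip(techniques, mean_ranks)))
print("Friedman statistic:", stat)
print("Nemenyi critical difference:", cd)

# Pairwise post-hoc comparison: a pair differs if its rank gap exceeds cd.
for a in range(len(techniques)):
    for b in range(a + 1, len(techniques)):
        gap = abs(mean_ranks[a] - mean_ranks[b])
        print(techniques[a], "vs", techniques[b], "significant:", gap > cd)
```

With these made-up scores the statistic is about 9.33, above the chi-square critical value of 5.991 for two degrees of freedom at α = 0.05, and only the gap between "none" and "min-max" exceeds the critical difference of about 1.35.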
Data Availability
All materials used in this manuscript are public, and no permission is required. The results and data in this manuscript have not been published elsewhere.
Code Availability
All materials used in this manuscript are public, and no permission is required. Additional materials for this article are available at the following link: https://github.com/kelsonc/evaluation-preprocessing-methods
Notes
Scikit-learn. <https://scikit-learn.org/> Accessed on 16 May 2023.
Pandas. <https://pandas.pydata.org/> Accessed on 16 May 2023.
ACSC, Australian Cyber Security Centre. <https://www.cyber.gov.au/> Accessed on 16 May 2023.
Bro (now Zeek). <https://zeek.org/> Accessed on 16 May 2023.
Argus. <https://openargus.org/> Accessed on 16 May 2023.
TCPDump. <https://www.tcpdump.org/> Accessed on 16 May 2023.
CIC, Canadian Institute for Cybersecurity. <https://www.unb.ca/cic/> Accessed on 16 May 2023.
CICFlowMeter. <https://github.com/CanadianInstituteForCybersecurity/CICFlowMeter> Accessed on 16 May 2023.
Overfitting is a prevalent issue in machine learning, wherein the model effectively learns from the training data but also captures the inherent noise. As a result, the overfitted model tends to perform poorly on new data. This is because the model has memorized the training set instead of discerning general patterns applicable to novel instances [44].
Underfitting occurs when the model lacks the complexity needed to capture the nuances and intricacies of the dataset. As a result, the model fails to fit the training data adequately, leading to heightened rates of false positives and false negatives. This limitation compromises the model’s ability to detect and classify network intrusions accurately [44].
Google LLC, Kaggle. <https://www.kaggle.com/> Accessed on 16 May 2023.
References
International Telecommunication Union: Global Cybersecurity Index 2020: Measuring Commitment to Cybersecurity, 1st edn. ITU Publications, Geneva (2021)
Sarker, I.H., Kayes, A., Badsha, S., Alqahtani, H., Watters, P., Ng, A.: Cybersecurity data science: an overview from machine learning perspective. J. Big Data 7(1), 1–29 (2020). https://doi.org/10.1186/s40537-020-00318-5
Szczypiorski, K.: Cybersecurity and data science. Electronics 11(15), 1–4 (2022). https://doi.org/10.3390/electronics11152309
Hajj, S., El Sibai, R., Bou Abdo, J., Demerjian, J., Makhoul, A., Guyeux, C.: Anomaly-based intrusion detection systems: the requirements, methods, measurements, and datasets. Trans. Emerging Telecommun. Technol. 32(4), 1–36 (2021). https://doi.org/10.1002/ett.4240
Putra, W., Huang, J.J.: A survey of intrusion detection system. Int. J. Informatics Comput. 1(1), 1–19 (2019). https://doi.org/10.35842/ijicom.v1i1.7
Hubballi, N., Suryanarayanan, V.: False alarm minimization techniques in signature-based intrusion detection systems: a survey. Comput. Commun. 49, 1–17 (2014). https://doi.org/10.1016/j.comcom.2014.04.012
Skopik, F., Wurzenberger, M., Landauer, M.: The seven golden principles of effective anomaly-based intrusion detection. IEEE Secur. Privacy 19(5), 36–45 (2021). https://doi.org/10.1109/MSEC.2021.3090444
Kunal, Dua, M.: Machine Learning Approach to IDS: A Comprehensive Review. Paper presented at the 3rd International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 12–14 June 2019 (2019). https://doi.org/10.1109/ICECA.2019.8822120
Thakkar, A., Lohiya, R.: A survey on intrusion detection system: feature selection, model, performance measures, application perspective, challenges, and future research directions. Artif. Intell. Rev. 55, 453–563 (2022). https://doi.org/10.1007/s10462-021-10037-9
Ahmad, T., Aziz, M.N.: Data preprocessing and feature selection for machine learning intrusion detection systems. ICIC Express Lett 13(2), 93–101 (2019). https://doi.org/10.24507/icicel.13.02.93
Obaid, H.S., Dheyab, S.A., Sabry, S.S.: The Impact of Data Pre-Processing Techniques and Dimensionality Reduction on the Accuracy of Machine Learning. Paper presented at the 9th Annual Information Technology, Electromechanical Engineering and Microelectronics Conference (IEMECON), Jaipur, India, 13–15 March 2019 (2019). https://doi.org/10.1109/IEMECONX.2019.8877011
Li, C.: Preprocessing methods and pipelines of data mining: An overview. arXiv preprint, 1–7 (2019). https://doi.org/10.48550/arXiv.1906.08510
Paulauskas, N., Auskalnis, J.: Analysis of data pre-processing influence on intrusion detection using NSL-KDD dataset (2017). https://doi.org/10.1109/eStream.2017.7950325
Davis, J.J., Clark, A.J.: Data preprocessing for anomaly based network intrusion detection: a review. Comput. Secur. 30(6), 353–375 (2011). https://doi.org/10.1016/j.cose.2011.05.008
Magán-Carrión, R., Urda, D., Diaz-Cano, I., Dorronsoro, B.: Towards a reliable comparison and evaluation of network intrusion detection systems based on machine learning approaches. Appl. Sci. 10(5), 1–21 (2020). https://doi.org/10.3390/app10051775
Magán-Carrión, R., Urda, D., Diaz-Cano, I., Dorronsoro, B.: Improving the reliability of network intrusion detection systems through dataset integration. IEEE Trans. Emerging Topics Comput. 10(4), 1717–1732 (2022). https://doi.org/10.1109/TETC.2022.3178283
Singh, D., Singh, B.: Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 97, 1–23 (2020). https://doi.org/10.1016/j.asoc.2019.105524
Molina-Coronado, B., Mori, U., Mendiburu, A., Miguel-Alonso, J.: Survey of network intrusion detection methods from the perspective of the knowledge discovery in databases process. IEEE Trans. Network Serv. Manag. 17(4), 2451–2479 (2020). https://doi.org/10.1109/TNSM.2020.3016246
Zebari, R., Abdulazeez, A., Zeebaree, D., Zebari, D., Saeed, J.: A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction. J. Appl. Sci. Technol. Trends 1(2), 56–70 (2020). https://doi.org/10.38094/jastt1224
Chou, D., Jiang, M.: A survey on data-driven network intrusion detection. ACM Comput. Surv. (CSUR) 54(9), 1–36 (2021). https://doi.org/10.1145/3472753
Al-Utaibi, K.A., El-Alfy, E.M.: Intrusion detection taxonomy and data preprocessing mechanisms. J. Intell. Fuzzy Syst. 34(3), 1369–1383 (2018). https://doi.org/10.3233/JIFS-16943
Rokach, L.: Ensemble-based classifiers. Artif. Intell. Rev. 33, 1–39 (2010). https://doi.org/10.1007/s10462-009-9124-7
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
Jolliffe, I.T., Cadima, J.: Principal component analysis: a review and recent developments. Philos. Trans. Royal Soc. A: Math. Phys. Eng. Sci. 374(2065), 1–16 (2016). https://doi.org/10.1098/rsta.2015.0202
Izenman, A.J.: Linear discriminant analysis. In: Modern Multivariate Statistical Techniques, pp. 237–280. Springer, New York (2013). https://doi.org/10.1007/978-0-387-78189-1_8
Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A.: A detailed analysis of the KDD CUP 99 data set. Paper presented at the 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, Ottawa, ON, Canada, 8–10 July 2009 (2009). https://doi.org/10.1109/CISDA.2009.5356528
Moustafa, N., Slay, J.: UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). Paper presented at the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, ACT, Australia, 10–12 November 2015 (2015). https://doi.org/10.1109/MilCIS.2015.7348942
Song, J., Takakura, H., Okabe, Y., Eto, M., Inoue, D., Nakao, K.: Statistical analysis of honeypot data and building of Kyoto 2006+ dataset for NIDS evaluation. Paper presented at the First Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (BADGERS), Salzburg, Austria, 10 April 2011 (2011). https://doi.org/10.1145/1978672.1978676
Kennedy, J., Eberhart, R.: Particle swarm optimization. Paper presented at the International Conference on Neural Networks (ICNN’95), Perth, WA, Australia, 27 November – 01 December 1995 (1995). https://doi.org/10.1109/ICNN.1995.488968
Hall, M.A.: Correlation-based feature selection for machine learning. PhD thesis, University of Waikato, Department of Computer Science, Hamilton, New Zealand (1999). Thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy
García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sci. 180(10), 2044–2064 (2010). https://doi.org/10.1016/j.ins.2009.12.010
Güney, H.: Preprocessing impact analysis for machine learning-based network intrusion detection. Sakarya Univ. J. Comput. Information Sci. 6(1), 67–79 (2023). https://doi.org/10.35377/saucis...1223054
Liu, Q., Chen, C., Zhang, Y., Hu, Z.: Feature selection for support vector machines with rbf kernel. Artif. Intell. Rev. 36(2), 99–115 (2011). https://doi.org/10.1007/s10462-011-9205-2
Ketepalli, G., Bulla, P.: Data Preparation and Pre-processing of Intrusion Detection Datasets using Machine Learning. Paper presented at the 2023 International Conference on Inventive Computation Technologies (ICICT), Lalitpur, Nepal, 26–28 April 2023 (2023). https://doi.org/10.1109/ICICT57646.2023.10134025
Symeonidis, S., Effrosynidis, D., Arampatzis, A.: A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis. Exp. Syst. Appl. 110, 298–310 (2018). https://doi.org/10.1016/j.eswa.2018.06.022
Chowdhary, K.: Natural language processing. In: Fundamentals of Artificial Intelligence, pp. 603–649. Springer, New York (2020). https://doi.org/10.1007/978-81-322-3972-7_19
Frye, M., Mohren, J., Schmitt, R.H.: Benchmarking of data preprocessing methods for machine learning-applications in production. Procedia CIRP 104, 50–55 (2021). https://doi.org/10.1016/j.procir.2021.11.009
Maseer, Z.K., Yusof, R., Bahaman, N., Mostafa, S.A., Foozy, C.F.M.: Benchmarking of machine learning for anomaly based intrusion detection systems in the CICIDS2017 dataset. IEEE Access 9, 22351–22370 (2021). https://doi.org/10.1109/ACCESS.2021.3056614
Ring, M., Wunderlich, S., Scheuring, D., Landes, D., Hotho, A.: A survey of network-based intrusion detection data sets. Comput. Secur. 86, 147–167 (2019). https://doi.org/10.1016/j.cose.2019.06.005
Claise, B., Trammell, B., Aitken, P.: Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of Flow Information (RFC7011). Retrieved from https://www.rfc-editor.org/info/rfc7011. Accessed on 16 May 2023 (2013). https://doi.org/10.17487/rfc7011
Sharafaldin, I., Lashkari, A.H., Ghorbani, A.A.: Toward generating a new intrusion detection dataset and intrusion traffic characterization. Paper presented at the 4th International Conference on Information Systems Security and Privacy (ICISSP), Funchal, Madeira, Portugal, 22–24 January 2018 (2018). https://doi.org/10.5220/0006639801080116
Hindy, H., Brosset, D., Bayne, E., Seeam, A.K., Tachtatzis, C., Atkinson, R., Bellekens, X.: A taxonomy of network threats and the effect of current datasets on intrusion detection systems. IEEE Access 8, 104650–104675 (2020). https://doi.org/10.1109/ACCESS.2020.3000179
Gharib, A., Sharafaldin, I., Lashkari, A.H., Ghorbani, A.A.: An Evaluation Framework for Intrusion Detection Dataset. Paper presented at the International Conference on Information Science and Security (ICISS 2016), Pattaya, Thailand, 19–22 December 2016 (2016). https://doi.org/10.1109/ICISSEC.2016.7885840
Koehrsen, W.: Overfitting vs. underfitting: a complete example. Towards Data Sci. 405, 1–12 (2018)
Kramer, O.: Dimensionality Reduction with Unsupervised Nearest Neighbors, 1st edn. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38652-7
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/A:1010933404324
Chen, T., Guestrin, C.: XGBoost: A scalable tree boosting system. Paper presented at the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, USA, 13–17 August 2016 (2016). https://doi.org/10.1145/2939672.2939785
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.: LightGBM: A highly efficient gradient boosting decision tree. Paper presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017 (2017). [s.n.]
Resende, P.A.A., Drummond, A.C.: A survey of random forest based methods for intrusion detection systems. ACM Comput. Surv. (CSUR) 51(3), 1–36 (2018). https://doi.org/10.1145/3178582
Martínez Torres, J., Iglesias Comesaña, C., García-Nieto, P.J.: Review: machine learning techniques applied to cybersecurity. Int. J. Mach. Learn. Cybern. 10, 2823–2836 (2019). https://doi.org/10.1007/s13042-018-00906-1
Nemenyi, P.B.: Distribution-free multiple comparisons. PhD thesis, Princeton University, Department of Mathematics, Princeton, New Jersey, US (1963). Thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy
Funding
The authors thank the State of Minas Gerais Research Support Foundation-FAPEMIG (Grant APQ-02196-18) for financial support. The authors also acknowledge the financial support of the Brazilian National Council for Scientific and Technological Development (CNPq), Grant 421944/2021-8.
Author information
Contributions
The authors contributed equally to this work.
Ethics declarations
Conflict of interest
The authors have no competing interests or other interests that might be perceived to influence the results or discussion reported in this paper.
Ethical Approval
This manuscript adheres to the principles and policies of authorship ethics.
Consent to Participate and Publication
All authors read and approved the final manuscript for publication via the subscription publishing route.
About this article
Cite this article
Santos, K.C., Miani, R.S. & de Oliveira Silva, F. Evaluating the Impact of Data Preprocessing Techniques on the Performance of Intrusion Detection Systems. J Netw Syst Manage 32, 36 (2024). https://doi.org/10.1007/s10922-024-09813-z
DOI: https://doi.org/10.1007/s10922-024-09813-z