research-article

Combined oversampling and undersampling method based on slow-start algorithm for imbalanced network traffic

Published: 01 March 2021

Abstract

Network traffic data typically consist of a large amount of normal traffic and a small amount of attack traffic. This class imbalance degrades prediction performance, for example through biased predictions for the minority class and misclassification of normal data as outliers. Representative sampling methods for addressing the imbalance include various oversampling-based models that synthesize minority data. However, because oversampling makes the classifier learn the same data repeatedly, the classification model can overfit the training data. Undersampling methods, in turn, can cause information loss because they remove data. To improve on these oversampling and undersampling approaches, we propose an oversampling ensemble method based on the slow-start algorithm. The proposed combined oversampling and undersampling method based on the slow-start algorithm (COUSS) is modeled on the congestion control algorithm of the Transmission Control Protocol (TCP). Starting from a dataset to which minimal undersampling has been applied, the method oversamples the imbalanced dataset until overfitting occurs. Simulation results obtained using the KDD99 dataset show that the proposed COUSS method improves the F1 score by 8.639%, 6.858%, 5.003%, and 4.074% compared to the synthetic minority oversampling technique (SMOTE), borderline-SMOTE, adaptive synthetic sampling (ADASYN), and generative adversarial network (GAN) oversampling algorithms, respectively. These results suggest that the COUSS method is a practical solution for data analysis applications.
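
The abstract does not spell out the update rule, so the following is only a minimal sketch of the general idea of oversampling in slow-start fashion until overfitting appears. It is not the authors' COUSS implementation. Everything named here is an assumption made for illustration: the function slow_start_schedule, the evaluate callback (imagined to train a classifier with n synthetic minority samples and return a validation score such as F1), the doubling of the window after each improving round (borrowed from TCP slow start), and the use of "validation score stops improving" as the overfitting signal.

```python
from typing import Callable, List, Tuple


def slow_start_schedule(
    evaluate: Callable[[int], float],
    max_synthetic: int,
    initial_window: int = 1,
) -> Tuple[int, List[Tuple[int, float]]]:
    """Choose how many synthetic minority samples to add, slow-start style.

    evaluate(n) is a hypothetical callback: train with n synthetic samples
    added and return a validation score (higher is better).
    Assumed rule: double the window after every improving round and stop at
    the first round that does not improve, treated here as overfitting.
    """
    window = initial_window
    total = 0
    best_total = 0
    best_score = evaluate(0)  # baseline without any oversampling
    history = [(0, best_score)]

    while total < max_synthetic:
        candidate = min(total + window, max_synthetic)
        score = evaluate(candidate)
        history.append((candidate, score))
        if score <= best_score:
            break  # no improvement: keep the last good amount
        best_total, best_score = candidate, score
        total = candidate
        window *= 2  # exponential growth, as in TCP slow start
    return best_total, history


def toy_score(n: int) -> float:
    """Stand-in for a real train-and-validate step; peaks at n = 7."""
    return float(-abs(n - 7))


best, history = slow_start_schedule(toy_score, max_synthetic=100)
print(best)                       # 7
print([n for n, _ in history])    # [0, 1, 3, 7, 15]
```

In a real experiment the toy_score stand-in would be replaced by a function that applies an oversampler to the minimally undersampled training set, fits the classifier, and scores it on held-out data. The sketch stops at the first non-improving round; a congestion-avoidance-style variant could instead shrink the window and continue with smaller steps.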



Published In

Computing, Volume 103, Issue 3
Mar 2021
174 pages

Publisher

Springer-Verlag
Berlin, Heidelberg

Publication History

Published: 01 March 2021
Accepted: 12 October 2020
Received: 13 September 2020

Author Tags

1. Machine learning
2. Oversampling
3. Undersampling
4. Imbalanced data
5. TCP
6. KDD99

Mathematics Subject Classification

1. 68T20
2. 68P01
3. 68M20
4. 65Y04
