Abstract
Recently, a variety of mobile security threats have emerged due to the exponential growth in mobile technologies. Various techniques have been developed to address the risks associated with malware. The most popular approach to detecting Android malware is signature-based. Its drawback is that it cannot detect unknown malware. To overcome this limitation, machine learning has been adopted for detecting and classifying malicious applications. Conventional machine learning algorithms focus on optimizing classification accuracy. However, imbalanced real-life datasets cause traditional classification algorithms to perform poorly in classifying malicious apps. To handle the problem of imbalanced family classification of malicious applications, we propose a Cost-Sensitive Forest (CSForest) method, which consists of a group of decision trees. A cost-sensitive voting technique is used for prediction. The proposed approach is evaluated on a dataset that includes features extracted from both static and dynamic malware analysis and consists of 13 imbalanced families of Android malware. Furthermore, the results of the proposed technique are compared with those of C4.5, Random Forest and CSTree to determine its effectiveness in classifying the families of malicious applications when considering only static features, only dynamic features, and their hybrid. The experimental results show that CSForest performs better than the other algorithms in handling the imbalanced family classification of Android malicious applications when the hybrid set of features is considered. It achieves the highest F-measure, 0.919, with a minimum total cost of 180.
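To illustrate how cost-sensitive voting over a group of decision trees can differ from plain majority voting, the following is a minimal sketch and not the CSForest procedure itself. It assumes each tree outputs class probabilities and that the forest predicts the class with the lowest expected misclassification cost. The function names, the cost matrix, and the example values are all hypothetical.

```python
from typing import List, Sequence


def cost_sensitive_vote(tree_probs: Sequence[Sequence[float]],
                        cost: Sequence[Sequence[float]]) -> int:
    """Return the class index with the lowest expected cost.

    tree_probs[t][c] is the probability tree t assigns to class c.
    cost[i][j] is the (assumed) cost of predicting class j when the
    true class is i; the diagonal is normally zero.
    """
    n_classes = len(cost)
    n_trees = len(tree_probs)
    # Average the per-tree class probabilities across the forest.
    avg: List[float] = [
        sum(p[c] for p in tree_probs) / n_trees for c in range(n_classes)
    ]
    # Expected cost of predicting class j under the averaged probabilities.
    expected: List[float] = [
        sum(avg[i] * cost[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]
    return min(range(n_classes), key=expected.__getitem__)


def total_cost(y_true: Sequence[int], y_pred: Sequence[int],
               cost: Sequence[Sequence[float]]) -> float:
    """Sum the misclassification costs over all predictions."""
    return sum(cost[t][p] for t, p in zip(y_true, y_pred))


# Hypothetical two-class example: class 1 is a rare family, and
# missing it (predicting 0 when the truth is 1) costs 5 times more.
example_cost = [[0, 1],
                [5, 0]]
example_probs = [[0.7, 0.3], [0.6, 0.4], [0.8, 0.2]]
chosen = cost_sensitive_vote(example_probs, example_cost)
```

In this toy example every tree favours class 0, so majority voting would choose class 0. The cost-weighted vote chooses class 1 instead, because misclassifying the rare class is assumed to be more expensive. A total cost such as the one reported above would be obtained by summing entries of a cost matrix over all test predictions, as `total_cost` does here.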
Cite this article
Dhalaria, M., Gandotra, E. CSForest: an approach for imbalanced family classification of android malicious applications. Int. j. inf. tecnol. 13, 1059–1071 (2021). https://doi.org/10.1007/s41870-021-00661-7
DOI: https://doi.org/10.1007/s41870-021-00661-7